PCA Based on Mutual Information for Acoustic Environment Classification

PCA Based on Mutual Information for Acoustic Environment Classification Xueli FAN, Haihong FENG, Meng YUAN Shanghai Acoustics Laboratory, Chinese Academy of Sciences, Shanghai, China [email protected] Abstract needs of computation and data. On the other hand, redundancy among features will bring in false Principal Component Analysis (PCA) is a common information, increase computational complexity and method for feature selection. In order to enhance the cause over-fitting of learning methods. effect of selection, a Principal Component Analysis Principal Component Analysis (PCA)[10] is widely based on Mutual Information (PCAMI) algorithm is used for feature selection. PCA is to reduce the proposed. PCAMI introduces the category information, dimensionality of a data set consisting of a large and uses the sum of mutual information matrix between number of interrelated variables, while retaining as features under different acoustic environments instead much as possible of the variation present in the data of covariance matrix. The eigenvectors of the matrix set. This is achieved by transforming to a new set of represent the transformation coefficients. The variables, the principal components (PCs). PCs are eigenvalues of the matrix are used to calculate the uncorrelated and ordered with variance contribution of cumulative contribution rate to determine the number data so that the first few retain most of the variation of dimension. The experiment on acoustic environment present in all of the original variables. classification shows that PCAMI has better However, for the classification problems, PCA do dimensionality reduction results and higher not use the category information, which means in a classification accuracy using neuron network than dataset PCA obtains the same results no matter how PCA. many classes there are or no matter how to classify the data. The direction of maximum variance is got by 1. Introduction treating all data as one class, in which the difference among classes probably is not the largest. Signal processing technology pays more and more Shannon’s information theory[11] is more attention to the different acoustic environment (i.e. appropriate for classification, including information sound scene), which sets different parameters or entropy, mutual information, conditional mutual chooses variety strategies for different sound scene[1- information and so on, which provides a suitable 4]. Automatic identification of current acoustic formalism for quantifying the uncertainty of the event environment is important and has developed greatly in or arbitrary relationship between variables. many fields, such as the automatic program selection Consequently, the idea introducing information theory for digital hearing aids[5-7], which can identify the to feature selection field has become a hotspot in recent users’ environment and adjust signal processing years [12-14]. strategies or parameters automatically without This paper brings mutual information into PCA for switching by hand. acoustic environment classification. Instead of Acoustic environment classification as a common maximizing variance, the algorithm proposed application of pattern recognition[8] has two main calculates the mutual information between features functional blocks: feature selection and classifier. The under each class and sums them up, which takes class first one selects the most characterizing features and into account and considers the redundancy in features. the second one identifies which class the features It has been proved that this method yields more belong to. In many cases, feature selection is a critical appropriate PCs than PCA, and has better effect on problem to recognize the acoustic environment. On the dimension reduction. Using Neural Network as one hand, high dimension of feature will lead to “curse classifier in acoustic environment classification of dimensionality” problem[9]: approaches that are experiment, accuracy of correct classified is higher suitable for a low pattern dimensionality may become with our algorithm than PCA. unworkable for large dimension because of unrealistic In the following section the shortcoming of PCA is analyzed and an improved version of PCA is proposed. 978-1-4673-0174-9/12/$31.00 ©2012 IEEE 270 ICALIP2012 In section3, the proposed algorithm is applied to rate of PC1 is 91.2%, that is, the rate of largest acoustic environment classification problem to show eigenvalue λ of covariance matrix Σ of X to the sum of the effectiveness of the proposed method. And finally two eigenvalues is 91.2%. Source data and PC1 are conclusions follow in section 4. shown in Figure 1 and the length of PC1 is 15 times than original to show it clearly. 2. Formatting your paper In this section we will present PCA briefly and illustrate the limitation. Then a Principal Component Analysis based on Mutual Information (PCAMI) algorithm for feature selection is proposed. 2.1 PCA Suppose that x is a vector of p random variables, and PCA is interested in the variances of the p random variables and the structure of the covariance or Figure 1. Direction of the PC1 of PCA from correlations between the p variables. To get the PCs, Gaussian data the first step is to look for a linear function α1’x of the elements of x having maximum variance, where α1 is a It can be seen from the figure 1 that the variance of vector of p constants α11, α12, ···, α1p. Next, look for a data in the direction of vertical axis (feature1) is larger linear function α2’x, uncorrelated with α1’x having than that of horizontal axis (feature2), which the PC1 maximum variance, and so on, so that at the k-th stage from PCA conforms to. However, we cannot separate a linear function αk’x is found that has maximum the data from class1 and the data from class2 in the variance subject to being uncorrelated with α1’x, direction of PC1 due to the data mapped to the α2’x, ···, αk-1’x. The k-th derived variable αk’x is the k-th direction of PC1 are totally confused. However, if the PC. Consider the vector of random variables x has a same dataset is divided into two new classes: the data known covariance matrix Σ, which is the matrix whose of feature2 > 0 as class1 and the data of feature2 ≤ 0 as (i, j)-th element is the covariance between the i-th and class2, we also get the same PC1 by PCA. In this case, j-th elements of x when i ≠ j, and the variance of the j- we can identify these two classes in the direction of th element of x when i = j. It turns out that for k =1, PC1. So, the class information is important to get the 2, ···, p, the k-th PC is given by zk = αk’x where αk is an suitable PCs for classification and unfortunately PCA eigenvector of Σ corresponding to its k-th largest does not consider it. On the other side, variance is not eigenvalue λ. Furthermore, if αk is chosen to have unit the unique and apposite rule to get the suitable PCs. length (αk’ αk = 1), then var(zk) = λ , where var(zk) denotes the variance of zk. 2.3 PCA based on mutual information for Feature Selection 2.2 Limitation of PCA Mutual information measures a general dependence PCA considers all data in one class and finds the between two variables. The measurement is not just direction in which the distribution of the data is most appropriate for the random variables of linear disperse, that is, the variance of the data is largest, and correlation but also for the random variables of does not use the category information. Supposing two nonlinear correlation. If p(f1)，p(f2) are the marginal datasets with same data but different division, the PCs probability distribution of random variable F1 and F2 got from PCA are identical. These results are obvious respectively, and p(f1, f2) is the joint probability not appropriate. The method using covariance does not distribution, then the mutual information I(F1; F2) of consider the relationship between the feature and class them is: or the features within one class, which are important to identify the most characterizing features. () pf,f12 Assume that a two-dimension dataset X consists of IFF();log.= ∑∑ pf,f() (1) 12 ff 12 pf()() pf 200 data in two classes and each class has 100 data of 12 12 Gaussian distribution, whose average value and covariance matrix are [1,3]，[1,0;0,50] and [5,3]， It is symmetric in F1 and F2 and always nonnegative. [1,0;0,50] respectively. After PCA, the contribution The mutual information is equal to zero if and only if 271 F1 is independent on F2. So we can calculate the important PC corresponded to is. For example, the mutual information between features to evaluate the eigenvector β1 corresponding to the largest eigenvalue relationship between them. Considering the importance μ 1 multiplied by x is PC1, and the eigenvector of categories, conditional mutual information I(F1; corresponding to the k-th largest eigenvalue βk F2|C) is used to show the relationship between multiplied by x is k-th PC. Second calculate the features under each class. contribution rate θ of each PC, which is defined as the A new algorithm of Principle Component Analysis rate of eigenvalue corresponding to each PC to the sum based on Mutual Information (PCAMI) for feature of all eigenvalues: selection is proposed for improving the performance of PCA. PCAMI introduces the class factor and mutual μ θ = k information to PCA, which uses the sum of conditional k p . (4) mutual information matrix among the features when μ ∑ k class is given instead of covariance matrix in PCA. The k =1 PCs are obtained from the equation below: Then obtain the cumulative contribution δ, which is BB′Ψ=Λ, (2) defined as the sum of contribution rate.

PCA Based on Mutual Information for Acoustic Environment Classification

Distribution of Mutual Information

Lecture 3: Entropy, Relative Entropy, and Mutual Information 1 Notation 2

A Statistical Framework for Neuroimaging Data Analysis Based on Mutual Information Estimated Via a Gaussian Copula

Information Theory 1 Entropy 2 Mutual Information

Fundamentals of Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Independent Vector Analysis (IVA)

On the Estimation of Mutual Information

Weighted Mutual Information for Aggregated Kernel Clustering

Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis

Contents SYSTEMS GROUP

Lecture 2 Entropy and Mutual Information Instructor: Yichen Wang Ph.D./Associate Professor

Estimating the Mutual Information Between Two Discrete, Asymmetric Variables with Limited Samples

Relation Between the Kantorovich-Wasserstein Metric and the Kullback-Leibler Divergence