Fundamentals of Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Independent Vector Analysis (IVA)


Dr Mohsen Naqvi
Advanced Signal Processing Group, School of Electrical and Electronic Engineering, Newcastle University
[email protected]
UDRC Summer School

Acknowledgement and recommended text: Hyvarinen et al., Independent Component Analysis, Wiley, 2001.

Source Separation and Component Analysis

In the mixing process, an unknown mixing system H maps the sources s_1, ..., s_n to the observed mixtures x_1, ..., x_m. In the un-mixing process, a known (optimized) system W maps the mixtures to the outputs y_1, ..., y_n, which should be independent.

Mixing model:     x = Hs
Un-mixing model:  y = Wx = WHs = PDs

where P is a permutation matrix and D is a diagonal scaling matrix, reflecting the ordering and scaling ambiguities of source separation.

Principal Component Analysis (PCA)

PCA is the orthogonal projection of the data onto a lower-dimensional linear space that:
i.  maximizes the variance of the projected data;
ii. minimizes the mean squared distance between the data points and their projections.

Example (2-D Gaussian data): the 1st PC lies in the direction of the largest variance, the 2nd PC in the direction of the second largest variance, and each PC is orthogonal to the other components.

• In PCA the redundancy is measured by the correlation between data elements.
• Using only the correlations, as in PCA, has the advantage that the analysis can be based on second-order statistics (SOS).

In PCA the data are first centered by subtracting the mean:

x = x - E{x}

PCA by Variance Maximization

The data x are linearly transformed:

y_1 = Σ_{j=1}^{n} w_{j1} x_j = w_1^T x,   subject to ||w_1|| = 1

Mean of the first transformed component:

E{y_1} = E{w_1^T x} = w_1^T E{x} = 0

Variance of the first transformed component:

E{y_1^2} = E{(w_1^T x)^2} = E{(w_1^T x)(x^T w_1)} = w_1^T E{xx^T} w_1 = w_1^T R_xx w_1   (1)

The correlation matrix R_xx is symmetric, so a^T R_xx b = b^T R_xx a. R_xx is not in our control, it depends on the data x; what we can control is w.

At the maximum of the variance, a small change Δw in w will not affect the variance:

E{y_1^2} = (w_1 + Δw)^T R_xx (w_1 + Δw) = w_1^T R_xx w_1 + 2Δw^T R_xx w_1 + Δw^T R_xx Δw

Since Δw is a very small quantity, the last term can be neglected:

E{y_1^2} = w_1^T R_xx w_1 + 2Δw^T R_xx w_1   (2)

Comparing (1) and (2), we can write

Δw^T R_xx w_1 = 0   (3)

We also know that ||w_1 + Δw|| = 1, therefore

(w_1 + Δw)^T (w_1 + Δw) = 1
w_1^T w_1 + 2Δw^T w_1 + Δw^T Δw = 1

and, again because Δw is very small,

1 + 2Δw^T w_1 + 0 = 1   ⟹   Δw^T w_1 = 0   (4)

This result shows that w_1 and Δw are orthogonal to each other.

Comparing (3) and (4): the perturbation Δw can be any small vector orthogonal to w_1, and (3) says R_xx w_1 is orthogonal to every such Δw, so R_xx w_1 must be a scalar multiple of w_1. Writing this scalar as λ_1,

Δw^T R_xx w_1 - Δw^T λ_1 w_1 = 0
Δw^T (R_xx w_1 - λ_1 w_1) = 0,   Δw ≠ 0

∴ R_xx w_1 = λ_1 w_1   →   R_xx w_i = λ_i w_i,   i = 1, 2, ..., n   (5)

where λ_1, λ_2, ..., λ_n are the eigenvalues of R_xx and w_1, w_2, ..., w_n are the corresponding eigenvectors.

Order the eigenvalues as λ_1 = λ_max > λ_2 > ... > λ_n and collect the eigenvectors into E = [w_1, w_2, ..., w_n]. Then

R_xx E = E Λ   (6)

where Λ = diag(λ_1, λ_2, ..., λ_n). Since E is an orthogonal matrix, E^T E = I, and pre-multiplying (6) by E^T gives

E^T R_xx E = E^T E Λ = I Λ = Λ   (7)

In expanded form, (7) is

w_i^T R_xx w_j = λ_i  if i = j,   and  0  if i ≠ j

R_xx is a symmetric, positive definite matrix and can be represented as [Hyvarinen et al.]

R_xx = Σ_{i=1}^{n} λ_i w_i w_i^T

and the eigenvectors satisfy

w_i^T w_j = 1  if i = j,   and  0  if i ≠ j

Finally, (1) becomes

E{y_i^2} = w_i^T R_xx w_i = λ_i,   i = 1, 2, ..., n   (8)

Can we define the principal components now? In a multivariate Gaussian probability density, the principal components are in the directions of the eigenvectors w_i, and the respective variances are the eigenvalues λ_i. There are n possible solutions for w, and

y_i = w_i^T x = x^T w_i,   i = 1, 2, ..., n
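The eigendecomposition result (5)-(8) maps directly onto a few lines of code. The sketch below is illustrative only and not part of the original slides; the synthetic 2-D Gaussian data, the variable names (X, R_xx, E) and the use of NumPy are my own assumptions. It centers the data, forms R_xx, and checks that the variances of the projected components equal the eigenvalues, as in (8).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D Gaussian data (samples as columns); H_true is an assumed test mixing
H_true = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
X = H_true @ rng.standard_normal((2, 5000))

# Center the data: x <- x - E{x}
X = X - X.mean(axis=1, keepdims=True)

# Correlation matrix R_xx = E{x x^T}
R_xx = (X @ X.T) / X.shape[1]

# Eigendecomposition R_xx w_i = lambda_i w_i (eigh since R_xx is symmetric),
# eigenvalues sorted in decreasing order so that lambda_1 = lambda_max
eigvals, E = np.linalg.eigh(R_xx)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# Principal components y_i = w_i^T x; their variances should equal the eigenvalues
Y = E.T @ X
print("eigenvalues      :", eigvals)
print("variances of y_i :", Y.var(axis=1))
```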
Whitening

Data whitening is another form of PCA. The objective of whitening is to transform the observed vector x into z = Vx whose components are uncorrelated and whose covariance equals the identity matrix:

E{zz^T} = I

The un-mixing matrix W can be decomposed into two components,

W = UV

where U is a rotation matrix and V is the whitening matrix

V = Λ^{-1/2} E^T

Checking that z = Vx is indeed white:

E{zz^T} = V E{xx^T} V^T = Λ^{-1/2} E^T R_xx E Λ^{-1/2} = Λ^{-1/2} Λ Λ^{-1/2} = I

The covariance is the unit matrix, therefore z is whitened. The whitening matrix V is not unique: it can be pre-multiplied by any orthogonal matrix to obtain another valid version of V.
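As a concrete check of the whitening step, the short sketch below (again my own illustrative code, not from the slides; the uniform sources, the mixing matrix H and the sample size are assumed) builds V = Λ^{-1/2} E^T from the eigendecomposition of R_xx and verifies that E{zz^T} ≈ I.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two uniform (non-Gaussian) sources with zero mean and unit variance, assumed mixing H
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10000))
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = H @ S                                   # observed mixtures x = Hs

# Center and form R_xx = E{x x^T}
X = X - X.mean(axis=1, keepdims=True)
R_xx = (X @ X.T) / X.shape[1]

# Eigendecomposition R_xx = E Lambda E^T and whitening matrix V = Lambda^{-1/2} E^T
lam, E = np.linalg.eigh(R_xx)
V = np.diag(lam ** -0.5) @ E.T

# Whitened data: the sample covariance should be (approximately) the identity matrix
Z = V @ X
print(np.round((Z @ Z.T) / Z.shape[1], 3))  # ~ [[1, 0], [0, 1]]
```

Because V is only unique up to an orthogonal factor, whitening leaves an unknown rotation U in W = UV; estimating that rotation is precisely what ICA has to do.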
Limitations

PCA only deals with second-order statistics and provides only decorrelation; uncorrelated components are not necessarily independent.

Example: two components with uniform distributions and their mixture. [Figures: the joint distribution of the ICs s_1 (horizontal axis) and s_2 (vertical axis) with uniform distributions, and the joint distribution of the observed mixtures x_1 (horizontal axis) and x_2 (vertical axis).]

After PCA (whitening) of the mixtures of the uniformly distributed ICs, the components are decorrelated, but PCA does not find the original coordinates. This is the factor rotation problem.

Uncorrelatedness and independence. If two zero-mean random variables s_1 and s_2 are uncorrelated, their covariance is zero:

cov(s_1, s_2) = E{s_1 s_2} = 0

If s_1 and s_2 are independent, then for any two functions g_1 and g_2,

E{g_1(s_1) g_2(s_2)} = E{g_1(s_1)} E{g_2(s_2)}

If random variables are independent, then they are also uncorrelated. The converse does not hold: let s_1 and s_2 be discrete valued and distributed such that the pair takes, each with probability 1/4, one of the values (0,1), (0,-1), (1,0) and (-1,0). The pair is uncorrelated, but

E{s_1^2 s_2^2} = 0 ≠ 1/4 = E{s_1^2} E{s_2^2}

Therefore, uncorrelated components are not independent. However, uncorrelated components of a Gaussian distribution are independent; for Gaussian data, the distribution after PCA is the same as the distribution before mixing.

Independent Component Analysis (ICA)

Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA separates the sources at each frequency bin.

The fundamental restrictions in ICA are:

i. The sources are assumed to be statistically independent of each other. Mathematically, independence implies that the joint probability density function p(s) of the sources can be factorized as

p(s) = Π_{i=1}^{n} p_i(s_i)

where p_i(s_i) is the marginal distribution of the i-th source.

ii. All but one of the sources must have non-Gaussian distributions. Why? The joint pdf of two Gaussian ICs, s_1 and s_2, is

p(s_1, s_2) = (1/2π) exp(-(s_1^2 + s_2^2)/2) = (1/2π) exp(-||s||^2/2)

If the mixing matrix H is orthogonal, the joint pdf of the mixtures x_1 and x_2 is

p(x_1, x_2) = (1/2π) exp(-||H^T x||^2/2) |det H^T|

Since H is orthogonal, ||H^T x||^2 = ||x||^2 and |det H^T| = 1, so

p(x_1, x_2) = (1/2π) exp(-||x||^2/2)

An orthogonal mixing matrix does not change the pdf: the original and mixed distributions are identical. Therefore, there is no way to infer the mixing matrix from the mixtures.

iii. In general, the mixing matrix H is square (n = m) and invertible.

iv. Methods to realize ICA are more sensitive to data length than the methods based on SOS.

Mainly, ICA relies on two steps (a code sketch illustrating them is given at the end of this section):

• a statistical criterion, e.g. a non-Gaussianity measure, expressed in terms of a cost/contrast function J(y);
• an optimization technique to carry out the minimization or maximization of the cost function J(y).

1. Minimization of Mutual Information (Minimization of Kullback-Leibler Divergence)
2.
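To tie the two steps together, here is a minimal end-to-end sketch. It is my own illustrative example rather than material from the slides: kurtosis is assumed as the non-Gaussianity contrast J(y), a FastICA-style fixed-point iteration stands in for the optimization step, and the uniform sources and mixing matrix H are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 0: two independent uniform (non-Gaussian) sources and an assumed mixing matrix
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))
H = np.array([[0.8, 0.3],
              [0.2, 0.9]])
X = H @ S

# Whitening (as in the PCA section): z = Vx with E{zz^T} = I
X = X - X.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh((X @ X.T) / X.shape[1])
Z = np.diag(lam ** -0.5) @ E.T @ X

# Step 1: contrast function J(y) = |kurtosis(y)| as a non-Gaussianity measure
def kurtosis(y):
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

# Step 2: optimization, here a FastICA-style fixed-point iteration on |kurt(w^T z)|
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    y = w @ Z
    w_new = (Z * y ** 3).mean(axis=1) - 3 * w   # fixed-point update E{z (w^T z)^3} - 3w
    w_new /= np.linalg.norm(w_new)
    converged = abs(w_new @ w) > 1 - 1e-8       # convergence up to a sign flip
    w = w_new
    if converged:
        break

print("estimated un-mixing direction      :", w)
print("|kurtosis| of recovered component  :", abs(kurtosis(w @ Z)))
```

A second component could be recovered by repeating the iteration in the subspace orthogonal to the first estimate (deflation); the sketch keeps a single unit for brevity.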