Fundamentals of Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Independent Vector Analysis (IVA)


Dr Mohsen Naqvi
Advanced Signal Processing Group, School of Electrical and Electronic Engineering, Newcastle University
[email protected]
UDRC Summer School

Acknowledgement and recommended text: Hyvarinen et al., Independent Component Analysis, Wiley, 2001.

Source Separation and Component Analysis

In the mixing process, an unknown mixing system H maps the sources s_1, ..., s_n to the observed mixtures x_1, ..., x_m. In the un-mixing process, a known (optimized) system W maps the mixtures to the outputs y_1, ..., y_n, which should be independent.

Mixing model:     x = Hs
Un-mixing model:  y = Wx = WHs = PDs

where P is a permutation matrix and D is a diagonal scaling matrix, reflecting the ordering and scaling ambiguities of source separation.

Principal Component Analysis (PCA)

PCA is the orthogonal projection of the data onto a lower-dimensional linear space that:
i.  maximizes the variance of the projected data;
ii. minimizes the mean squared distance between the data points and their projections.

Example (2-D Gaussian data): the 1st PC lies in the direction of the largest variance, the 2nd PC in the direction of the second largest variance, and each PC is orthogonal to the other components.

• In PCA the redundancy is measured by the correlation between data elements.
• Using only the correlations, as in PCA, has the advantage that the analysis can be based on second-order statistics (SOS).

In PCA the data are first centered by subtracting the mean:

x = x - E{x}

PCA by Variance Maximization

The data x are linearly transformed:

y_1 = Σ_{j=1}^{n} w_{j1} x_j = w_1^T x,   subject to ||w_1|| = 1

Mean of the first transformed component:

E{y_1} = E{w_1^T x} = w_1^T E{x} = 0

Variance of the first transformed component:

E{y_1^2} = E{(w_1^T x)^2} = E{(w_1^T x)(x^T w_1)} = w_1^T E{xx^T} w_1 = w_1^T R_xx w_1   (1)

The correlation matrix R_xx is symmetric, so a^T R_xx b = b^T R_xx a. R_xx is not in our control, it depends on the data x; what we can control is w.

At the maximum of the variance, a small change Δw in w will not affect the variance:

E{y_1^2} = (w_1 + Δw)^T R_xx (w_1 + Δw) = w_1^T R_xx w_1 + 2Δw^T R_xx w_1 + Δw^T R_xx Δw

Since Δw is a very small quantity, the last term can be neglected:

E{y_1^2} = w_1^T R_xx w_1 + 2Δw^T R_xx w_1   (2)

Comparing (1) and (2), we can write

Δw^T R_xx w_1 = 0   (3)

We also know that ||w_1 + Δw|| = 1, therefore

(w_1 + Δw)^T (w_1 + Δw) = 1
w_1^T w_1 + 2Δw^T w_1 + Δw^T Δw = 1

and, again because Δw is very small,

1 + 2Δw^T w_1 + 0 = 1   ⟹   Δw^T w_1 = 0   (4)

This result shows that w_1 and Δw are orthogonal to each other.

Comparing (3) and (4): the perturbation Δw can be any small vector orthogonal to w_1, and (3) says R_xx w_1 is orthogonal to every such Δw, so R_xx w_1 must be a scalar multiple of w_1. Writing this scalar as λ_1,

Δw^T R_xx w_1 - Δw^T λ_1 w_1 = 0
Δw^T (R_xx w_1 - λ_1 w_1) = 0,   Δw ≠ 0

∴ R_xx w_1 = λ_1 w_1   →   R_xx w_i = λ_i w_i,   i = 1, 2, ..., n   (5)

where λ_1, λ_2, ..., λ_n are the eigenvalues of R_xx and w_1, w_2, ..., w_n are the corresponding eigenvectors.

Order the eigenvalues as λ_1 = λ_max > λ_2 > ... > λ_n and collect the eigenvectors into E = [w_1, w_2, ..., w_n]. Then

R_xx E = E Λ   (6)

where Λ = diag(λ_1, λ_2, ..., λ_n). Since E is an orthogonal matrix, E^T E = I, and pre-multiplying (6) by E^T gives

E^T R_xx E = E^T E Λ = I Λ = Λ   (7)

In expanded form, (7) is

w_i^T R_xx w_j = λ_i  if i = j,   and  0  if i ≠ j

R_xx is a symmetric, positive definite matrix and can be represented as [Hyvarinen et al.]

R_xx = Σ_{i=1}^{n} λ_i w_i w_i^T

and the eigenvectors satisfy

w_i^T w_j = 1  if i = j,   and  0  if i ≠ j

Finally, (1) becomes

E{y_i^2} = w_i^T R_xx w_i = λ_i,   i = 1, 2, ..., n   (8)

Can we define the principal components now? In a multivariate Gaussian probability density, the principal components are in the directions of the eigenvectors w_i, and the respective variances are the eigenvalues λ_i. There are n possible solutions for w, and

y_i = w_i^T x = x^T w_i,   i = 1, 2, ..., n
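The eigendecomposition result (5)-(8) maps directly onto a few lines of code. The sketch below is illustrative only and not part of the original slides; the synthetic 2-D Gaussian data, the variable names (X, R_xx, E) and the use of NumPy are my own assumptions. It centers the data, forms R_xx, and checks that the variances of the projected components equal the eigenvalues, as in (8).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D Gaussian data (samples as columns); H_true is an assumed test mixing
H_true = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
X = H_true @ rng.standard_normal((2, 5000))

# Center the data: x <- x - E{x}
X = X - X.mean(axis=1, keepdims=True)

# Correlation matrix R_xx = E{x x^T}
R_xx = (X @ X.T) / X.shape[1]

# Eigendecomposition R_xx w_i = lambda_i w_i (eigh since R_xx is symmetric),
# eigenvalues sorted in decreasing order so that lambda_1 = lambda_max
eigvals, E = np.linalg.eigh(R_xx)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# Principal components y_i = w_i^T x; their variances should equal the eigenvalues
Y = E.T @ X
print("eigenvalues      :", eigvals)
print("variances of y_i :", Y.var(axis=1))
```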
Whitening

Data whitening is another form of PCA. The objective of whitening is to transform the observed vector x into z = Vx whose components are uncorrelated and whose covariance equals the identity matrix:

E{zz^T} = I

The un-mixing matrix W can be decomposed into two components,

W = UV

where U is a rotation matrix and V is the whitening matrix

V = Λ^{-1/2} E^T

Checking that z = Vx is indeed white:

E{zz^T} = V E{xx^T} V^T = Λ^{-1/2} E^T R_xx E Λ^{-1/2} = Λ^{-1/2} Λ Λ^{-1/2} = I

The covariance is the unit matrix, therefore z is whitened. The whitening matrix V is not unique: it can be pre-multiplied by any orthogonal matrix to obtain another valid version of V.
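As a concrete check of the whitening step, the short sketch below (again my own illustrative code, not from the slides; the uniform sources, the mixing matrix H and the sample size are assumed) builds V = Λ^{-1/2} E^T from the eigendecomposition of R_xx and verifies that E{zz^T} ≈ I.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two uniform (non-Gaussian) sources with zero mean and unit variance, assumed mixing H
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 10000))
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = H @ S                                   # observed mixtures x = Hs

# Center and form R_xx = E{x x^T}
X = X - X.mean(axis=1, keepdims=True)
R_xx = (X @ X.T) / X.shape[1]

# Eigendecomposition R_xx = E Lambda E^T and whitening matrix V = Lambda^{-1/2} E^T
lam, E = np.linalg.eigh(R_xx)
V = np.diag(lam ** -0.5) @ E.T

# Whitened data: the sample covariance should be (approximately) the identity matrix
Z = V @ X
print(np.round((Z @ Z.T) / Z.shape[1], 3))  # ~ [[1, 0], [0, 1]]
```

Because V is only unique up to an orthogonal factor, whitening leaves an unknown rotation U in W = UV; estimating that rotation is precisely what ICA has to do.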
Limitations

PCA only deals with second-order statistics and provides only decorrelation; uncorrelated components are not necessarily independent.

Example: two components with uniform distributions and their mixture. [Figures: the joint distribution of the ICs s_1 (horizontal axis) and s_2 (vertical axis) with uniform distributions, and the joint distribution of the observed mixtures x_1 (horizontal axis) and x_2 (vertical axis).]

After PCA (whitening) of the mixtures of the uniformly distributed ICs, the components are decorrelated, but PCA does not find the original coordinates. This is the factor rotation problem.

Uncorrelatedness and independence. If two zero-mean random variables s_1 and s_2 are uncorrelated, their covariance is zero:

cov(s_1, s_2) = E{s_1 s_2} = 0

If s_1 and s_2 are independent, then for any two functions g_1 and g_2,

E{g_1(s_1) g_2(s_2)} = E{g_1(s_1)} E{g_2(s_2)}

If random variables are independent, then they are also uncorrelated. The converse does not hold: let s_1 and s_2 be discrete valued and distributed such that the pair takes, each with probability 1/4, one of the values (0,1), (0,-1), (1,0) and (-1,0). The pair is uncorrelated, but

E{s_1^2 s_2^2} = 0 ≠ 1/4 = E{s_1^2} E{s_2^2}

Therefore, uncorrelated components are not independent. However, uncorrelated components of a Gaussian distribution are independent; for Gaussian data, the distribution after PCA is the same as the distribution before mixing.

Independent Component Analysis (ICA)

Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA separates the sources at each frequency bin.

The fundamental restrictions in ICA are:

i. The sources are assumed to be statistically independent of each other. Mathematically, independence implies that the joint probability density function p(s) of the sources can be factorized as

p(s) = Π_{i=1}^{n} p_i(s_i)

where p_i(s_i) is the marginal distribution of the i-th source.

ii. All but one of the sources must have non-Gaussian distributions. Why? The joint pdf of two Gaussian ICs, s_1 and s_2, is

p(s_1, s_2) = (1/2π) exp(-(s_1^2 + s_2^2)/2) = (1/2π) exp(-||s||^2/2)

If the mixing matrix H is orthogonal, the joint pdf of the mixtures x_1 and x_2 is

p(x_1, x_2) = (1/2π) exp(-||H^T x||^2/2) |det H^T|

Since H is orthogonal, ||H^T x||^2 = ||x||^2 and |det H^T| = 1, so

p(x_1, x_2) = (1/2π) exp(-||x||^2/2)

An orthogonal mixing matrix does not change the pdf: the original and mixed distributions are identical. Therefore, there is no way to infer the mixing matrix from the mixtures.

iii. In general, the mixing matrix H is square (n = m) and invertible.

iv. Methods to realize ICA are more sensitive to data length than the methods based on SOS.

Mainly, ICA relies on two steps (a code sketch illustrating them is given at the end of this section):

• a statistical criterion, e.g. a non-Gaussianity measure, expressed in terms of a cost/contrast function J(y);
• an optimization technique to carry out the minimization or maximization of the cost function J(y).

1. Minimization of Mutual Information (Minimization of Kullback-Leibler Divergence)
2.
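To tie the two steps together, here is a minimal end-to-end sketch. It is my own illustrative example rather than material from the slides: kurtosis is assumed as the non-Gaussianity contrast J(y), a FastICA-style fixed-point iteration stands in for the optimization step, and the uniform sources and mixing matrix H are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 0: two independent uniform (non-Gaussian) sources and an assumed mixing matrix
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))
H = np.array([[0.8, 0.3],
              [0.2, 0.9]])
X = H @ S

# Whitening (as in the PCA section): z = Vx with E{zz^T} = I
X = X - X.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh((X @ X.T) / X.shape[1])
Z = np.diag(lam ** -0.5) @ E.T @ X

# Step 1: contrast function J(y) = |kurtosis(y)| as a non-Gaussianity measure
def kurtosis(y):
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

# Step 2: optimization, here a FastICA-style fixed-point iteration on |kurt(w^T z)|
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    y = w @ Z
    w_new = (Z * y ** 3).mean(axis=1) - 3 * w   # fixed-point update E{z (w^T z)^3} - 3w
    w_new /= np.linalg.norm(w_new)
    converged = abs(w_new @ w) > 1 - 1e-8       # convergence up to a sign flip
    w = w_new
    if converged:
        break

print("estimated un-mixing direction      :", w)
print("|kurtosis| of recovered component  :", abs(kurtosis(w @ Z)))
```

A second component could be recovered by repeating the iteration in the subspace orthogonal to the first estimate (deflation); the sketch keeps a single unit for brevity.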