Measuring Statistical Dependence Via the Mutual Information Dimension

Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Mahito Sugiyama (1) and Karsten M. Borgwardt (1, 2)

(1) Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
(2) Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Germany

Abstract

We propose to measure statistical dependence between two random variables by the mutual information dimension (MID), and present a scalable parameter-free estimation method for this task. Supported by sound dimension theory, our method gives an effective solution to the problem of detecting interesting relationships of variables in massive data, which is nowadays a heavily studied topic in many scientific disciplines. Unlike the classical Pearson's correlation coefficient, MID is zero if and only if two random variables are statistically independent, and it is translation and scaling invariant. We experimentally show superior performance of MID in detecting various types of relationships in the presence of noise data. Moreover, we illustrate that MID can be effectively used for feature selection in regression.

1 Introduction

How to measure dependence of variables is a classical yet fundamental problem in statistics. Starting with Galton's work on Pearson's correlation coefficient [Stigler, 1989] for measuring linear dependence, many techniques have been proposed, which are of fundamental importance in scientific fields such as physics, chemistry, biology, and economics.

Machine learning and statistics have defined a number of techniques over the last decade which are designed to measure not only linear but also nonlinear dependences [Hastie et al., 2009]. Examples include kernel-based [Bach and Jordan, 2003; Gretton et al., 2005], mutual information-based [Kraskov et al., 2004; Steuer et al., 2002], and distance-based [Székely et al., 2007; Székely and Rizzo, 2009] methods. Their main limitation in practice, however, is the lack of scalability or that one has to specify the type of nonlinear relationship one is interested in beforehand, which requires non-trivial parameter selection.

Recently, in Science, a distinct method called the maximal information coefficient (MIC) has been proposed by Reshef et al. [2011] (further analyzed in [Reshef et al., 2013]) that measures any kind of relationship between two continuous variables. They use the mutual information obtained by discretizing data and, intuitively, MIC is the maximum mutual information across a set of discretization levels.

However, it has some significant drawbacks: First, MIC depends on the input parameter $B(n)$, which is a natural number specifying the maximum size of a grid used for discretization of data to obtain the entropy [Reshef et al., 2011, SOM 2.2.1]. This means that MIC becomes too small if we choose small $B(n)$ and too large if we choose large $B(n)$. Second, it has high computational cost, as it is exponential with respect to the number of data points (since computing the exact MIC is usually infeasible, Reshef et al. used heuristic dynamic programming for efficient approximation), and it is not suitable for large datasets. Third, as pointed out by Simon and Tibshirani [2012], it does not work well for relationship discovery in the presence of noise.

Here we propose to measure dependence between two random variables by the mutual information dimension, or MID, to overcome the above drawbacks of MIC and other machine learning based techniques. First, it contains no parameter in theory and the estimation method proposed in this paper is also parameter-free. Second, its estimation is fast; the average-case time complexity is $O(n \log n)$, where $n$ is the number of data points. Third, MID is experimentally shown to be more robust to uniformly distributed noise data than MIC and other methods.

The definition of MID is simple:

$$\mathrm{MID}(X; Y) := \dim X + \dim Y - \dim XY$$

for two random variables $X$ and $Y$, where $\dim X$ and $\dim Y$ are the information dimensions of the random variables $X$ and $Y$, respectively, and $\dim XY$ is that of the joint distribution of $X$ and $Y$. The information dimension is one of the fractal dimensions [Ott, 2002] introduced by Rényi [1959; 1970], and its links to information theory were recently studied [Wu and Verdú, 2010; 2011]. Although MID itself is not a new concept, this is the first study that introduces MID as a measure of statistical dependence between random variables; to date, MID has only been used for chaotic time series analysis [Buzug et al., 1994; Prichard and Theiler, 1995].

MID has desirable properties as a measure of dependence: For every pair of random variables $X$ and $Y$, (1) $\mathrm{MID}(X; Y) = \mathrm{MID}(Y; X)$ and $0 \le \mathrm{MID}(X; Y) \le 1$; (2) $\mathrm{MID}(X; Y) = 0$ if and only if $X$ and $Y$ are statistically independent (Theorem 1); and (3) MID is invariant with respect to translation and scaling (Theorem 3). Furthermore, MID is related to MIC and can be viewed as an extension of it (see Section 2.4).

Figure 1: Three intuitive examples. MID is one for a linear relationship (left), and MID is zero for independent relationships (center and right). Panel values: left, $\dim X = 1$, $\dim Y = 1$, $\dim XY = 1$, so MID = 1 + 1 - 1 = 1; center, $\dim X = 1$, $\dim Y = 1$, $\dim XY = 2$, so MID = 1 + 1 - 2 = 0; right, $\dim X = 1$, $\dim Y = 0$, $\dim XY = 1$, so MID = 1 + 0 - 1 = 0.

To estimate MID from a dataset, we construct an efficient parameter-free method. Although the general strategy is the same as the standard method used for estimation of the box-counting dimension [Falconer, 2003], we aim to remove all parameters from the method using the sliding window strategy, where the width of a window is adaptively determined from the number of data points. The average-case and the worst-case complexities of our method are $O(n \log n)$ and $O(n^2)$, respectively, where $n$ is the number of data points. This is much faster than the estimation algorithm of MIC and the other state-of-the-art methods such as the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005] and the distance correlation [Székely et al., 2007], whose time complexities are $O(n^2)$. Hence MID scales up to massive datasets with millions of data points.

This paper is organized as follows: Section 2 introduces MID and analyzes it theoretically. Section 3 describes a practical estimation method of MID. The experimental results are presented in Section 4, followed by the conclusion in Section 5.

2 Mutual Information Dimension

In fractal and chaos theory, dimension has a crucial role since it represents the complexity of an object based on a "measurement" of it. We employ the information dimension in this paper, which belongs to a larger family of fractal dimensions [Falconer, 2003; Ott, 2002].

In the following, let $\mathbb{N}$ be the set of natural numbers including 0, $\mathbb{Z}$ the set of integers, and $\mathbb{R}$ the set of real numbers. The base of the logarithm is 2 throughout this paper.

We divide the real line $\mathbb{R}$ into intervals of the same width to obtain the entropy of a discretized variable. Formally,

$$G_k(z) := [z, z + 1) \cdot 2^{-k} = \left\{ x \in \mathbb{R} \;\middle|\; \frac{z}{2^k} \le x < \frac{z + 1}{2^k} \right\}$$

for an integer $z \in \mathbb{Z}$. We call the resulting system $G_k = \{ G_k(z) \mid z \in \mathbb{Z} \}$ the partition of $\mathbb{R}$ at level $k$. The partition for the two-dimensional space is constructed from $G_k$ as $G_k^2 = \{ G_k(z_1) \times G_k(z_2) \mid z_1, z_2 \in \mathbb{Z} \}$.

2.1 Information Dimension

Given a real-valued random variable $X$, we construct for each level $k$ the discretized random variable $X_k$ over $\mathbb{Z}$, where $X_k = z$ exactly when $X \in G_k(z)$, and we write $p_k$ for its probability mass function.

Definition 1 (Information Dimension [Rényi, 1959]) The information dimension of $X$ is defined as

$$\dim X := \lim_{k \to \infty} \frac{H(X_k)}{-\log 2^{-k}} = \lim_{k \to \infty} \frac{H(X_k)}{k},$$

where $H(X_k)$ denotes the entropy of $X_k$, defined by $H(X_k) = -\sum_{x \in \mathbb{Z}} p_k(x) \log p_k(x)$.

The information dimension for a pair of two real-valued variables $X$ and $Y$ is naturally defined as $\dim XY := \lim_{k \to \infty} H(X_k, Y_k) / k$, where $H(X_k, Y_k) = -\sum_{x \in \mathbb{Z}} \sum_{y \in \mathbb{Z}} p_k(x, y) \log p_k(x, y)$ is the joint entropy of $X_k$ and $Y_k$. Informally, the information dimension indicates how much a variable fills the space, and this property enables us to measure the statistical dependence. Notice that

$$0 \le \dim X \le 1, \quad 0 \le \dim Y \le 1, \quad \text{and} \quad 0 \le \dim XY \le 2$$

hold since $0 \le H(X_k) \le k$ for each $k$ and

$$0 \le H(X_k, Y_k) \le H(X_k) + H(Y_k) \le 2k.$$

In this paper, we always assume that $\dim X$ and $\dim Y$ exist and that $X$ and $Y$ are Borel-measurable. Our formulation applies to pairs of continuous random variables $X$ and $Y$.

2.2 Mutual Information Dimension

Based on the information dimension, the mutual information dimension is defined in an analogous fashion to the mutual information.

Definition 2 (Mutual Information Dimension) For a pair of random variables $X$ and $Y$, the mutual information dimension, or MID, is defined as

$$\mathrm{MID}(X; Y) := \dim X + \dim Y - \dim XY.$$

We can easily check that MID is also given by

$$\mathrm{MID}(X; Y) = \lim_{k \to \infty} \frac{I(X_k; Y_k)}{k} \qquad (1)$$

with the mutual information $I(X_k; Y_k)$ of $X_k$ and $Y_k$ defined as

$$I(X_k; Y_k) = \sum_{x \in \mathbb{Z}} \sum_{y \in \mathbb{Z}} p_k(x, y) \log \frac{p_k(x, y)}{p_k(x)\, p_k(y)}.$$
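To make the level-k quantities concrete, the following minimal sketch evaluates the ratio I(X_k; Y_k) / k from Equation (1) on a sample, at one fixed level k. It is not the estimation method of the paper (that method is described in Section 3 and is parameter-free); here k is chosen by hand, the entropies are naive plug-in estimates, and the names `cell`, `entropy`, and `level_k_ratio` are hypothetical helpers introduced only for illustration.

```python
import math
import random
from collections import Counter


def cell(x: float, k: int) -> int:
    """Index z of the level-k interval G_k(z) = [z / 2**k, (z + 1) / 2**k) containing x."""
    return math.floor(x * 2 ** k)


def entropy(counts: Counter) -> float:
    """Plug-in Shannon entropy (base 2) of the empirical distribution over occupied cells."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def level_k_ratio(xs, ys, k: int) -> float:
    """Plug-in value of I(X_k; Y_k) / k at a single fixed level k (an assumption of this sketch)."""
    zx = [cell(x, k) for x in xs]
    zy = [cell(y, k) for y in ys]
    h_x = entropy(Counter(zx))            # H(X_k)
    h_y = entropy(Counter(zy))            # H(Y_k)
    h_xy = entropy(Counter(zip(zx, zy)))  # H(X_k, Y_k)
    # I(X_k; Y_k) = H(X_k) + H(Y_k) - H(X_k, Y_k)
    return (h_x + h_y - h_xy) / k


random.seed(0)
n = 20000
xs = [random.random() for _ in range(n)]
noise = [random.random() for _ in range(n)]

# The three situations of Figure 1: linear, independent, and constant Y.
print("linear      :", level_k_ratio(xs, xs, 4))
print("independent :", level_k_ratio(xs, noise, 4))
print("constant Y  :", level_k_ratio(xs, [0.5] * n, 4))
```

On samples like these the linear case should come out close to 1 and the other two close to 0, in line with Figure 1. The independent case is slightly above 0 because plug-in entropy estimates are biased on finite samples, and that bias grows quickly with k; a fixed level is therefore only a rough illustration of the limit in Equation (1).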