Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Measuring Statistical Dependence via the Mutual Information Dimension

Mahito Sugiyama¹ and Karsten M. Borgwardt¹,²
¹Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
²Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Germany
{mahito.sugiyama, karsten.borgwardt}@tuebingen.mpg.de

Abstract

We propose to measure statistical dependence between two random variables by the mutual information dimension (MID), and present a scalable parameter-free estimation method for this task. Supported by sound dimension theory, our method gives an effective solution to the problem of detecting interesting relationships of variables in massive data, which is nowadays a heavily studied topic in many scientific disciplines. Different from classical Pearson's correlation coefficient, MID is zero if and only if two random variables are statistically independent, and it is translation and scaling invariant. We experimentally show superior performance of MID in detecting various types of relationships in the presence of noise data. Moreover, we illustrate that MID can be effectively used for feature selection in regression.

1 Introduction

How to measure dependence of variables is a classical yet fundamental problem in statistics. Starting with Galton's work on Pearson's correlation coefficient [Stigler, 1989] for measuring linear dependence, many techniques have been proposed, which are of fundamental importance in scientific fields such as physics, chemistry, biology, and economics. Machine learning and statistics have defined a number of techniques over the last decade which are designed to measure not only linear but also nonlinear dependences [Hastie et al., 2009]. Examples include kernel-based [Bach and Jordan, 2003; Gretton et al., 2005], mutual information-based [Kraskov et al., 2004; Steuer et al., 2002], and distance-based [Székely et al., 2007; Székely and Rizzo, 2009] methods. Their main limitation in practice, however, is the lack of scalability, or that one has to specify the type of nonlinear relationship one is interested in beforehand, which requires non-trivial parameter selection.

Recently, in Science, a distinct method called the maximal information coefficient (MIC) has been proposed by Reshef et al. [2011] (further analyzed in [Reshef et al., 2013]) that measures any kind of relationship between two continuous variables. They use the mutual information obtained by discretizing data and, intuitively, MIC is the maximum mutual information across a set of discretization levels.

However, it has some significant drawbacks: First, MIC depends on the input parameter B(n), which is a natural number specifying the maximum size of a grid used for discretization of data to obtain the entropy [Reshef et al., 2011, SOM 2.2.1]. This means that MIC becomes too small if we choose a small B(n) and too large if we choose a large B(n). Second, it has high computational cost, as it is exponential with respect to the number of data points (since computing the exact MIC is usually infeasible, Reshef et al. use heuristic dynamic programming for efficient approximation), and it is not suitable for large datasets. Third, as pointed out by Simon and Tibshirani [2012], it does not work well for relationship discovery in the presence of noise.

Here we propose to measure dependence between two random variables by the mutual information dimension, or MID, to overcome the above drawbacks of MIC and other machine learning based techniques. First, it contains no parameter in theory, and the estimation method proposed in this paper is also parameter-free. Second, its estimation is fast; the average-case time complexity is O(n log n), where n is the number of data points. Third, MID is experimentally shown to be more robust to uniformly distributed noise data than MIC and other methods.

The definition of MID is simple:

MID(X; Y) := dim X + dim Y − dim XY

for two random variables X and Y, where dim X and dim Y are the information dimensions of the random variables X and Y, respectively, and dim XY is that of the joint distribution of X and Y. The information dimension is one of the fractal dimensions [Ott, 2002] introduced by Rényi [1959; 1970], and its links to information theory were recently studied [Wu and Verdú, 2010; 2011]. Although MID itself is not a new concept, this is the first study that introduces MID as a measure of statistical dependence between random variables; to date, MID has only been used for chaotic time series analysis [Buzug et al., 1994; Prichard and Theiler, 1995].

MID has desirable properties as a measure of dependence: For every pair of random variables X and Y, (1) MID(X; Y) = MID(Y; X) and 0 ≤ MID(X; Y) ≤ 1; (2) MID(X; Y) = 0 if and only if X and Y are statistically independent (Theorem 1); and (3) MID is invariant with respect to translation and scaling (Theorem 3). Furthermore, MID is related to MIC and can be viewed as an extension of it (see Section 2.4).

To estimate MID from a dataset, we construct an efficient parameter-free method. Although the general strategy is the same as the standard method used for estimation of the box-counting dimension [Falconer, 2003], we aim to remove all parameters from the method using the sliding window strategy, where the width of a window is adaptively determined from the number of data points. The average-case and the worst-case complexities of our method are O(n log n) and O(n²) with the number of data points n, respectively, which is much faster than the estimation algorithm of MIC and the other state-of-the-art methods such as the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005] and the distance correlation [Székely et al., 2007], whose time complexities are O(n²). Hence MID scales up to massive datasets with millions of data points.

This paper is organized as follows: Section 2 introduces MID and analyzes it theoretically. Section 3 describes a practical estimation method of MID. The experimental results are presented in Section 4, followed by the conclusion in Section 5.

2 Mutual Information Dimension

In fractal and chaos theory, dimension has a crucial role since it represents the complexity of an object based on a "measurement" of it. We employ the information dimension in this paper, which belongs to a larger family of fractal dimensions [Falconer, 2003; Ott, 2002].

In the following, let N be the set of natural numbers including 0, Z the set of integers, and R the set of real numbers. The base of the logarithm is 2 throughout this paper.

We divide the real line R into intervals of the same width to obtain the entropy of a discretized variable. Formally,

G_k(z) := [z, z + 1) · 2^{−k} = { x ∈ R | z/2^k ≤ x < (z + 1)/2^k }

for an integer z ∈ Z. We call the resulting system G_k = { G_k(z) | z ∈ Z } the partition of R at level k. The partition for the two-dimensional space is constructed from G_k as G_k² = { G_k(z1) × G_k(z2) | z1, z2 ∈ Z }.

2.1 Information Dimension

Given a real-valued X, we construct for each level k the discrete random variable X_k over Z, whose probability is given by Pr(X_k = z) = Pr(X ∈ G_k(z)) for each z ∈ Z. We denote the probability mass function of X_k by p_k(x) = Pr(X_k = x), and that of the joint probability by p_k(x, y) = Pr(X_k = x and Y_k = y).

We introduce the information dimension, which intuitively shows the complexity of a random variable X as the ratio comparing the change of the entropy to the change in scale.

Definition 1 (Information Dimension [Rényi, 1959]) The information dimension of X is defined as

dim X := lim_{k→∞} H(X_k) / (− log 2^{−k}) = lim_{k→∞} H(X_k) / k,

where H(X_k) denotes the entropy of X_k, defined by H(X_k) = − Σ_{x∈Z} p_k(x) log p_k(x).

The information dimension for a pair of two real-valued variables X and Y is naturally defined as dim XY := lim_{k→∞} H(X_k, Y_k) / k, where H(X_k, Y_k) = − Σ_{x∈Z} Σ_{y∈Z} p_k(x, y) log p_k(x, y) is the joint entropy of X_k and Y_k. Informally, the information dimension indicates how much a variable fills the space, and this property enables us to measure statistical dependence. Notice that

0 ≤ dim X ≤ 1, 0 ≤ dim Y ≤ 1, and 0 ≤ dim XY ≤ 2

hold since 0 ≤ H(X_k) ≤ k for each k and 0 ≤ H(X_k, Y_k) ≤ H(X_k) + H(Y_k) ≤ 2k.

In this paper, we always assume that dim X and dim Y exist and that X and Y are Borel-measurable. Our formulation applies to pairs of continuous random variables X and Y (furthermore, it applies to variables with no singular component in terms of the Lebesgue decomposition theorem [Rényi, 1959]; in contrast, Reshef et al. [2011] (SOM 6) theoretically analyzed only continuous variables).

2.2 Mutual Information Dimension

Based on the information dimension, the mutual information dimension is defined in an analogous fashion to the mutual information.

Definition 2 (Mutual Information Dimension) For a pair of random variables X and Y, the mutual information dimension, or MID, is defined as

MID(X; Y) := dim X + dim Y − dim XY.

We can easily check that MID is also given as

MID(X; Y) = lim_{k→∞} I(X_k; Y_k) / k    (1)

with the mutual information I(X_k; Y_k) of X_k and Y_k defined as I(X_k; Y_k) = Σ_{x,y∈Z} p_k(x, y) log( p_k(x, y) / (p_k(x) p_k(y)) ).

Informally, the larger MID(X; Y), the stronger the statistical dependence between X and Y. Figure 1 shows intuitive examples of the information dimension and MID.

Figure 1: Three intuitive examples. MID is one for a linear relationship (left: dim X = 1, dim Y = 1, dim XY = 1, so MID = 1 + 1 − 1 = 1), and MID is zero for independent relationships (center: dim X = 1, dim Y = 1, dim XY = 2, so MID = 1 + 1 − 2 = 0; right: dim X = 1, dim Y = 0, dim XY = 1, so MID = 1 + 0 − 1 = 0).
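For a finite sample, these level-k quantities can be approximated by replacing p_k with empirical cell frequencies (this is essentially the estimator introduced in Section 3). The following minimal Python sketch, with helper names of our own choosing, computes the empirical ratio I(X_k; Y_k)/k of equation (1) at a fixed level k:

```python
import numpy as np

def discretize(v, k):
    """Level-k cell index: z such that v lies in G_k(z) = [z, z+1) * 2^{-k}."""
    return np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)

def empirical_entropy(cells):
    """Empirical Shannon entropy (base 2) of discrete cell labels (1-D, or 2-D for pairs)."""
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def finite_level_mid(x, y, k):
    """Empirical I(X_k; Y_k) / k; by equation (1), MID is the limit of this ratio as k grows."""
    zx, zy = discretize(x, k), discretize(y, k)
    h_x = empirical_entropy(zx)
    h_y = empirical_entropy(zy)
    h_xy = empirical_entropy(np.stack([zx, zy], axis=1))
    return (h_x + h_y - h_xy) / k

# Toy check: a noiseless linear relationship versus two independent uniform variables.
rng = np.random.default_rng(0)
x = rng.random(100_000)
print(finite_level_mid(x, x, k=6))                    # close to 1 (dependent)
print(finite_level_mid(x, rng.random(100_000), k=6))  # close to 0 (independent)
```

For a noiseless functional relationship the ratio approaches 1 as k grows (within the range the sample size can support), whereas for independent variables it stays near 0, mirroring Figure 1.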

2.3 Properties of MID

It is obvious that MID is symmetric, MID(X; Y) = MID(Y; X), and that 0 ≤ MID(X; Y) ≤ 1 holds for any pair of random variables X and Y, as the mutual information I(X_k; Y_k) in equation (1) is always between 0 and k.

The following is the main theorem in this paper.

Theorem 1 MID(X; Y) = 0 if and only if X and Y are statistically independent.

Proof. (⇐) Assume that X and Y are statistically independent. From equation (1), it directly follows that MID(X; Y) = 0 since I(X_k; Y_k) = 0 for all k.

(⇒) Assume that X and Y are not statistically independent. We have MID(X; Y) = dim X − dim X|Y, where dim X|Y = lim_{k→∞} H(X_k|Y_k) / k with the conditional entropy H(X_k|Y_k) = Σ_{x,y∈Z} p_k(x, y) log( p_k(y) / p_k(x, y) ). Thus all we have to do is to prove dim X ≠ dim X|Y to show that MID(X; Y) ≠ 0 holds. From the definition of entropy, dim X = lim_{k→∞} −E log p(X_k) / k and dim X|Y = lim_{k→∞} −E log p(X_k|Y_k) / k, where E denotes the expectation. Since p(X_k) and p(X_k|Y_k) go to p(X) and p(X|Y) when k → ∞, −E log p(X_k) > −E log p(X_k|Y_k) if k is large enough. Thus dim X ≠ dim X|Y holds. □

Note that I(X_k; Y_k) does not converge to I(X; Y) …

2.4 Comparison to MIC

Here we show the relationship between MID and the maximal information coefficient (MIC) [Reshef et al., 2011]. Let D be a dataset sampled from a distribution (X, Y), where X and Y are continuous random variables. The dataset D is discretized by a grid G with x rows and y columns, that is, the x-values and the y-values of D are divided into x and y bins, respectively. The distribution D|G is induced by D on the grid G, where the probability mass of each cell is the fraction of data points falling into the cell.

The characteristic matrix M(D) is defined as M(D)_{x,y} = I*(D, x, y) / log min{x, y}, where I*(D, x, y) is the maximum of the mutual information I(D|G) over all grids G of x columns and y rows. Then, MIC is defined as MIC(D) = max_{xy < B(n)} M(D)_{x,y}. Here, when n goes to infinity, we …

3 Estimation of MID

We modify the definition of G_k(z) as follows: the domain dom(G_k) = {0, 1, ..., 2^k − 1}, G_k(z) := [z, z + 1) · 2^{−k} if z ∈ {0, 1, ..., 2^k − 2}, and G_k(z) := [z, z + 1] · 2^{−k} if z = 2^k − 1. Moreover, we define G_k(z1, z2) := G_k(z1) × G_k(z2) for a pair of integers z1, z2. We write G_k = { G_k(z) | z ∈ dom(G_k) } and G_k² = { G_k(z1, z2) | z1, z2 ∈ dom(G_k) }.
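The only change from the partition of R is that the right endpoint is folded into the last cell, so that every value in [0, 1] receives an index in dom(G_k). A small sketch of this clipped cell index (Python; the helper name is ours, and we assume the data have been rescaled to [0, 1], as the finite domain suggests):

```python
import numpy as np

def cell_index_unit(v, k):
    """Level-k cell index over dom(G_k) = {0, ..., 2^k - 1} for values in [0, 1].

    G_k(z) = [z, z+1) * 2^{-k} for z < 2^k - 1; the last cell G_k(2^k - 1) is closed,
    so v = 1 is mapped to the index 2^k - 1 instead of falling outside the domain.
    """
    z = np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)
    return np.clip(z, 0, 2 ** k - 1)
```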

Given a dataset D = {(x1, y1), (x2, y2), ..., (xn, yn)} sampled from a distribution (X, Y), we use the natural estimator X̂_k of X_k, whose probability is defined as

Pr(X̂_k = z) := #{ (x, y) ∈ D | x ∈ G_k(z) } / n

(#A denotes the number of elements of the set A) for all z ∈ dom(G_k), and similarly for Pr(Ŷ_k = z) and the joint probability Pr(X̂_k = z1, Ŷ_k = z2). This is the fraction of points in D falling into each cell. Trivially, X̂_k (resp. Ŷ_k) converges to X_k (resp. Y_k) almost surely when n → ∞.

3.2 Estimation via Sliding Windows

Informally, the information dimension dim X (resp. dim Y, dim XY) can be estimated as the gradient of the regression line over k vs. the entropy H(X̂_k) (resp. H(Ŷ_k), H(X̂_k, Ŷ_k)), since H(X̂_k) ≃ dim X · k + c such that the difference of the two sides goes to zero as k → ∞. This approach is the same as the standard one for the box-counting dimension [Falconer, 2003, Chapter 3]. However, since n is finite, only a finite range of k should be considered to obtain the regression line. In particular, H(X̂_k) is monotonically nondecreasing as k increases, and finally it converges to some constant value, where each cell contains at most one data point.

To effectively determine the range of k, we use the sliding window strategy to find the best-fitting regression line. We set the width w of each window to the maximum value satisfying

|G_w| = 2^w ≤ n and |G_w²| = 4^w ≤ n    (2)

for estimation of dim X or dim Y and of dim XY, respectively. If |G_w| = 2^w > n holds, then there exists a dataset D such that H(X̂_w) − H(X̂_{w−1}) = 0, and hence we should not take the point (w, H(X̂_w)) into account in the regression for estimating dim X. Hence w in equation (2) gives the upper bound of the width of each window that can be effectively applied to any dataset in estimation of the dimension.

3.3 MID Estimator

Here we give an estimator of MID using the sliding window strategy. For simplicity, we use the symbol Z to represent the random variables X, Y, or the pair (X, Y).

Let w be the maximum satisfying equation (2) and define

S_Z(w) := { s ∈ N | H(Ẑ_{k+1}) − H(Ẑ_k) ≠ 0 for any k ∈ {s, s + 1, ..., s + w − 1} }.

This is the set of starting points of the windows. Note that this set is always finite since H(Ẑ_k) converges. We denote the resulting coefficient by β(s) and the coefficient of determination by R²(s) when we apply linear regression to the set of w points { (s, H(Ẑ_s)), (s + 1, H(Ẑ_{s+1})), ..., (s + w − 1, H(Ẑ_{s+w−1})) }.

Definition 3 (Estimator of information dimension) Given a dataset D, define the estimator d(Z) (Z = X, Y, or the pair (X, Y)) of the information dimension dim Z as

d(Z) := β( arg max_{s ∈ S_Z(w)} R²(s) )

and define the estimator of MID as

MID(D) := d(X) + d(Y) − d(X, Y).

The estimator of MID is obtained by Algorithm 1, and Figure 2 shows an example of estimation. The average-case and the worst-case time complexities are O(n log n) and O(n²), respectively, since computation of H(Ẑ_k) takes O(n) for each k and it should be repeated O(log n) and O(n) times in the average and the worst case.

Algorithm 1 Estimation of MID
Input: Dataset D
Output: Estimator of MID
for each Z ∈ {X, Y, (X, Y)} do
    Compute H(Ẑ_1), H(Ẑ_2), ... until each point is isolated;
    Find the best-fitting regression line for k vs. H(Ẑ_k) for each window of width w (equation (2)) started from s ∈ S_Z(w);
    d(Z) ← the gradient of the obtained line;
end for
Output d(X) + d(Y) − d(X, Y);

Figure 2: Example of a dataset (left) and illustration of the estimation process of dim XY (right). A: The width w used for linear regression is adaptively determined from n. B: The best-fitting line; its slope is the estimator d(X, Y) of dim XY. C: H(X̂_2, Ŷ_2) from the partition on the left.
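Putting the natural estimator, equation (2), and Definition 3 together, the sketch below is our own Python re-implementation of Algorithm 1 (it is not the authors' C code; the data are rescaled to [0, 1], which is harmless since MID is translation and scaling invariant):

```python
import numpy as np

def cells(v, k):
    """Level-k cell indices over dom(G_k) = {0, ..., 2^k - 1}, last interval closed."""
    z = np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)
    return np.clip(z, 0, 2 ** k - 1)

def entropy(c):
    """Empirical entropy (base 2) of cell labels (1-D, or 2-D with one row per point)."""
    _, counts = np.unique(c, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def dimension(hs, w):
    """d(Z) of Definition 3: slope of the window of width w with the largest R^2(s)."""
    ks = np.arange(1, len(hs) + 1)
    best_r2, best_slope = -np.inf, 0.0
    for s in range(len(hs) - w + 1):
        kk, hh = ks[s:s + w], hs[s:s + w]
        if np.any(np.diff(hh) == 0):          # s must lie in S_Z(w): entropies keep growing
            continue
        slope, intercept = np.polyfit(kk, hh, 1)
        resid = hh - (slope * kk + intercept)
        r2 = 1.0 - resid.var() / hh.var()     # coefficient of determination R^2(s)
        if r2 > best_r2:
            best_r2, best_slope = r2, slope
    return best_slope

def mid(x, y):
    """MID(D) = d(X) + d(Y) - d(X, Y) estimated with the sliding-window strategy."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rx, ry = np.ptp(x), np.ptp(y)
    x = (x - x.min()) / (rx if rx > 0 else 1.0)   # rescale to [0, 1]
    y = (y - y.min()) / (ry if ry > 0 else 1.0)
    n = len(x)
    w1 = int(np.log2(n))       # largest w with 2^w <= n  (equation (2), for dim X, dim Y)
    w2 = int(np.log2(n) // 2)  # largest w with 4^w <= n  (equation (2), for dim XY)
    d = {}
    for name, w in (("x", w1), ("y", w1), ("xy", w2)):
        hs, k = [], 1
        while True:
            if name == "xy":
                c = np.stack([cells(x, k), cells(y, k)], axis=1)
            else:
                c = cells(x if name == "x" else y, k)
            hs.append(entropy(c))
            # stop once every point is isolated in its own cell (entropy stops growing)
            if len(np.unique(c, axis=0)) == n or (len(hs) > 1 and hs[-1] == hs[-2]):
                break
            k += 1
        d[name] = dimension(np.array(hs), min(w, len(hs)))
    return d["x"] + d["y"] - d["xy"]

# Usage: mid(x, y) should be close to 1 for a noiseless functional relationship
# between x and y, and close to 0 for statistically independent samples.
```

The loop over levels runs O(log n) times on typical data, matching the average-case O(n log n) complexity stated above.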
4 Experiments

We evaluate MID experimentally to check its efficiency and effectiveness in detecting various types of relationships, and compare it to other methods including MIC. We use both synthetic data and gene expression data. Moreover, we apply MID to feature selection in nonlinear regression for real data to demonstrate its potential in machine learning applications.

Environment: We used Mac OS X version 10.7.4 with 2 × 3 GHz Quad-Core Intel Xeon CPUs and 16 GB of memory. MID was implemented in C (the source code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/MID) and compiled with gcc 4.2.1. All experiments were performed in the R environment, version 2.15.1 [R Core Team, 2012].

Comparison partners: We used Pearson's correlation coefficient (PC), the distance correlation (DC) [Székely et al., 2007], mutual information (MI) [Kraskov et al., 2004], the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005], and MIC (heuristic approximation) [Reshef et al., 2011]. DC was calculated by the R energy package; MI and MIC by the official source code (MIC: http://www.exploredata.net, MI: http://www.klab.caltech.edu/~kraskov/MILCA); HSIC was implemented in R (computationally expensive processes were done by efficient packages written in C or Fortran). All parameters in MIC were set to the default values. Gaussian kernels with the width set to the median of the pairwise distances between the samples were used in HSIC, which is a popular heuristic [Gretton et al., 2005; 2008].

4.1 Scalability

We checked the scalability of each method. We used the linear relationship without any noise, and varied the number n of data points from 1,000 to 1,000,000.

Results of running time are shown in Figure 3 (we could not calculate DC and HSIC for more than n = 100,000 due to their high space complexity of O(n²)). These results clearly show that on large datasets, MID is much faster than other methods, including MIC, which can detect nonlinear dependence. Notice that the time complexities of MIC, DC, and HSIC are O(n²). PC is faster than MID, but unlike MID, it cannot detect nonlinear dependence.

Figure 3: Running time. Note that both axes have logarithmic scales. Points represent averages of 10 trials.

4.2 Effectivity in Measuring Dependence

First, we performed ROC curve analysis using synthetic data to check both precision and recall in the detection of various relationship types in the presence of uniformly distributed noise data. We prepared twelve types of relationships, shown in Figure 4a; the same datasets were used in [Reshef et al., 2011]. For the generation of noisy datasets, we fixed the ratio of uniformly distributed noise in a dataset in each experiment, and varied it from 0 to 1. Figure 4c shows examples of linear relationships with noise (n = 300). For instance, if the noise ratio is 0.5, 150 points come from the linear relationship, and the remaining 150 points are uniformly distributed noise.

We set the number of data points to n = 300. In each experiment, we generated 1000 datasets, where 500 are sampled from the relationship type with noise (positive datasets), and the remaining 500 sets are just noise, that is, composed of statistically independent variables (negative datasets). Then, we computed the AUC score from the ranking of these 1000 datasets. Thus both precision and recall are taken into account, while Reshef et al. [2011] evaluated only recall.

Figure 4b shows the results of the ROC curve analysis for each relationship type. We omit results for noise ratios from 0 to 0.4 since all methods except for PC have the maximum AUC in most cases. In more than half of the cases, MID showed the best performance. Specifically, the performance of MID was superior to MIC in all cases except for two sinusoidal types. Moreover, although MI showed better performance than MID in four cases, its performance was much worse than MID and MIC for sinusoidal types. In addition, MI is not normalized, hence it is difficult to illustrate the strength of dependence in real-world situations, as mentioned by Reshef et al. [2011]. In contrast to MI, MID showed reasonable performance for all relationship types. Since MID was shown to be much faster than MIC, DC, and MI, these results indicate that MID is the most appropriate among these methods for measuring dependence in massive data. Note that MIC's performance is often the worst except for sinusoidal types, which confirms the report by Simon and Tibshirani [2012].

Figure 4d shows the average of the means of AUC over all relationship types (noise ratios from 0.5 to 0.9 were taken into account). The score of MID was significantly higher than those of the other methods (the Wilcoxon signed-rank test, α = 0.05).

Next, we evaluated MID using the cdc15 yeast (Saccharomyces cerevisiae) gene expression dataset from [Spellman et al., 1998], which was used in [Reshef et al., 2011] and contains 4381 genes (available at http://www.exploredata.net/Downloads/Gene-Expression-Data-Set). This dataset was originally analyzed to identify genes whose expression levels oscillate during the cell cycle. We evaluated the dependence of each time series against time in the same way as [Reshef et al., 2011, SOM 4.7]. The number of data points is n = 23 for each gene.

Figures 5a and 5b illustrate scatter plots of MID and MIC versus Spellman's score. We observe that genes with high Spellman's scores tend to have high MID scores: the correlation coefficient (R) between Spellman's scores and MID was significantly higher than for the others (α = 0.05, Figure 5c).

To check both precision and recall on real data, we again performed ROC curve analysis. That is, we first ranked genes by measuring dependence, followed by computing the AUC score. Genes whose Spellman's scores are more than 1.314 were treated as positives, as suggested by Spellman et al. [1998]. Figure 5d shows the resulting AUC scores. It confirms that MID is the most effective compared to the other methods in measuring dependence. The AUC score of PC is low since it cannot detect nonlinear dependence. DC, MI, and HSIC also show low AUC scores. The reason might be the lack of power for detection due to the small n.

Figure 5: Associations of gene expression data. (a, b) MID and MIC versus Spellman's score. (c) Correlation coefficient (R). (d) AUC scores.
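Both ROC analyses above reduce to ranking positive and negative instances by a dependence score and computing the area under the ROC curve. A minimal sketch of this step (rank-based AUC, i.e., the normalized Mann-Whitney statistic; the function name is ours):

```python
import numpy as np

def auc_from_scores(pos_scores, neg_scores):
    """AUC of a ranking: probability that a random positive outranks a random negative."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = pos[:, None] - neg[None, :]          # all (positive, negative) pairs; ties count 1/2
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Section 4.2 protocol: score 500 noisy-relationship datasets and 500 independent-noise
# datasets with a dependence measure (e.g., the mid estimator above) and rank all 1000.
# auc = auc_from_scores([mid(x, y) for x, y in positives],
#                       [mid(x, y) for x, y in negatives])
```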

Figure 4: ROC curve analysis for synthetic data. (a) Relationship types (n = 300 each): Linear, Parabolic, Cubic, Exponential, Sinusoidal (non-Fourier), Sinusoidal (Varying), Two Lines, Line and Parabola, X, Ellipse, Linear/Periodic, and Two Sinusoidals. (b) AUC scores as a function of the noise ratio for each relationship type. (c) Examples of noisy data for the linear relationship (n = 300 each, noise ratios from 0 to 1). (d) The average of the means of AUC over all relationship types.

4.3 Effectivity in Feature Selection

Finally, we examined MID as a scoring method for feature selection. To illustrate the quality of feature selection, we performed kNN regression using the selected features to estimate the nonlinear target function, where the average of the k nearest neighbors is calculated as an estimator for each data point (k was set to 3). We measured the quality of prediction by the mean squared error (MSE). For each dataset, we generated training and test sets by 10-fold cross validation and performed feature selection using only the training set. Eight real datasets were collected from the UCI repository [Frank and Asuncion, 2010] and StatLib (http://lib.stat.cmu.edu/datasets/).

We adopted the standard filter method, that is, we measured dependence between each feature and the target variable, and produced the ranking of the features from the training set. We then applied kNN regression (based on the training set) to the test set repeatedly, varying the number of selected top ranked features. Results for each dataset are shown in Figure 6a, and the average R², which is the correlation between predicted and target values over all datasets, is summarized in Figure 6b. MID performed significantly better than any other method (the Wilcoxon signed-rank test, α = 0.05).

Figure 6: Feature selection for real data. (a) MSE (mean squared error) against the number of top ranked features for Abalone (4177, 7, UCI), Mpg (392, 7, UCI), Bodyfat (252, 14, StatLib), Triazines (185, 60, UCI), Pyrimidines (74, 27, UCI), Communities (2215, 101, UCI), Stock (950, 9, StatLib), and Space_ga (3107, 6, StatLib); n, d, and the data source are denoted in each parenthesis. For Triazines, Pyrimidines, and Communities, we only illustrate results for the first 20 top ranked features. Points represent averages of 10 trials. (b) The average of the means of R² over all datasets.
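A sketch of this filter protocol (Python; the helpers are ours, and any dependence measure, e.g. the mid estimator sketched in Section 3, can be plugged in as the scoring function):

```python
import numpy as np

def rank_features(X_train, y_train, score):
    """Filter method: score each feature against the target, return feature indices, best first."""
    scores = [score(X_train[:, j], y_train) for j in range(X_train.shape[1])]
    return np.argsort(scores)[::-1]

def knn_predict(X_train, y_train, X_test, k=3):
    """kNN regression: predict the average target value of the k nearest training points."""
    preds = []
    for x in X_test:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# For each cross-validation fold: rank features on the training fold only, keep the top m,
# and report the test MSE of kNN regression restricted to those features.
# top = rank_features(X_tr, y_tr, score=mid)[:m]
# mse = np.mean((knn_predict(X_tr[:, top], y_tr, X_te[:, top], k=3) - y_te) ** 2)
```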

5 Conclusion

We have developed a new method based on dimension theory for measuring dependence between real-valued random variables in large datasets, the mutual information dimension (MID). MID has desirable properties as a measure of dependence, that is, it is always between zero and one, it is zero if and only if the variables are statistically independent, and it is translation and scaling invariant. Moreover, we have constructed an efficient parameter-free estimation method for MID, whose average-case time complexity is O(n log n). This method has been experimentally shown to be scalable and effective for various types of relationships, and for nonlinear feature selection in regression.

MID overcomes the drawbacks of MIC: (1) MID contains no parameter; (2) MID is fast and scales up to massive datasets; and (3) MID shows superior performance in detecting relationships in the presence of uniformly distributed noise. In [Reshef et al., 2011], an additive noise model was considered; MID might not work well for this type of noise model. Since MID is based on dimension theory, it tends to judge that the true distribution is a two-dimensional plane, with no dependence. We will further explore the exciting connection between dimension theory and statistical dependence estimation to address this challenge.

References

[Bach and Jordan, 2003] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2003.

[Buzug et al., 1994] Th. Buzug, K. Pawelzik, J. von Stamm, and G. Pfister. Mutual information and global strange attractors in Taylor-Couette flow. Physica D: Nonlinear Phenomena, 72(4):343–350, 1994.

[Falconer, 2003] K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 2003.

[Frank and Asuncion, 2010] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[Gretton et al., 2005] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, Lecture Notes in Computer Science, pages 63–77. Springer, 2005.

[Gretton et al., 2008] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20, pages 585–592, 2008.

[Hastie et al., 2009] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

[Kraskov et al., 2004] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[Ott et al., 1984] E. Ott, W. D. Withers, and J. A. Yorke. Is the dimension of chaotic attractors invariant under coordinate changes? Journal of Statistical Physics, 36(5):687–697, 1984.

[Ott, 2002] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, 2nd edition, 2002.

[Prichard and Theiler, 1995] D. Prichard and J. Theiler. Generalized redundancies for time series analysis. Physica D: Nonlinear Phenomena, 84(3–4):476–493, 1995.

[R Core Team, 2012] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2012.

[Rényi, 1959] A. Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Hungarica, 10(1–2):193–215, 1959.

[Rényi, 1970] A. Rényi. Probability Theory. North-Holland Publishing Company and Akadémiai Kiadó, Publishing House of the Hungarian Academy of Sciences, 1970. Republished by Dover in 2007.

[Reshef et al., 2011] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.

[Reshef et al., 2013] D. N. Reshef, Y. A. Reshef, M. Mitzenmacher, and P. C. Sabeti. Equitability analysis of the maximal information coefficient, with comparisons. arXiv:1301.6314, 2013.

[Simon and Tibshirani, 2012] N. Simon and R. Tibshirani. Comment on "Detecting novel associations in large data sets" by Reshef et al., Science 2011, 2012. http://www-stat.stanford.edu/~tibs/reshef/comment.pdf.

[Spellman et al., 1998] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, 1998.

[Steuer et al., 2002] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig. The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl 2):S231–S240, 2002.

[Stigler, 1989] S. M. Stigler. Francis Galton's account of the invention of correlation. Statistical Science, 4(2):73–79, 1989.

[Székely and Rizzo, 2009] G. J. Székely and M. L. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009.

[Székely et al., 2007] G. J. Székely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.

[Wu and Verdú, 2010] Y. Wu and S. Verdú. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Transactions on Information Theory, 56(8):3721–3748, 2010.

[Wu and Verdú, 2011] Y. Wu and S. Verdú. MMSE dimension. IEEE Transactions on Information Theory, 57(8):4857–4879, 2011.
