Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Measuring Statistical Dependence via the Mutual Information Dimension

Mahito Sugiyama¹ and Karsten M. Borgwardt¹,²
¹Machine Learning and Computational Biology Research Group, Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Tübingen, Germany
²Zentrum für Bioinformatik, Eberhard Karls Universität Tübingen, Germany
{mahito.sugiyama, karsten.borgwardt}@tuebingen.mpg.de

Abstract

We propose to measure statistical dependence between two random variables by the mutual information dimension (MID), and present a scalable parameter-free estimation method for this task. Supported by sound dimension theory, our method gives an effective solution to the problem of detecting interesting relationships of variables in massive data, which is nowadays a heavily studied topic in many scientific disciplines. Different from classical Pearson's correlation coefficient, MID is zero if and only if two random variables are statistically independent, and it is translation and scaling invariant. We experimentally show superior performance of MID in detecting various types of relationships in the presence of noise data. Moreover, we illustrate that MID can be effectively used for feature selection in regression.

1 Introduction

How to measure dependence of variables is a classical yet fundamental problem in statistics. Starting with Galton's work on Pearson's correlation coefficient [Stigler, 1989] for measuring linear dependence, many techniques have been proposed, which are of fundamental importance in scientific fields such as physics, chemistry, biology, and economics. Machine learning and statistics have defined a number of techniques over the last decade which are designed to measure not only linear but also nonlinear dependences [Hastie et al., 2009]. Examples include kernel-based [Bach and Jordan, 2003; Gretton et al., 2005], mutual information-based [Kraskov et al., 2004; Steuer et al., 2002], and distance-based [Székely et al., 2007; Székely and Rizzo, 2009] methods. Their main limitation in practice, however, is the lack of scalability, or that one has to specify the type of nonlinear relationship one is interested in beforehand, which requires non-trivial parameter selection.

Recently, in Science, a distinct method called the maximal information coefficient (MIC) has been proposed by Reshef et al. [2011] (further analyzed in [Reshef et al., 2013]) that measures any kind of relationship between two continuous variables. They use the mutual information obtained by discretizing data and, intuitively, MIC is the maximum mutual information across a set of discretization levels.

However, it has some significant drawbacks: First, MIC depends on the input parameter B(n), which is a natural number specifying the maximum size of a grid used for discretization of data to obtain the entropy [Reshef et al., 2011, SOM 2.2.1]. This means that MIC becomes too small if we choose a small B(n) and too large if we choose a large B(n). Second, it has high computational cost, as it is exponential with respect to the number of data points (since computing the exact MIC is usually infeasible, Reshef et al. use heuristic dynamic programming for efficient approximation), and it is not suitable for large datasets. Third, as pointed out by Simon and Tibshirani [2012], it does not work well for relationship discovery in the presence of noise.

Here we propose to measure dependence between two random variables by the mutual information dimension, or MID, to overcome the above drawbacks of MIC and other machine learning based techniques. First, it contains no parameter in theory, and the estimation method proposed in this paper is also parameter-free. Second, its estimation is fast; the average-case time complexity is O(n log n), where n is the number of data points. Third, MID is experimentally shown to be more robust to uniformly distributed noise data than MIC and other methods.

The definition of MID is simple:

MID(X; Y) := dim X + dim Y − dim XY

for two random variables X and Y, where dim X and dim Y are the information dimensions of the random variables X and Y, respectively, and dim XY is that of the joint distribution of X and Y. The information dimension is one of the fractal dimensions [Ott, 2002] introduced by Rényi [1959; 1970], and its links to information theory were recently studied [Wu and Verdú, 2010; 2011]. Although MID itself is not a new concept, this is the first study that introduces MID as a measure of statistical dependence between random variables; to date, MID has only been used for chaotic time series analysis [Buzug et al., 1994; Prichard and Theiler, 1995].

MID has desirable properties as a measure of dependence: For every pair of random variables X and Y, (1) MID(X; Y) = MID(Y; X) and 0 ≤ MID(X; Y) ≤ 1; (2) MID(X; Y) = 0 if and only if X and Y are statistically independent (Theorem 1); and (3) MID is invariant with respect to translation and scaling (Theorem 3). Furthermore, MID is related to MIC and can be viewed as an extension of it (see Section 2.4).

To estimate MID from a dataset, we construct an efficient parameter-free method. Although the general strategy is the same as the standard method used for estimation of the box-counting dimension [Falconer, 2003], we aim to remove all parameters from the method using the sliding window strategy, where the width of a window is adaptively determined from the number of data points. The average-case and the worst-case complexities of our method are O(n log n) and O(n²) with the number of data points n, respectively, which is much faster than the estimation algorithm of MIC and the other state-of-the-art methods such as the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005] and the distance correlation [Székely et al., 2007], whose time complexities are O(n²). Hence MID scales up to massive datasets with millions of data points.

This paper is organized as follows: Section 2 introduces MID and analyzes it theoretically. Section 3 describes a practical estimation method of MID. The experimental results are presented in Section 4, followed by the conclusion in Section 5.

2 Mutual Information Dimension

In fractal and chaos theory, dimension has a crucial role since it represents the complexity of an object based on a "measurement" of it. We employ the information dimension in this paper, which belongs to a larger family of fractal dimensions [Falconer, 2003; Ott, 2002].

In the following, let N be the set of natural numbers including 0, Z the set of integers, and R the set of real numbers. The base of the logarithm is 2 throughout this paper.

We divide the real line R into intervals of the same width to obtain the entropy of a discretized variable. Formally,

G_k(z) := [z, z + 1) · 2^{−k} = { x ∈ R | z/2^k ≤ x < (z + 1)/2^k }

for an integer z ∈ Z. We call the resulting system G_k = { G_k(z) | z ∈ Z } the partition of R at level k. The partition for the two-dimensional space is constructed from G_k as G_k² = { G_k(z1) × G_k(z2) | z1, z2 ∈ Z }.

2.1 Information Dimension

Given a real-valued X, we construct for each level k the discrete random variable X_k over Z, whose probability is given by Pr(X_k = z) = Pr(X ∈ G_k(z)) for each z ∈ Z. We denote the probability mass function of X_k by p_k(x) = Pr(X_k = x), and that of the joint probability by p_k(x, y) = Pr(X_k = x and Y_k = y).

We introduce the information dimension, which intuitively shows the complexity of a random variable X as the ratio comparing the change of the entropy to the change in scale.

Definition 1 (Information Dimension [Rényi, 1959]) The information dimension of X is defined as

dim X := lim_{k→∞} H(X_k) / (− log 2^{−k}) = lim_{k→∞} H(X_k) / k,

where H(X_k) denotes the entropy of X_k, defined by H(X_k) = − Σ_{x∈Z} p_k(x) log p_k(x).

The information dimension for a pair of two real-valued variables X and Y is naturally defined as dim XY := lim_{k→∞} H(X_k, Y_k) / k, where H(X_k, Y_k) = − Σ_{x∈Z} Σ_{y∈Z} p_k(x, y) log p_k(x, y) is the joint entropy of X_k and Y_k. Informally, the information dimension indicates how much a variable fills the space, and this property enables us to measure statistical dependence. Notice that

0 ≤ dim X ≤ 1, 0 ≤ dim Y ≤ 1, and 0 ≤ dim XY ≤ 2

hold since 0 ≤ H(X_k) ≤ k for each k and 0 ≤ H(X_k, Y_k) ≤ H(X_k) + H(Y_k) ≤ 2k.

In this paper, we always assume that dim X and dim Y exist and that X and Y are Borel-measurable. Our formulation applies to pairs of continuous random variables X and Y (furthermore, it applies to variables with no singular component in terms of the Lebesgue decomposition theorem [Rényi, 1959]; in contrast, Reshef et al. [2011] (SOM 6) theoretically analyzed only continuous variables).

2.2 Mutual Information Dimension

Based on the information dimension, the mutual information dimension is defined in an analogous fashion to the mutual information.

Definition 2 (Mutual Information Dimension) For a pair of random variables X and Y, the mutual information dimension, or MID, is defined as

MID(X; Y) := dim X + dim Y − dim XY.

We can easily check that MID is also given as

MID(X; Y) = lim_{k→∞} I(X_k; Y_k) / k    (1)

with the mutual information I(X_k; Y_k) of X_k and Y_k defined as I(X_k; Y_k) = Σ_{x,y∈Z} p_k(x, y) log( p_k(x, y) / (p_k(x) p_k(y)) ).

Informally, the larger MID(X; Y), the stronger the statistical dependence between X and Y. Figure 1 shows intuitive examples of the information dimension and MID.

Figure 1: Three intuitive examples. MID is one for a linear relationship (left: dim X = 1, dim Y = 1, dim XY = 1, so MID = 1 + 1 − 1 = 1), and MID is zero for independent relationships (center: dim X = 1, dim Y = 1, dim XY = 2, so MID = 1 + 1 − 2 = 0; right: dim X = 1, dim Y = 0, dim XY = 1, so MID = 1 + 0 − 1 = 0).
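For a finite sample, these level-k quantities can be approximated by replacing p_k with empirical cell frequencies (this is essentially the estimator introduced in Section 3). The following minimal Python sketch, with helper names of our own choosing, computes the empirical ratio I(X_k; Y_k)/k of equation (1) at a fixed level k:

```python
import numpy as np

def discretize(v, k):
    """Level-k cell index: z such that v lies in G_k(z) = [z, z+1) * 2^{-k}."""
    return np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)

def empirical_entropy(cells):
    """Empirical Shannon entropy (base 2) of discrete cell labels (1-D, or 2-D for pairs)."""
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def finite_level_mid(x, y, k):
    """Empirical I(X_k; Y_k) / k; by equation (1), MID is the limit of this ratio as k grows."""
    zx, zy = discretize(x, k), discretize(y, k)
    h_x = empirical_entropy(zx)
    h_y = empirical_entropy(zy)
    h_xy = empirical_entropy(np.stack([zx, zy], axis=1))
    return (h_x + h_y - h_xy) / k

# Toy check: a noiseless linear relationship versus two independent uniform variables.
rng = np.random.default_rng(0)
x = rng.random(100_000)
print(finite_level_mid(x, x, k=6))                    # close to 1 (dependent)
print(finite_level_mid(x, rng.random(100_000), k=6))  # close to 0 (independent)
```

For a noiseless functional relationship the ratio approaches 1 as k grows (within the range the sample size can support), whereas for independent variables it stays near 0, mirroring Figure 1.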

2.3 Properties of MID

It is obvious that MID is symmetric, MID(X; Y) = MID(Y; X), and that 0 ≤ MID(X; Y) ≤ 1 holds for any pair of random variables X and Y, as the mutual information I(X_k; Y_k) in equation (1) is always between 0 and k.

The following is the main theorem in this paper.

Theorem 1 MID(X; Y) = 0 if and only if X and Y are statistically independent.

Proof. (⇐) Assume that X and Y are statistically independent. From equation (1), it directly follows that MID(X; Y) = 0 since I(X_k; Y_k) = 0 for all k.

(⇒) Assume that X and Y are not statistically independent. We have MID(X; Y) = dim X − dim X|Y, where dim X|Y = lim_{k→∞} H(X_k|Y_k) / k with the conditional entropy H(X_k|Y_k) = Σ_{x,y∈Z} p_k(x, y) log( p_k(y) / p_k(x, y) ). Thus all we have to do is to prove dim X ≠ dim X|Y to show that MID(X; Y) ≠ 0 holds. From the definition of entropy, dim X = lim_{k→∞} −E log p(X_k) / k and dim X|Y = lim_{k→∞} −E log p(X_k|Y_k) / k, where E denotes the expectation. Since p(X_k) and p(X_k|Y_k) go to p(X) and p(X|Y) when k → ∞, −E log p(X_k) > −E log p(X_k|Y_k) if k is large enough. Thus dim X ≠ dim X|Y holds. □

Note that I(X_k; Y_k) does not converge to I(X; Y) …

2.4 Comparison to MIC

Here we show the relationship between MID and the maximal information coefficient (MIC) [Reshef et al., 2011]. Let D be a dataset sampled from a distribution (X, Y), where X and Y are continuous random variables. The dataset D is discretized by a grid G with x rows and y columns, that is, the x-values and the y-values of D are divided into x and y bins, respectively. The distribution D|G is induced by D on the grid G, where the probability mass of each cell is the fraction of data points falling into the cell.

The characteristic matrix M(D) is defined as M(D)_{x,y} = I*(D, x, y) / log min{x, y}, where I*(D, x, y) is the maximum of the mutual information I(D|G) over all grids G of x columns and y rows. Then, MIC is defined as MIC(D) = max_{xy < B(n)} M(D)_{x,y}. Here, when n goes to infinity, we …

3 Estimation of MID

We modify the definition of G_k(z) as follows: the domain dom(G_k) = {0, 1, ..., 2^k − 1}, G_k(z) := [z, z + 1) · 2^{−k} if z ∈ {0, 1, ..., 2^k − 2}, and G_k(z) := [z, z + 1] · 2^{−k} if z = 2^k − 1. Moreover, we define G_k(z1, z2) := G_k(z1) × G_k(z2) for a pair of integers z1, z2. We write G_k = { G_k(z) | z ∈ dom(G_k) } and G_k² = { G_k(z1, z2) | z1, z2 ∈ dom(G_k) }.
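The only change from the partition of R is that the right endpoint is folded into the last cell, so that every value in [0, 1] receives an index in dom(G_k). A small sketch of this clipped cell index (Python; the helper name is ours, and we assume the data have been rescaled to [0, 1], as the finite domain suggests):

```python
import numpy as np

def cell_index_unit(v, k):
    """Level-k cell index over dom(G_k) = {0, ..., 2^k - 1} for values in [0, 1].

    G_k(z) = [z, z+1) * 2^{-k} for z < 2^k - 1; the last cell G_k(2^k - 1) is closed,
    so v = 1 is mapped to the index 2^k - 1 instead of falling outside the domain.
    """
    z = np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)
    return np.clip(z, 0, 2 ** k - 1)
```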

Given a dataset D = {(x1, y1), (x2, y2), ..., (xn, yn)} sampled from a distribution (X, Y), we use the natural estimator X̂_k of X_k, whose probability is defined as

Pr(X̂_k = z) := #{ (x, y) ∈ D | x ∈ G_k(z) } / n

(#A denotes the number of elements of the set A) for all z ∈ dom(G_k), and similarly for Pr(Ŷ_k = z) and the joint probability Pr(X̂_k = z1, Ŷ_k = z2). This is the fraction of points in D falling into each cell. Trivially, X̂_k (resp. Ŷ_k) converges to X_k (resp. Y_k) almost surely when n → ∞.

3.2 Estimation via Sliding Windows

Informally, the information dimension dim X (resp. dim Y, dim XY) can be estimated as the gradient of the regression line over k vs. the entropy H(X̂_k) (resp. H(Ŷ_k), H(X̂_k, Ŷ_k)), since H(X̂_k) ≃ dim X · k + c such that the difference of the two sides goes to zero as k → ∞. This approach is the same as the standard one for the box-counting dimension [Falconer, 2003, Chapter 3]. However, since n is finite, only a finite range of k should be considered to obtain the regression line. In particular, H(X̂_k) is monotonically nondecreasing as k increases, and finally it converges to some constant value, where each cell contains at most one data point.

To effectively determine the range of k, we use the sliding window strategy to find the best-fitting regression line. We set the width w of each window to the maximum value satisfying

|G_w| = 2^w ≤ n and |G_w²| = 4^w ≤ n    (2)

for estimation of dim X or dim Y and of dim XY, respectively. If |G_w| = 2^w > n holds, then there exists a dataset D such that H(X̂_w) − H(X̂_{w−1}) = 0, and hence we should not take the point (w, H(X̂_w)) into account in the regression for estimating dim X. Hence w in equation (2) gives the upper bound of the width of each window that can be effectively applied to any dataset in estimation of the dimension.

3.3 MID Estimator

Here we give an estimator of MID using the sliding window strategy. For simplicity, we use the symbol Z to represent the random variables X, Y, or the pair (X, Y).

Let w be the maximum satisfying equation (2) and define

S_Z(w) := { s ∈ N | H(Ẑ_{k+1}) − H(Ẑ_k) ≠ 0 for any k ∈ {s, s + 1, ..., s + w − 1} }.

This is the set of starting points of the windows. Note that this set is always finite since H(Ẑ_k) converges. We denote the resulting coefficient by β(s) and the coefficient of determination by R²(s) when we apply linear regression to the set of w points { (s, H(Ẑ_s)), (s + 1, H(Ẑ_{s+1})), ..., (s + w − 1, H(Ẑ_{s+w−1})) }.

Definition 3 (Estimator of information dimension) Given a dataset D, define the estimator d(Z) (Z = X, Y, or the pair (X, Y)) of the information dimension dim Z as

d(Z) := β( arg max_{s ∈ S_Z(w)} R²(s) )

and define the estimator of MID as

MID(D) := d(X) + d(Y) − d(X, Y).

The estimator of MID is obtained by Algorithm 1, and Figure 2 shows an example of estimation. The average-case and the worst-case time complexities are O(n log n) and O(n²), respectively, since computation of H(Ẑ_k) takes O(n) for each k and it should be repeated O(log n) and O(n) times in the average and the worst case.

Algorithm 1 Estimation of MID
Input: Dataset D
Output: Estimator of MID
for each Z ∈ {X, Y, (X, Y)} do
    Compute H(Ẑ_1), H(Ẑ_2), ... until each point is isolated;
    Find the best-fitting regression line for k vs. H(Ẑ_k) for each window of width w (equation (2)) started from s ∈ S_Z(w);
    d(Z) ← the gradient of the obtained line;
end for
Output d(X) + d(Y) − d(X, Y);

Figure 2: Example of a dataset (left) and illustration of the estimation process of dim XY (right). A: The width w used for linear regression is adaptively determined from n. B: The best-fitting line; its slope is the estimator d(X, Y) of dim XY. C: H(X̂_2, Ŷ_2) from the partition on the left.
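Putting the natural estimator, equation (2), and Definition 3 together, the sketch below is our own Python re-implementation of Algorithm 1 (it is not the authors' C code; the data are rescaled to [0, 1], which is harmless since MID is translation and scaling invariant):

```python
import numpy as np

def cells(v, k):
    """Level-k cell indices over dom(G_k) = {0, ..., 2^k - 1}, last interval closed."""
    z = np.floor(np.asarray(v, dtype=float) * 2 ** k).astype(np.int64)
    return np.clip(z, 0, 2 ** k - 1)

def entropy(c):
    """Empirical entropy (base 2) of cell labels (1-D, or 2-D with one row per point)."""
    _, counts = np.unique(c, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def dimension(hs, w):
    """d(Z) of Definition 3: slope of the window of width w with the largest R^2(s)."""
    ks = np.arange(1, len(hs) + 1)
    best_r2, best_slope = -np.inf, 0.0
    for s in range(len(hs) - w + 1):
        kk, hh = ks[s:s + w], hs[s:s + w]
        if np.any(np.diff(hh) == 0):          # s must lie in S_Z(w): entropies keep growing
            continue
        slope, intercept = np.polyfit(kk, hh, 1)
        resid = hh - (slope * kk + intercept)
        r2 = 1.0 - resid.var() / hh.var()     # coefficient of determination R^2(s)
        if r2 > best_r2:
            best_r2, best_slope = r2, slope
    return best_slope

def mid(x, y):
    """MID(D) = d(X) + d(Y) - d(X, Y) estimated with the sliding-window strategy."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rx, ry = np.ptp(x), np.ptp(y)
    x = (x - x.min()) / (rx if rx > 0 else 1.0)   # rescale to [0, 1]
    y = (y - y.min()) / (ry if ry > 0 else 1.0)
    n = len(x)
    w1 = int(np.log2(n))       # largest w with 2^w <= n  (equation (2), for dim X, dim Y)
    w2 = int(np.log2(n) // 2)  # largest w with 4^w <= n  (equation (2), for dim XY)
    d = {}
    for name, w in (("x", w1), ("y", w1), ("xy", w2)):
        hs, k = [], 1
        while True:
            if name == "xy":
                c = np.stack([cells(x, k), cells(y, k)], axis=1)
            else:
                c = cells(x if name == "x" else y, k)
            hs.append(entropy(c))
            # stop once every point is isolated in its own cell (entropy stops growing)
            if len(np.unique(c, axis=0)) == n or (len(hs) > 1 and hs[-1] == hs[-2]):
                break
            k += 1
        d[name] = dimension(np.array(hs), min(w, len(hs)))
    return d["x"] + d["y"] - d["xy"]

# Usage: mid(x, y) should be close to 1 for a noiseless functional relationship
# between x and y, and close to 0 for statistically independent samples.
```

The loop over levels runs O(log n) times on typical data, matching the average-case O(n log n) complexity stated above.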
4 Experiments

We evaluate MID experimentally to check its efficiency and effectiveness in detecting various types of relationships, and compare it to other methods including MIC. We use both synthetic data and gene expression data. Moreover, we apply MID to feature selection in nonlinear regression for real data to demonstrate its potential in machine learning applications.

Environment: We used Mac OS X version 10.7.4 with 2 × 3 GHz Quad-Core Intel Xeon CPUs and 16 GB of memory. MID was implemented in C (the source code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/MID) and compiled with gcc 4.2.1. All experiments were performed in the R environment, version 2.15.1 [R Core Team, 2012].

Comparison partners: We used Pearson's correlation coefficient (PC), the distance correlation (DC) [Székely et al., 2007], mutual information (MI) [Kraskov et al., 2004], the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005], and MIC (heuristic approximation) [Reshef et al., 2011]. DC was calculated by the R energy package; MI and MIC by the official source code (MIC: http://www.exploredata.net, MI: http://www.klab.caltech.edu/~kraskov/MILCA); HSIC was implemented in R (computationally expensive processes were done by efficient packages written in C or Fortran). All parameters in MIC were set to the default values. Gaussian kernels with the width set to the median of the pairwise distances between the samples were used in HSIC, which is a popular heuristic [Gretton et al., 2005; 2008].

4.1 Scalability

We checked the scalability of each method. We used the linear relationship without any noise, and varied the number n of data points from 1,000 to 1,000,000.

Results of running time are shown in Figure 3 (we could not calculate DC and HSIC for more than n = 100,000 due to their high space complexity of O(n²)). These results clearly show that on large datasets, MID is much faster than other methods, including MIC, which can detect nonlinear dependence. Notice that the time complexities of MIC, DC, and HSIC are O(n²). PC is faster than MID, but unlike MID, it cannot detect nonlinear dependence.

Figure 3: Running time. Note that both axes have logarithmic scales. Points represent averages of 10 trials.

4.2 Effectivity in Measuring Dependence

First, we performed ROC curve analysis using synthetic data to check both precision and recall in the detection of various relationship types in the presence of uniformly distributed noise data. We prepared twelve types of relationships, shown in Figure 4a; the same datasets were used in [Reshef et al., 2011]. For the generation of noisy datasets, we fixed the ratio of uniformly distributed noise in a dataset in each experiment, and varied it from 0 to 1. Figure 4c shows examples of linear relationships with noise (n = 300). For instance, if the noise ratio is 0.5, 150 points come from the linear relationship, and the remaining 150 points are uniformly distributed noise.

We set the number of data points to n = 300. In each experiment, we generated 1000 datasets, where 500 are sampled from the relationship type with noise (positive datasets), and the remaining 500 sets are just noise, that is, composed of statistically independent variables (negative datasets). Then, we computed the AUC score from the ranking of these 1000 datasets. Thus both precision and recall are taken into account, while Reshef et al. [2011] evaluated only recall.

Figure 4b shows the results of the ROC curve analysis for each relationship type. We omit results for noise ratios from 0 to 0.4 since all methods except for PC have the maximum AUC in most cases. In more than half of the cases, MID showed the best performance. Specifically, the performance of MID was superior to MIC in all cases except for two sinusoidal types. Moreover, although MI showed better performance than MID in four cases, its performance was much worse than MID and MIC for sinusoidal types. In addition, MI is not normalized, hence it is difficult to illustrate the strength of dependence in real-world situations, as mentioned by Reshef et al. [2011]. In contrast to MI, MID showed reasonable performance for all relationship types. Since MID was shown to be much faster than MIC, DC, and MI, these results indicate that MID is the most appropriate among these methods for measuring dependence in massive data. Note that MIC's performance is often the worst except for sinusoidal types, which confirms the report by Simon and Tibshirani [2012].

Figure 4d shows the average of the means of AUC over all relationship types (noise ratios from 0.5 to 0.9 were taken into account). The score of MID was significantly higher than those of the other methods (the Wilcoxon signed-rank test, α = 0.05).

Next, we evaluated MID using the cdc15 yeast (Saccharomyces cerevisiae) gene expression dataset from [Spellman et al., 1998], which was used in [Reshef et al., 2011] and contains 4381 genes (available at http://www.exploredata.net/Downloads/Gene-Expression-Data-Set). This dataset was originally analyzed to identify genes whose expression levels oscillate during the cell cycle. We evaluated the dependence of each time series against time in the same way as [Reshef et al., 2011, SOM 4.7]. The number of data points is n = 23 for each gene.

Figures 5a and 5b illustrate scatter plots of MID and MIC versus Spellman's score. We observe that genes with high Spellman's scores tend to have high MID scores: the correlation coefficient (R) between Spellman's scores and MID was significantly higher than for the others (α = 0.05, Figure 5c).

To check both precision and recall on real data, we again performed ROC curve analysis. That is, we first ranked genes by measuring dependence, followed by computing the AUC score. Genes whose Spellman's scores are more than 1.314 were treated as positives, as suggested by Spellman et al. [1998]. Figure 5d shows the resulting AUC scores. It confirms that MID is the most effective compared to the other methods in measuring dependence. The AUC score of PC is low since it cannot detect nonlinear dependence. DC, MI, and HSIC also show low AUC scores. The reason might be the lack of power for detection due to the small n.

Figure 5: Associations of gene expression data. (a, b) MID and MIC versus Spellman's score. (c) Correlation coefficient (R). (d) AUC scores.
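Both ROC analyses above reduce to ranking positive and negative instances by a dependence score and computing the area under the ROC curve. A minimal sketch of this step (rank-based AUC, i.e., the normalized Mann-Whitney statistic; the function name is ours):

```python
import numpy as np

def auc_from_scores(pos_scores, neg_scores):
    """AUC of a ranking: probability that a random positive outranks a random negative."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = pos[:, None] - neg[None, :]          # all (positive, negative) pairs; ties count 1/2
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Section 4.2 protocol: score 500 noisy-relationship datasets and 500 independent-noise
# datasets with a dependence measure (e.g., the mid estimator above) and rank all 1000.
# auc = auc_from_scores([mid(x, y) for x, y in positives],
#                       [mid(x, y) for x, y in negatives])
```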

Figure 4: ROC curve analysis for synthetic data. (a) Relationship types (n = 300 each): Linear, Parabolic, Cubic, Exponential, Sinusoidal (non-Fourier), Sinusoidal (Varying), Two Lines, Line and Parabola, X, Ellipse, Linear/Periodic, and Two Sinusoidals. (b) AUC scores as a function of the noise ratio for each relationship type. (c) Examples of noisy data for the linear relationship (n = 300 each, noise ratios from 0 to 1). (d) The average of the means of AUC over all relationship types.

4.3 Effectivity in Feature Selection

Finally, we examined MID as a scoring method for feature selection. To illustrate the quality of feature selection, we performed kNN regression using the selected features to estimate the nonlinear target function, where the average of the k nearest neighbors is calculated as an estimator for each data point (k was set to 3). We measured the quality of prediction by the mean squared error (MSE). For each dataset, we generated training and test sets by 10-fold cross validation and performed feature selection using only the training set. Eight real datasets were collected from the UCI repository [Frank and Asuncion, 2010] and StatLib (http://lib.stat.cmu.edu/datasets/).

We adopted the standard filter method, that is, we measured dependence between each feature and the target variable, and produced the ranking of the features from the training set. We then applied kNN regression (based on the training set) to the test set repeatedly, varying the number of selected top ranked features. Results for each dataset are shown in Figure 6a, and the average R², which is the correlation between predicted and target values over all datasets, is summarized in Figure 6b. MID performed significantly better than any other method (the Wilcoxon signed-rank test, α = 0.05).

Figure 6: Feature selection for real data. (a) MSE (mean squared error) against the number of top ranked features for Abalone (4177, 7, UCI), Mpg (392, 7, UCI), Bodyfat (252, 14, StatLib), Triazines (185, 60, UCI), Pyrimidines (74, 27, UCI), Communities (2215, 101, UCI), Stock (950, 9, StatLib), and Space_ga (3107, 6, StatLib); n, d, and the data source are denoted in each parenthesis. For Triazines, Pyrimidines, and Communities, we only illustrate results for the first 20 top ranked features. Points represent averages of 10 trials. (b) The average of the means of R² over all datasets.
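A sketch of this filter protocol (Python; the helpers are ours, and any dependence measure, e.g. the mid estimator sketched in Section 3, can be plugged in as the scoring function):

```python
import numpy as np

def rank_features(X_train, y_train, score):
    """Filter method: score each feature against the target, return feature indices, best first."""
    scores = [score(X_train[:, j], y_train) for j in range(X_train.shape[1])]
    return np.argsort(scores)[::-1]

def knn_predict(X_train, y_train, X_test, k=3):
    """kNN regression: predict the average target value of the k nearest training points."""
    preds = []
    for x in X_test:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# For each cross-validation fold: rank features on the training fold only, keep the top m,
# and report the test MSE of kNN regression restricted to those features.
# top = rank_features(X_tr, y_tr, score=mid)[:m]
# mse = np.mean((knn_predict(X_tr[:, top], y_tr, X_te[:, top], k=3) - y_te) ** 2)
```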

5 Conclusion

We have developed a new method based on dimension theory for measuring dependence between real-valued random variables in large datasets, the mutual information dimension (MID). MID has desirable properties as a measure of dependence, that is, it is always between zero and one, it is zero if and only if the variables are statistically independent, and it is translation and scaling invariant. Moreover, we have constructed an efficient parameter-free estimation method for MID, whose average-case time complexity is O(n log n). This method has been experimentally shown to be scalable and effective for various types of relationships, and for nonlinear feature selection in regression.

MID overcomes the drawbacks of MIC: (1) MID contains no parameter; (2) MID is fast and scales up to massive datasets; and (3) MID shows superior performance in detecting relationships in the presence of uniformly distributed noise. In [Reshef et al., 2011], an additive noise model was considered; MID might not work well for this type of noise model. Since MID is based on dimension theory, it tends to judge that the true distribution is a two-dimensional plane, with no dependence. We will further explore the exciting connection between dimension theory and statistical dependence estimation to address this challenge.

References

[Bach and Jordan, 2003] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2003.

[Buzug et al., 1994] Th. Buzug, K. Pawelzik, J. von Stamm, and G. Pfister. Mutual information and global strange attractors in Taylor-Couette flow. Physica D: Nonlinear Phenomena, 72(4):343–350, 1994.

[Falconer, 2003] K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 2003.

[Frank and Asuncion, 2010] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[Gretton et al., 2005] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, Lecture Notes in Computer Science, pages 63–77. Springer, 2005.

[Gretton et al., 2008] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, volume 20, pages 585–592, 2008.

[Hastie et al., 2009] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.

[Kraskov et al., 2004] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[Ott et al., 1984] E. Ott, W. D. Withers, and J. A. Yorke. Is the dimension of chaotic attractors invariant under coordinate changes? Journal of Statistical Physics, 36(5):687–697, 1984.

[Ott, 2002] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, 2nd edition, 2002.

[Prichard and Theiler, 1995] D. Prichard and J. Theiler. Generalized redundancies for time series analysis. Physica D: Nonlinear Phenomena, 84(3–4):476–493, 1995.

[R Core Team, 2012] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2012.

[Rényi, 1959] A. Rényi. On the dimension and entropy of probability distributions. Acta Mathematica Hungarica, 10(1–2):193–215, 1959.

[Rényi, 1970] A. Rényi. Probability Theory. North-Holland Publishing Company and Akadémiai Kiadó, Publishing House of the Hungarian Academy of Sciences, 1970. Republished by Dover in 2007.

[Reshef et al., 2011] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.

[Reshef et al., 2013] D. N. Reshef, Y. A. Reshef, M. Mitzenmacher, and P. C. Sabeti. Equitability analysis of the maximal information coefficient, with comparisons. arXiv:1301.6314, 2013.

[Simon and Tibshirani, 2012] N. Simon and R. Tibshirani. Comment on "Detecting novel associations in large data sets" by Reshef et al., Science 2011, 2012. http://www-stat.stanford.edu/~tibs/reshef/comment.pdf.

[Spellman et al., 1998] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, 1998.

[Steuer et al., 2002] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig. The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics, 18(suppl 2):S231–S240, 2002.

[Stigler, 1989] S. M. Stigler. Francis Galton's account of the invention of correlation. Statistical Science, 4(2):73–79, 1989.

[Székely and Rizzo, 2009] G. J. Székely and M. L. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 3(4):1236–1265, 2009.

[Székely et al., 2007] G. J. Székely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.

[Wu and Verdú, 2010] Y. Wu and S. Verdú. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Transactions on Information Theory, 56(8):3721–3748, 2010.

[Wu and Verdú, 2011] Y. Wu and S. Verdú. MMSE dimension. IEEE Transactions on Information Theory, 57(8):4857–4879, 2011.
