Advanced Review

Energy distance

Maria L. Rizzo¹* and Gábor J. Székely²,³

Energy distance is a metric that measures the distance between the distributions of random vectors. Energy distance is zero if and only if the distributions are identical, thus it characterizes equality of distributions and provides a theoretical foundation for statistical inference and analysis. Energy statistics are functions of distances between observations in metric spaces. As a statistic, energy distance can be applied to measure the difference between a sample and a hypothesized distribution or the difference between two or more samples in arbitrary, not necessarily equal dimensions. The name energy is inspired by the close analogy with Newton's gravitational potential energy. Applications include testing independence by distance covariance, goodness-of-fit, nonparametric tests for equality of distributions and extension of analysis of variance, generalizations of clustering algorithms, change point analysis, feature selection, and more. © 2015 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2016, 8:27–38. doi: 10.1002/wics.1375

Keywords: Multivariate, goodness-of-fit, distance correlation, DISCO, independence

*Correspondence to: [email protected]
¹Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH, USA
²National Science Foundation, Arlington, VA, USA
³Rényi Institute of Mathematics, Hungarian Academy of Sciences, Hungary
Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Energy distance is a distance between probability distributions. The name 'energy' is motivated by analogy to the potential energy between objects in a gravitational space. The potential energy is zero if and only if the locations (the gravitational centers) of the two objects coincide, and it increases as their distance in space increases. One can apply the notion of potential energy to data as follows.

Let $X$ and $Y$ be independent random vectors in $\mathbb{R}^d$, with cumulative distribution functions (CDFs) $F$ and $G$, respectively. In what follows, $\|\cdot\|$ denotes the Euclidean norm (length) of its argument, $E$ denotes expected value, and a primed random variable $X'$ denotes an independent and identically distributed (iid) copy of $X$; that is, $X$ and $X'$ are iid. Similarly, $Y$ and $Y'$ are iid. The squared energy distance can be defined in terms of expected distances between the random vectors

$$D^2(F,G) := 2E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\| \ge 0;$$

and the energy distance between distributions $F$ and $G$ is defined as the square root of $D^2(F,G)$.

It can be shown that energy distance $D(F,G)$ satisfies all axioms of a metric, and in particular $D(F,G) = 0$ if and only if $F = G$. Therefore, energy distance provides a characterization of equality of distributions, and a theoretical basis for the development of statistical inference and multivariate analysis based on Euclidean distances. In this review we discuss several of the important applications and illustrate their implementation.

BACKGROUND AND APPLICATION OF ENERGY DISTANCE

The notion of 'energy statistics' (Refs 1–3) was introduced by Székely in 1984–1985 in a series of lectures given in Budapest, Hungary, and at MIT, Yale, and Columbia. As mentioned above, Székely's main idea and the name derive from the concept of Newton's potential energy. Statistical observations can be considered as objects in a metric space that are governed by a statistical potential energy that is zero if and only if an underlying statistical null hypothesis is true. Energy statistics (E-statistics) are a class of functions of distances between statistical observations. Several examples of one-sample, two-sample, and multi-sample energy statistics will be illustrated below.

Cramér's distance (Ref 4) is closely related, but only in the univariate (real-valued) case. For two real-valued random variables with CDFs $F$ and $G$, the squared energy distance is exactly twice the distance proposed by Harald Cramér:

$$D^2(F,G) = 2\int_{-\infty}^{\infty} \bigl(F(x) - G(x)\bigr)^2 \, dx.$$

However, the equivalence of energy distance with Cramér's distance does not extend to higher dimensions, because while energy distance is rotation invariant, Cramér's distance does not have this property.

A proof of the basic energy inequality, $D(F,G) \ge 0$ with equality if and only if $F = G$, follows from Ref 5 and also from Mattner's result (Ref 6). An alternate proof related to a result of Morgenstern (Ref 7) appears in Refs 8,9.

Application to testing for equality of two distributions appeared in Refs 8,10–12, and a multi-sample test for equality of distributions, 'distance components' (DISCO), appeared in Ref 13. Goodness-of-fit tests have been developed for multivariate normality (Refs 8,9), stable distribution (Ref 14), Pareto distribution (Ref 15), and multivariate Dirichlet distribution (Ref 16). Hierarchical clustering and a generalization of k-means clustering based on energy distance are developed in Refs 17,18.

Generalizations and interesting special cases of the energy distance have appeared in the recent literature; see Refs 19–22. Similar ideas related to energy distance and E-statistics were considered under the names N-distances and N-statistics in Ref 23; see also Ref 5. Measures of the energy distance type have also been studied in the machine learning literature (Ref 21).

TESTING FOR EQUAL DISTRIBUTIONS

Consider the null hypothesis that two random variables, $X$ and $Y$, have the same cumulative distribution function: $F = G$. For samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ from $X$ and $Y$, respectively, the E-statistic for testing this null hypothesis is

$$\mathcal{E}_{n,m}(X,Y) := 2A - B - C,$$

where $A$, $B$, and $C$ are simply averages of pairwise distances:

$$A = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\|x_i - y_j\|, \qquad B = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|x_i - x_j\|, \qquad C = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m}\|y_i - y_j\|.$$

One can prove (Refs 2,9) that $\mathcal{E}(X,Y) := D^2(F,G)$ is zero if and only if $X$ and $Y$ have the same distribution ($F = G$). It is also true that the statistic $\mathcal{E}_{n,m}$ is always non-negative. When the null hypothesis of equal distributions is true, the test statistic

$$T = \frac{nm}{n+m}\,\mathcal{E}_{n,m}(X,Y)$$

converges in distribution to a quadratic form of independent standard normal random variables. Under an alternative hypothesis the statistic $T$ tends to infinity stochastically as the sample sizes tend to infinity, so the energy test for equal distributions that rejects the null for large values of $T$ is consistent (Ref 12).

Because the null distribution of $T$ depends on the distributions of $X$ and $Y$, the test is implemented as a permutation test in the energy package (Ref 24), which is available for R (Ref 25) on the Comprehensive R Archive Network (CRAN) under general public license. The test is implemented in the function eqdist.etest. The data argument can be the data matrix or distance matrix of the pooled sample, with the default being the data matrix. The second argument is a vector of sample sizes. Here we test whether two species of iris data differ in the distribution of their four-dimensional measurements. In this example, $n = m = 50$ and we use 999 permutation replicates for the test decision.

The energy statistic can also be computed on its own, without the permutation test.
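As an illustration of the two-sample statistic $\mathcal{E}_{n,m} = 2A - B - C$, the scaled statistic $T$, and a permutation test decision, here is a minimal sketch in plain Python. It is not the energy package's implementation: the function names (`mean_pairwise_distance`, `energy_statistic`, `scaled_statistic`, `permutation_test`) and the `replicates` and `seed` parameters are hypothetical, observations are assumed to be tuples of floats, and `math.dist` assumes Python 3.8 or later. The p-value uses the common (count + 1) / (replicates + 1) convention, which is an assumption rather than something stated in the text.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x in xs, y in ys)."""
    total = sum(math.dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_statistic(xs, ys):
    """Two-sample E-statistic 2A - B - C built from averaged pairwise distances."""
    a = mean_pairwise_distance(xs, ys)  # between-sample average
    b = mean_pairwise_distance(xs, xs)  # within-sample average for xs
    c = mean_pairwise_distance(ys, ys)  # within-sample average for ys
    return 2.0 * a - b - c


def scaled_statistic(xs, ys):
    """Test statistic T = nm / (n + m) * E_{n,m}."""
    n, m = len(xs), len(ys)
    return n * m / (n + m) * energy_statistic(xs, ys)


def permutation_test(xs, ys, replicates=199, seed=0):
    """Return (observed T, permutation p-value); large T rejects the null."""
    rng = random.Random(seed)
    observed = scaled_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(replicates):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if scaled_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # (count + 1) / (replicates + 1) convention, assumed for this sketch
    return observed, (exceed + 1) / (replicates + 1)
```

A call such as `permutation_test(sample_x, sample_y, replicates=999)` would mirror the 999 replicates used in the text. The double loops cost O((n + m)²) distance evaluations per replicate, so this form is only practical for small samples.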
The E-statistic is not standardized, so it is reported only for reference. To interpret the value, one should normalize the statistic. One way of doing this is to divide by an estimate of $E\|X-Y\|$. Note that if

$$H := \frac{D^2(F_X, F_Y)}{2E\|X-Y\|} = \frac{2E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\|}{2E\|X-Y\|},$$

then $0 \le H \le 1$ with $H = 0$ if and only if $X$ and $Y$ are identically distributed.

For background on permutation tests, see Ref 26 or Ref 27.

For more details, applications, and power comparisons see Refs 11,12 and the documentation included with the energy package. The same functions are generalized to handle multi-sample problems, discussed below.

MULTI-SAMPLE ENERGY STATISTICS

Distance Components: A Nonparametric Extension of ANOVA

Analogous to the ANOVA decomposition of variance, we partition the total dispersion of the pooled samples into between and within components, called distance components (DISCO).

The decomposition is stated for a generalization of the energy distance. The characterization of equality of distributions by energy distance also holds if we replace Euclidean distance by $\|X - Y\|^{\alpha}$, where $0 < \alpha < 2$. The characterization does not hold if $\alpha = 2$ because $2E\|X-Y\|^2 - E\|X-X'\|^2 - E\|Y-Y'\|^2 = 0$ whenever $EX = EY$. We denote the corresponding two-sample energy statistic by $\mathcal{E}^{(\alpha)}_{n,m}$.

Let $A = \{a_1, \dots, a_{n_1}\}$, $B = \{b_1, \dots, b_{n_2}\}$ be two samples, and define

$$g_\alpha(A,B) := \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{m=1}^{n_2}\|a_i - b_m\|^{\alpha}, \qquad (1)$$

for $0 < \alpha \le 2$. If $A_1, \dots, A_K$ are samples of sizes $n_1, n_2, \dots, n_K$, respectively, and $N = \sum_{j=1}^{K} n_j$, the within-sample dispersion statistic is

$$W_\alpha = W_\alpha(A_1, \dots, A_K) = \sum_{j=1}^{K}\frac{n_j}{2}\, g_\alpha(A_j, A_j), \qquad (2)$$

and the total dispersion of the observed response is

$$T_\alpha = T_\alpha(A_1, \dots, A_K) = \frac{N}{2}\, g_\alpha(A, A), \qquad (3)$$

where $A$ is the pooled sample. The between-sample energy statistic is

$$S_\alpha = \sum_{1 \le j < k \le K}\left(\frac{n_j + n_k}{2N}\right)\left[\frac{n_j n_k}{n_j + n_k}\,\mathcal{E}^{(\alpha)}_{n_j,n_k}(A_j, A_k)\right] = \sum_{1 \le j < k \le K}\frac{n_j n_k}{2N}\Bigl(2 g_\alpha(A_j, A_k) - g_\alpha(A_j, A_j) - g_\alpha(A_k, A_k)\Bigr). \qquad (4)$$
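The dispersion statistics can be checked numerically on small samples. The following sketch, again in plain Python rather than the energy package's own code, evaluates $g_\alpha$, the total dispersion, the within-sample dispersion, and the between-sample statistic as written in the formulas above. The names `g_alpha` and `disco_components` are hypothetical, observations are assumed to be tuples of floats of equal dimension, and `math.dist` assumes Python 3.8 or later.

```python
import math


def g_alpha(a, b, alpha=1.0):
    """Mean of ||a_i - b_m||**alpha over all pairs of observations."""
    total = sum(math.dist(x, y) ** alpha for x in a for y in b)
    return total / (len(a) * len(b))


def disco_components(samples, alpha=1.0):
    """Return (total, between, within) dispersion for a list of samples."""
    pooled = [x for sample in samples for x in sample]
    big_n = len(pooled)

    # Total dispersion: N / 2 * g_alpha(pooled, pooled)
    total = big_n / 2.0 * g_alpha(pooled, pooled, alpha)

    # Within-sample dispersion: sum over samples of n_j / 2 * g_alpha(A_j, A_j)
    within = sum(len(s) / 2.0 * g_alpha(s, s, alpha) for s in samples)

    # Between-sample statistic: weighted sum of pairwise two-sample statistics
    between = 0.0
    for j in range(len(samples)):
        for k in range(j + 1, len(samples)):
            a_j, a_k = samples[j], samples[k]
            e_jk = (2.0 * g_alpha(a_j, a_k, alpha)
                    - g_alpha(a_j, a_j, alpha)
                    - g_alpha(a_k, a_k, alpha))
            between += len(a_j) * len(a_k) / (2.0 * big_n) * e_jk
    return total, between, within
```

With these definitions the total dispersion equals the sum of the between and within components up to floating-point rounding, which is the partition the text describes. Calling `disco_components` with `alpha=2.0` on two samples that share the same mean returns a between component of zero, which illustrates why $\alpha = 2$ does not characterize equality of distributions.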