Advanced Review

Energy distance

Maria L. Rizzo¹* and Gábor J. Székely²,³

Energy distance is a metric that measures the distance between the distributions of random vectors. Energy distance is zero if and only if the distributions are identical; thus it characterizes equality of distributions and provides a theoretical foundation for statistical inference and analysis. Energy statistics are functions of distances between observations in metric spaces. As a statistic, energy distance can be applied to measure the difference between a sample and a hypothesized distribution, or the difference between two or more samples in arbitrary, not necessarily equal dimensions. The name energy is inspired by the close analogy with Newton's gravitational potential energy. Applications include testing independence by distance covariance, goodness-of-fit, nonparametric tests for equality of distributions and extension of analysis of variance, generalizations of clustering algorithms, change point analysis, feature selection, and more. © 2015 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2016, 8:27–38. doi: 10.1002/wics.1375

Keywords: multivariate, goodness-of-fit, distance covariance, DISCO, independence

*Correspondence to: [email protected]
¹Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH, USA
²National Science Foundation, Arlington, VA, USA
³Rényi Institute of Mathematics, Hungarian Academy of Sciences, Hungary
Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Energy distance is a distance between probability distributions. The name 'energy' is motivated by analogy to the potential energy between objects in a gravitational space. The potential energy is zero if and only if the locations (the gravitational centers) of the two objects coincide, and it increases as their distance in space increases.

One can apply the notion of potential energy to data as follows. Let X and Y be independent random vectors in ℝ^d, with cumulative distribution functions (CDFs) F and G, respectively. In what follows, ‖·‖ denotes the Euclidean norm (length) of its argument, E denotes expected value, and a primed random variable X′ denotes an independent and identically distributed (iid) copy of X; that is, X and X′ are iid. Similarly, Y and Y′ are iid. The squared energy distance can be defined in terms of expected distances between the random vectors:

    D²(F, G) := 2E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖ ≥ 0,

and the energy distance between distributions F and G is defined as the square root of D²(F, G). It can be shown that energy distance D(F, G) satisfies all axioms of a metric, and in particular D(F, G) = 0 if and only if F = G. Therefore, energy distance provides a characterization of equality of distributions, and a theoretical basis for the development of statistical inference and multivariate analysis based on Euclidean distances. In this review we discuss several of the important applications and illustrate their implementation.

Volume 8, January/February 2016. © 2015 Wiley Periodicals, Inc. wires.wiley.com/compstats

BACKGROUND AND APPLICATION OF ENERGY DISTANCE

The notion of 'energy statistics'^{1–3} was introduced by Székely in 1984–1985 in a series of lectures given in Budapest, Hungary, and at MIT, Yale, and Columbia. As mentioned above, Székely's main idea and the name derive from the concept of Newton's potential energy. Statistical observations can be considered as objects in a metric space that are governed by a statistical potential energy that is zero if and only if an underlying statistical null hypothesis is true. Energy statistics (E-statistics) are a class of functions of distances between statistical observations. Several examples of one-sample, two-sample, and multi-sample energy statistics will be illustrated below.

Cramér's distance^4 is closely related, but only in the univariate (real valued) case. For two real-valued random variables with CDFs F and G, the squared energy distance is exactly twice the distance proposed by Harald Cramér:

    D²(F, G) = 2 ∫_{−∞}^{∞} (F(x) − G(x))² dx.

However, the equivalence of energy distance with Cramér's distance cannot extend to higher dimensions, because while energy distance is rotation invariant, Cramér's distance does not have this property. A proof of the basic energy inequality, D(F, G) ≥ 0 with equality if and only if F = G, follows from Ref 5 and also from Mattner's result.^6 An alternate proof, related to a result of Morgenstern,^7 appears in Refs 8,9.

Application to testing for equality of two distributions appeared in Refs 8,10–12, as well as a multi-sample test for equality of distributions, 'distance components' (DISCO).^{13} Goodness-of-fit tests have been developed for multivariate normality,^{8,9} the stable distribution,^{14} the Pareto distribution,^{15} and the multivariate Dirichlet distribution.^{16} Hierarchical clustering and a generalization of k-means clustering based on energy distance are developed in Refs 17,18. Generalizations and interesting special cases of the energy distance have appeared in the recent literature; see Refs 19–22. A similar idea related to energy distance and E-statistics was considered as N-distances and N-statistics in Ref 23; see also Ref 5. Measures of the energy distance type have also been studied in the machine learning literature.^{21}

TESTING FOR EQUAL DISTRIBUTIONS

Consider the null hypothesis that two random variables, X and Y, have the same cumulative distribution functions: F = G. For samples x₁, …, x_n and y₁, …, y_m from X and Y, respectively, the E-statistic for testing this null hypothesis is

    E_{n,m}(X, Y) := 2A − B − C,

where A, B, and C are simply averages of pairwise distances:

    A = (1/(nm)) Σ_{i=1}^{n} Σ_{j=1}^{m} ‖x_i − y_j‖,
    B = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ‖x_i − x_j‖,
    C = (1/m²) Σ_{i=1}^{m} Σ_{j=1}^{m} ‖y_i − y_j‖.

One can prove^{2,9} that E(X, Y) := D²(F, G) is zero if and only if X and Y have the same distribution (F = G). It is also true that the statistic E_{n,m} is always non-negative. When the null hypothesis of equal distributions is true, the test statistic

    T = (nm/(n + m)) E_{n,m}(X, Y)

converges in distribution to a quadratic form of independent standard normal random variables. Under an alternative hypothesis the statistic T tends to infinity stochastically as the sample sizes tend to infinity, so the energy test for equal distributions that rejects the null for large values of T is consistent.^{12}

Because the null distribution of T depends on the distributions of X and Y, the test is implemented as a permutation test in the energy package,^{24} which is available for R^{25} on the Comprehensive R Archive Network (CRAN) under general public license. The test is implemented in the function eqdist.etest. The data argument can be the data matrix or distance matrix of the pooled sample, with the default being the data matrix. The second argument is a vector of sample sizes. Here we test whether two species of iris data differ in the distribution of their four-dimensional measurements. In this example, n = m = 50 and we use 999 permutation replicates for the test decision.
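As an illustration of the formulas above, the statistic E_{n,m} = 2A − B − C and its permutation test can be sketched in Python with NumPy. This is an illustrative sketch with its own function names, not the energy package's implementation:

```python
import numpy as np

def _as_matrix(x):
    # treat a 1-D sample as a column of scalar observations
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x

def energy_stat(x, y):
    """Two-sample E-statistic E_{n,m} = 2A - B - C from pairwise Euclidean distances."""
    x, y = _as_matrix(x), _as_matrix(y)
    A = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).mean()
    B = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2).mean()
    C = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2).mean()
    return 2 * A - B - C

def energy_test(x, y, R=199, seed=None):
    """Permutation test of H0: F = G based on T = nm/(n+m) * E_{n,m}."""
    rng = np.random.default_rng(seed)
    x, y = _as_matrix(x), _as_matrix(y)
    n, m = len(x), len(y)
    pooled = np.vstack((x, y))
    t_obs = n * m / (n + m) * energy_stat(x, y)
    count = 1  # the observed statistic counts as one replicate
    for _ in range(R):
        k = rng.permutation(n + m)
        t = n * m / (n + m) * energy_stat(pooled[k[:n]], pooled[k[n:]])
        count += t >= t_obs
    return t_obs, count / (R + 1)
```

The returned p-value estimates the probability of a permutation statistic at least as large as the observed T under random relabeling of the pooled sample.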


The energy statistic can also be computed on its own, without running the test. The E-statistic is not standardized, so it is reported only for reference. To interpret the value, one should normalize the statistic. One way of doing this is to divide by an estimate of E‖X − Y‖. Note that if

    H = D²(F_X, F_Y) / (2E‖X − Y‖) = (2E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖) / (2E‖X − Y‖),

then 0 ≤ H ≤ 1, with H = 0 if and only if X and Y are identically distributed. For background on permutation tests, see Ref 26 or Ref 27.

For more details, applications, and power comparisons see Refs 11,12 and the documentation included with the energy package. The same functions are generalized to handle multi-sample problems, discussed below.

MULTI-SAMPLE ENERGY STATISTICS

Distance Components: A Nonparametric Extension of ANOVA

First, let us introduce a generalization of the energy distance. The characterization of equality of distributions by energy distance also holds if we replace Euclidean distance by ‖X − Y‖^α, where 0 < α < 2. The characterization does not hold if α = 2, because

    2E‖X − Y‖² − E‖X − X′‖² − E‖Y − Y′‖² = 0

whenever EX = EY. We denote the corresponding two-sample energy statistic by E^{(α)}_{n,m}.

Let A = {a₁, …, a_{n₁}} and B = {b₁, …, b_{n₂}} be two samples, and define

    g_α(A, B) := (1/(n₁n₂)) Σ_{i=1}^{n₁} Σ_{m=1}^{n₂} ‖a_i − b_m‖^α,   (1)

for 0 < α ≤ 2. If A₁, …, A_K are samples of sizes n₁, n₂, …, n_K, respectively, and N = Σ_{j=1}^{K} n_j, the within-sample dispersion statistic is

    W_α = W_α(A₁, …, A_K) = Σ_{j=1}^{K} (n_j/2) g_α(A_j, A_j),   (2)

and the total dispersion of the observed response is

    T_α = T_α(A₁, …, A_K) = (N/2) g_α(A, A),   (3)

where A is the pooled sample. The between-sample energy statistic is

    S_{n,α} = Σ_{1 ≤ j < k ≤ K} ((n_j + n_k)/(2N)) ((n_j n_k)/(n_j + n_k)) E^{(α)}_{n_j,n_k}(A_j, A_k),   (4)

and the total dispersion decomposes as T_α = S_{n,α} + W_α.
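The decomposition T_α = S_{n,α} + W_α can be checked numerically. The following Python/NumPy sketch (illustrative only, with its own function names) computes the three dispersion statistics of Eqs (1)–(4):

```python
import numpy as np

def _as_matrix(x):
    # treat a 1-D sample as a column of scalar observations
    x = np.asarray(x, dtype=float)
    return x[:, None] if x.ndim == 1 else x

def g_alpha(a, b, alpha=1.0):
    """g_alpha(A, B): average pairwise distance^alpha between two samples (Eq. 1)."""
    a, b = _as_matrix(a), _as_matrix(b)
    return (np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2) ** alpha).mean()

def disco(samples, alpha=1.0):
    """Return (S, W, T): between-sample, within-sample, and total dispersion."""
    samples = [_as_matrix(s) for s in samples]
    sizes = [len(s) for s in samples]
    N = sum(sizes)
    pooled = np.vstack(samples)
    # within-sample dispersion (Eq. 2) and total dispersion (Eq. 3)
    W = sum(n_j / 2 * g_alpha(A_j, A_j, alpha) for n_j, A_j in zip(sizes, samples))
    T = N / 2 * g_alpha(pooled, pooled, alpha)
    # between-sample statistic (Eq. 4), built from two-sample E-statistics
    S = 0.0
    for j in range(len(samples)):
        for k in range(j + 1, len(samples)):
            nj, nk = sizes[j], sizes[k]
            e = (2 * g_alpha(samples[j], samples[k], alpha)
                 - g_alpha(samples[j], samples[j], alpha)
                 - g_alpha(samples[k], samples[k], alpha))
            S += (nj + nk) / (2 * N) * (nj * nk) / (nj + nk) * e
    return S, W, T
```

For any 0 < α ≤ 2, the returned values satisfy T = S + W up to floating-point rounding.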


For every 0 < α < 2, the statistic Eq. (4) determines a consistent test of the multi-sample hypothesis of equal distributions.^{13} In the special case where all F_j are univariate distributions and α = 2, S_{n,2} is the ANOVA between-sample sum of squared error, and the decomposition T₂ = S₂ + W₂ is the ANOVA decomposition. The ANOVA test statistic measures differences in means, not distributions. However, if we apply α = 1 (Euclidean distance), or any 0 < α < 2 as the exponent on Euclidean distance, the corresponding energy test is consistent against all alternatives with finite α moments. If any of the underlying distributions may have non-finite first moment, a suitable choice of α extends the energy test to this situation.

Returning to the iris data example, we can easily apply the test for equality of the three species' distributions using a choice of methods; the relevant options here are method="discoB" or method="discoF". The first uses the between-sample statistic, and the second uses an 'F' ratio like ANOVA:

    F_{n,α} = (S_{n,α}/(K − 1)) / (W_α/(N − K)),

and the decomposition details are displayed in a table similar to an ANOVA table. Although it has the same form as an F statistic, it does not have an F distribution, and a permutation test is applied. Both methods are implemented as permutation tests.

One can obtain a table of pairwise energy statistics for any number of samples in one step using the edist function in the energy package. It can be used, for example, to display E-statistics for the result of a cluster analysis.

E-clustering

Energy distance has been applied in hierarchical cluster analysis.^{17} It generalizes the well-known Ward's minimum variance method in a similar way that DISCO generalizes ANOVA. The DISCO decomposition has recently been applied to generalize k-means clustering.^{18}

In an agglomerative hierarchical clustering algorithm, starting with single observations, at each step we merge the clusters that have minimum cluster distance. In the energy distance algorithm, the cluster distance is the two-sample energy statistic.

There is a general class of hierarchical clustering algorithms uniquely determined by their respective recursive formulas for updating all cluster distances following each merge of two clusters. One can show^{17} that the energy clustering algorithm is also a member of this class, and its recursive formula shows that it is formally similar to Ward's method.

Suppose that the disjoint clusters C_i, C_j are to be merged at the current step. If C_k is a disjoint cluster, then the new cluster distances can be computed by the following recursive formula:

    d(C_i ∪ C_j, C_k) = ((n_i + n_k)/(n_i + n_j + n_k)) d(C_i, C_k)
                      + ((n_j + n_k)/(n_i + n_j + n_k)) d(C_j, C_k)
                      − (n_k/(n_i + n_j + n_k)) d(C_i, C_j).   (5)
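The recursion can be verified numerically. In the Python/NumPy sketch below (illustrative only), the cluster distance is taken to be the size-weighted two-sample energy statistic e(C_i, C_j) = (n_i n_j/(n_i + n_j)) E_{n_i,n_j}(C_i, C_j), the 'e-distance' used in hierarchical E-clustering of Ref 17; with this weighting, the updating formula reproduces the directly computed distance of the merged cluster:

```python
import numpy as np

def energy_stat(x, y):
    """Two-sample E-statistic 2A - B - C (Euclidean distances)."""
    A = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).mean()
    B = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2).mean()
    C = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2).mean()
    return 2 * A - B - C

def e_dist(x, y):
    """Size-weighted cluster distance e(Ci, Cj) = ni*nj/(ni+nj) * E_{ni,nj}."""
    n, m = len(x), len(y)
    return n * m / (n + m) * energy_stat(x, y)

rng = np.random.default_rng(3)
Ci = rng.normal(size=(3, 2))
Cj = rng.normal(size=(4, 2)) + 2.0
Ck = rng.normal(size=(5, 2)) - 1.0
ni, nj, nk = len(Ci), len(Cj), len(Ck)

# distance of the merged cluster to Ck, computed directly ...
direct = e_dist(np.vstack((Ci, Cj)), Ck)
# ... and by the recursive updating formula
recursive = ((ni + nk) * e_dist(Ci, Ck) + (nj + nk) * e_dist(Cj, Ck)
             - nk * e_dist(Ci, Cj)) / (ni + nj + nk)
```

The two values agree to floating-point precision, which is what allows hclust-style implementations to update distances without revisiting the raw data.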


where d(C_i, C_j) = E_{n_i,n_j}(C_i, C_j), and n_i, n_j, n_k are the sizes of clusters C_i, C_j, C_k, respectively. Let d_{ij} := d(C_i, C_j). Then

    d_{(ij)k} := d(C_i ∪ C_j, C_k) = α_i d_{ik} + α_j d_{jk} + β d_{ij} + γ|d_{ik} − d_{jk}|,

where

    α_i = (n_i + n_k)/(n_i + n_j + n_k),  α_j = (n_j + n_k)/(n_i + n_j + n_k),
    β = −n_k/(n_i + n_j + n_k),  γ = 0.

If we substitute squared Euclidean distances for Euclidean distances in this recursive formula, keeping the same parameters (α_i, α_j, β, γ), then we obtain the updating formula for Ward's minimum variance method. However, we know that Ward's method (with exponent α = 2 on distances) is a geometrical method that separates clusters by their centers, not by their distributions. E-clustering generalizes Ward, because for every 0 < α < 2 the energy clustering algorithm separates clusters that differ in distribution. Overall, in simulations and real data examples,^{17} the characterization property of E is a clear advantage for certain clustering problems, without sacrificing the good properties of Ward's minimum variance method for separating spherical clusters.

The hclust hierarchical clustering function provided in R^{25} implements exactly the above recursive formula Eq. (5), and therefore one can use hclust to apply either the E-clustering solution or Ward's method by specifying method 'ward.D' (energy) or 'ward.D2' (Ward).

TESTING INDEPENDENCE

An important application for two samples applies to testing independence of random vectors. In this case, we test whether the joint distribution of X and Y is equal to the product of their marginal distributions. Interestingly, the statistics can be expressed in a product–moment expression involving the double-centered distance matrices of the X and Y samples. The statistics based on distances are analogous to, but more general than, product–moment covariance and correlation. This suggests the names distance covariance (dCov) and distance correlation (dCor), defined below.

Distance Covariance

The simplest formula for the distance covariance statistic is the square root of

    V²_n(X, Y) = (1/n²) Σ_{i,j=1}^{n} Â_{ij} B̂_{ij},

where  and B̂ are the double-centered distance matrices of the X sample and the Y sample, respectively, and the subscript ij denotes the entry in the i-th row and j-th column. The double-centered distance matrices are computed as in classical multidimensional scaling. Given a random sample (x, y) = {(x_i, y_i): i = 1, …, n} from the joint distribution of random vectors X in ℝ^p and Y in ℝ^q, compute the Euclidean distance matrices (a_{ij}) = (‖x_i − x_j‖) for the X sample and (b_{ij}) = (‖y_i − y_j‖) for the Y sample. The ij-th entry of  is

    Â_{ij} = a_{ij} − ā_{i·} − ā_{·j} + ā_{··},  i, j = 1, …, n,

where

    ā_{i·} = (1/n) Σ_{j=1}^{n} a_{ij},  ā_{·j} = (1/n) Σ_{i=1}^{n} a_{ij},  ā_{··} = (1/n²) Σ_{i,j=1}^{n} a_{ij}.

Similarly, the ij-th entry of B̂ is

    B̂_{ij} = b_{ij} − b̄_{i·} − b̄_{·j} + b̄_{··},  i, j = 1, …, n.

The sample distance variance is

    V²_n(X) = V²_n(X, X) = (1/n²) Σ_{i,j=1}^{n} Â²_{ij}.

The distance covariance statistic is always non-negative, and V²_n(X) = 0 only if all of the sample observations are identical (see Ref 28).

Distance Correlation

Distance correlation is the standardized distance covariance. We have defined the squared distance covariance statistic

    V²_n(X, Y) = (1/n²) Σ_{i,j=1}^{n} Â_{ij} B̂_{ij},

and the squared distance correlation is defined by


    R²_n(X, Y) = V²_n(X, Y)/√(V²_n(X) V²_n(Y)),  if V²_n(X) V²_n(Y) > 0;
    R²_n(X, Y) = 0,  if V²_n(X) V²_n(Y) = 0.

Distance correlation satisfies:

1. 0 ≤ R_n(X, Y) ≤ 1.
2. If R_n(X, Y) = 1, then there exist a vector a, a non-zero real number b, and an orthogonal matrix R such that Y = a + bXR, for the data matrices X and Y.

Remark 1. One could also define dCov as V²_n and dCor as R²_n rather than by their respective square roots. There are reasons to prefer each definition,^{28,29} but historically the above definitions were used. When we deal with unbiased statistics, we no longer have the non-negativity property, so we cannot take the square root and need to work with the square.

The empirical coefficients V_n(X, Y) and R_n(X, Y) converge almost surely to the population coefficients V(X, Y) and R(X, Y) as n → ∞. The distribution of Q_n = nV²_n(X, Y) converges to a quadratic form Σ_{i=1}^{∞} λ_i Z_i², where the Z_i are iid standard normal and the λ_i are non-negative coefficients that depend on the distributions of the underlying random variables. Values of Q_n near zero are consistent with the null hypothesis of independence, while large Q_n support the alternative. Thus a consistent test of multivariate independence is based on the distance covariance Q_n = nV²_n, and it can be implemented as a nonparametric permutation test. The dCov test applies to random vectors in arbitrary, not necessarily equal dimensions, for any sample size n ≥ 4. For high dimensional X and Y, there is also a distance correlation t-test of independence, introduced in Ref 30, which is applicable when dimension exceeds sample size.

For the permutation test in this case, the null distribution is sampled by permuting the indices of one of the two variables each time a sample is drawn. The replicates then provide a reference distribution for estimating the tail probability to the right of the observed test statistic Q_n.

Population Coefficients

Suppose that X ∈ ℝ^p, Y ∈ ℝ^q, E‖X‖ < ∞, and E‖Y‖ < ∞. The squared population distance covariance coefficient can be written in terms of expected distances:

    V²(X, Y) = E‖X − X′‖‖Y − Y′‖ + E‖X − X′‖ E‖Y − Y′‖ − 2E‖X − X′‖‖Y − Y″‖,

where (X, Y), (X′, Y′), and (X″, Y″) are iid.^{29} Here V²(X, Y) is an energy distance between the joint distribution of (X, Y) and the product of the marginal distributions of X and Y. We have a characterization of independence: V²(X, Y) ≥ 0, with equality to 0 if and only if X and Y are independent. Population distance correlation R(X, Y) is the square root of the standardized coefficient:

    R²(X, Y) = V²(X, Y)/√(V²(X) V²(Y)),  if V²(X) V²(Y) > 0;
    R²(X, Y) = 0,  if V²(X) V²(Y) = 0.

Remark 2. Note that the permutation test is only applicable if the observations are exchangeable under the null, which would not be true, e.g., for time series data.

The statistics and tests are implemented in the energy package for R.^{24,25} Functions dcor, dcov, dcov.test, and dcor.ttest compute the statistics and tests of independence. The following example uses the crabs data in the MASS package. After converting the binary factors to integers, we test if the two-dimensional categorical variable (species, sex) is independent of the vector of body measurements.
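For illustration, the double-centered statistics and the permutation dCov test can be sketched in Python with NumPy. The function names are this sketch's own; the energy package implementations in R/C differ:

```python
import numpy as np

def dcenter(x):
    """Double-centered Euclidean distance matrix, as in classical MDS."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()

def dcov2(x, y):
    """Squared sample distance covariance V_n^2(X, Y)."""
    return (dcenter(x) * dcenter(y)).mean()

def dcor2(x, y):
    """Squared sample distance correlation R_n^2(X, Y)."""
    denom2 = dcov2(x, x) * dcov2(y, y)
    return dcov2(x, y) / np.sqrt(denom2) if denom2 > 0 else 0.0

def dcov_test(x, y, R=199, seed=None):
    """Permutation test of independence based on Q_n = n * V_n^2(X, Y)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    q_obs = n * dcov2(x, y)
    count = 1  # the observed statistic counts as one replicate
    for _ in range(R):
        count += n * dcov2(x, y[rng.permutation(n)]) >= q_obs
    return q_obs, count / (R + 1)
```

Note that for a linear relation Y = a + bX, property 2 above gives R_n(X, Y) = 1, which provides a quick sanity check of the implementation.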


The data arguments to dcov.test can be data matrices or distance objects returned by the R dist function. Here the arguments are data matrices. The test is significant, and we reject independence. One may also want to compute the statistics using dcor or dcov. To recover all of the statistics from dCor at once, a utility function is provided.

An Unbiased Distance Covariance Statistic

The population distance covariance is zero under independence, but the dCov statistic is non-negative, hence its expected value is positive except in degenerate cases. Clearly the dCov statistic in its original formulation is biased for the population coefficient. The bias is in fact increasing with dimension.

An unbiased estimator of V²(X, Y) was given in Ref 30, and an equivalent unbiased statistic was given in Ref 31. Although the latter one looks simpler, the original may be faster to compute. The following is from Ref 31.

Let A = (a_{ij}) be a symmetric, real-valued n × n matrix with zero diagonal (not necessarily Euclidean distances). Instead of the classical method of double centering A, we introduce the U-centered matrix Ã. The (i, j)-th entry of à is

    Ã_{ij} = a_{ij} − (1/(n−2)) Σ_{l=1}^{n} a_{il} − (1/(n−2)) Σ_{k=1}^{n} a_{kj} + (1/((n−1)(n−2))) Σ_{k,l=1}^{n} a_{kl},  i ≠ j,

and Ã_{ii} = 0. Here 'U-centered' refers to the result that the corresponding squared distance covariance statistic is an unbiased estimator of the population coefficient. As the first step in computing the unbiased statistic, we replace the double centering operation with U-centering to obtain U-centered distance matrices à and B̃. Then

    (Ã · B̃) := (1/(n(n−3))) Σ_{i≠j} Ã_{ij} B̃_{ij}

is an unbiased estimator of the squared population distance covariance V²(X, Y). The inner product notation is due to the fact that this statistic is an inner product in the Hilbert space of U-centered distance matrices.^{31}

A bias-corrected R²_n is defined by normalizing the inner product statistic with the bias-corrected dVar statistics. The bias-corrected dCor statistic is implemented in the R energy package by the bcdcor function, which can be used to compare the biased and bias-corrected R²_n for the example above.

The above unbiased inner product dCov statistic is easy to compute, but since it can take negative values, we cannot define the bias-corrected statistic to be its square root. Thus we avoid the 'squared' notation and use the inner product operator, or V*_n and R*_n. One could have defined 'dCov' from the start to be the square of the energy distance. Historically, the rationale for choosing the square root definition is that in this case distance covariance is the energy distance between the joint distribution of the variables and the product of their marginals. A disadvantage is that the distance variance, rather than the distance standard deviation, is measured in the same units as the distances.

It is clear that both the sample dCov and the sample dCor can be computed in O(n²) steps. Recently Huo and Székely^{32} proved that for real-valued samples the unbiased estimator of the squared population distance covariance can be computed by an O(n log n) algorithm. The supplementary files to Ref 32 include an implementation in Matlab.

Distance Correlation for Dissimilarity Matrices

It is important to notice that à does not change if we add the same constant to all off-diagonal entries and U-center the result. In Ref 31, it is shown that the inner product version of dCov can be applied to any U-centered dissimilarity matrix (zero-diagonal symmetric matrix).
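U-centering and the inner product estimator are short to sketch in Python/NumPy (illustrative only; function names are this sketch's own). The sketch can also be used to confirm the invariance property just noted: adding a constant to all off-diagonal entries leaves à unchanged.

```python
import numpy as np

def ucenter(a):
    """U-centered version of a symmetric, zero-diagonal n x n matrix."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    row = a.sum(axis=1) / (n - 2)
    col = a.sum(axis=0) / (n - 2)
    grand = a.sum() / ((n - 1) * (n - 2))
    out = a - row[:, None] - col[None, :] + grand
    np.fill_diagonal(out, 0.0)  # the (i, i) entries are defined to be 0
    return out

def dist_matrix(x):
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

def inner_dcov2(x, y):
    """Unbiased estimator (A~ . B~) of the squared population dCov."""
    At = ucenter(dist_matrix(x))
    Bt = ucenter(dist_matrix(y))
    n = len(At)
    # diagonals are zero, so the full sum equals the sum over i != j
    return (At * Bt).sum() / (n * (n - 3))
```

Unlike V²_n, inner_dcov2 can return negative values for (nearly) independent data, which is why the bias-corrected statistics are not defined via square roots.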


The algorithm is outlined in Ref 31, and based on comparisons in the paper it is a strong competitor of the Mantel test for association between dissimilarity matrices. This makes dCov tests and energy statistics ready to apply to problems in, e.g., community ecology, where one must often work with data in the form of non-Euclidean dissimilarity matrices.

Partial Distance Correlation

Based on the inner product dCov statistic (the unbiased estimator of V²), theory is developed to define partial distance correlation analogous to (linear) partial correlation. There is a simple computing formula for the pdCor statistic, and there is a test for the hypothesis of zero pdCor based on the inner product. Energy statistics are defined for random vectors, so pdCor(X, Y; Z) is a scalar coefficient defined for random vectors X, Y, and Z in arbitrary dimension. The statistics and tests are described in detail in Ref 31 and currently implemented in an R package pdcor,^{33} which will become part of the energy package.

GOODNESS-OF-FIT

Goodness-of-fit is a one-sample problem, but there are two distributions to consider: one is the hypothesized distribution, and the other is the underlying distribution from which the observed sample has been drawn. Energy distance applies to compare these two distributions with a variation of the two-sample energy distance.

The energy distance for this problem must be the same as E(X, Y), where one of the variables now represents the unknown sampled distribution. Suppose that a random sample x₁, …, x_n is observed, and the problem is to test whether the sampled distribution F_X is equal to the hypothesized distribution F₀. The energy goodness-of-fit statistic is

    E_n = n [ (2/n) Σ_{i=1}^{n} E‖x_i − X‖^α − E‖X − X′‖^α − (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ‖x_i − x_j‖^α ],   (6)

where X and X′ are iid with the hypothesized distribution F₀, and 0 < α < 2. The statistic is defined in arbitrary dimension and is not restricted by sample size. The only required condition is that ‖X‖ has finite α moment under the null hypothesis. Under the null hypothesis, E E_n = E‖X − X′‖^α, and the asymptotic distribution of E_n is a quadratic form of centered Gaussian random variables. The rejection region is in the upper tail. Under an alternative hypothesis, E_n tends to infinity stochastically, and therefore E_n determines a consistent goodness-of-fit test.

For most applications the exponent α = 1 (Euclidean distance) can be applied, but smaller exponents have been applied for testing distributions with heavy tails, including Pareto, Cauchy, and stable distributions.^{14,15} The important special case of testing multivariate normality^{8,9} is fully implemented in the energy^{24} package for R. A detailed introduction to the energy goodness-of-fit tests with several simple examples can be found in Ref 3. Here we will focus on the important application of testing for multivariate normality, starting with the special case of univariate normality.

The energy statistic for testing whether a sample X₁, …, X_n is from a multivariate normal distribution N(μ, Σ) is developed by Székely and Rizzo.^9 Let x₁, …, x_n denote an observed random sample.

Univariate Normality

For a test of univariate normality, we apply the statistic Eq. (6). Suppose that the null hypothesis is that the sample has a normal distribution with mean μ and variance σ². Then it can be derived that

    E‖x_i − X‖ = 2(x_i − μ)F(x_i) + 2σ²f(x_i) − (x_i − μ),

    E‖X − X′‖ = 2σ/√π,

where F and f denote the cdf and density of the hypothesized Normal(μ, σ²) distribution. The last sum in the statistic E_n can be linearized in terms of the ordered sample, which allows computation in O(n log n) time.

Generally, the parameters μ and σ are unknown. In that case we first standardize the sample using the sample mean and the sample standard deviation, then test the fit to the standard normal distribution. The estimated parameters change the critical values of the distribution but not its general shape, and the rejection region is in the upper tail. See Ref 8 for a detailed proof that the test with estimated parameters is statistically consistent (also for the multivariate case) against all alternatives with finite moments.

Multivariate Normality

The statistic E_n for testing multivariate normality is considerably more difficult to derive. We first standardize the sample using the sample mean vector and the sample covariance matrix to estimate the parameters. Then we test the sample for fit to the standard multivariate normal distribution. If Z and Z′ are iid standard normal in dimension d, we have


    E‖Z − Z′‖_d = √2 E‖Z‖_d = 2 Γ((d + 1)/2)/Γ(d/2),

where Γ(·) is the complete gamma function. If y₁, …, y_n are the standardized sample elements, the computing formula for the test of multivariate normality in ℝ^d is

    E_{n,d} = n [ (2/n) Σ_{j=1}^{n} E‖y_j − Z‖_d − 2 Γ((d + 1)/2)/Γ(d/2) − (1/n²) Σ_{j,k=1}^{n} ‖y_j − y_k‖_d ],

where

    E‖a − Z‖_d = √2 Γ((d + 1)/2)/Γ(d/2)
               + √(2/π) Σ_{k=0}^{∞} ((−1)^k/(k! 2^k)) (‖a‖^{2k+2}/((2k + 1)(2k + 2))) (Γ((d + 1)/2) Γ(k + 3/2)/Γ(k + d/2 + 1)).

The expression for E‖a − Z‖_d follows from the fact that if Z is a d-variate standard normal random vector, then ‖a − Z‖²_d has a noncentral chi-square distribution χ²[ν; λ] with noncentrality parameter λ = |a|²/2 and degrees of freedom ν = d + 2ψ, where ψ is a Poisson random variable with mean λ. Typically the sum in E‖a − Z‖_d converges to within a small tolerance after 40–60 terms, except when a is a true outlier of the standard multivariate normal distribution (when ‖a‖ is very large). However, E‖a − Z‖ converges to ‖a‖ as ‖a‖ → ∞, so we can evaluate E‖a − Z‖ ≈ ‖a‖ in that case. See the source code in 'energy.c' of the energy package^{24} for an implementation.

Under the null hypothesis, E_{n,d} converges in distribution to a quadratic form Q_d = Σ_{i=1}^{∞} λ_i Z_i² as n → ∞, where the Z_i are iid standard normal random variables and the λ_i are non-negative constants that depend on the parameters of the null distribution.

The test is implemented for the multivariate normal distribution in the energy package^{24} with the mvnorm.etest function, as follows. If the observed sample size is n, the sample is standardized using the sample mean vector and sample covariance, and the observed test statistic E_{n,d} is computed. A large number M of standard multivariate normal samples of size n are generated, j = 1, …, M, each standardized using the j-th sample mean vector and sample covariance, and E^{(j)}_{n,d} is computed for each of these samples to obtain a reference distribution. An estimated p-value is obtained by finding the proportion of replicates E^{(j)}_{n,d} that exceed the observed E_{n,d} statistic.

An example illustrating the test of normality for the four-dimensional iris setosa data (included in the R distribution) follows. In this example, the hypothesis of normality is rejected at significance level 0.05. Figure 1 displays the replicates for testing the iris data, with the observed E_{n,d} identified by the large black dot. The density curve overlaid on the plot is an approximation (n = 50) of the density of the asymptotic distribution Q_d.

FIGURE 1 | Replicates generated under the null hypothesis in a test of multivariate normality for the iris setosa data. The test statistic E_{n,d} of the observed iris sample is located by the black dot.
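The series and the computing formula above can be sketched in Python (an illustration with its own function names, not the package's C implementation). At d = 1 the series can be checked against the univariate closed form E|a − Z| = 2aΦ(a) + 2φ(a) − a:

```python
import math
import numpy as np

def e_dist_gauss(a_norm, d, terms=60):
    """E||a - Z||_d for d-variate standard normal Z, summed from the series;
    a_norm = ||a||. (For very large ||a||, use E||a - Z|| ~ ||a|| instead.)"""
    g = math.gamma
    total = math.sqrt(2.0) * g((d + 1) / 2) / g(d / 2)  # the a = 0 value, E||Z||_d
    coef = math.sqrt(2.0 / math.pi)
    for k in range(terms):
        total += coef * ((-1.0) ** k / (math.factorial(k) * 2.0 ** k)
                         * a_norm ** (2 * k + 2) / ((2 * k + 1) * (2 * k + 2))
                         * g((d + 1) / 2) * g(k + 1.5) / g(k + d / 2 + 1))
    return total

def energy_mvnorm_stat(y):
    """E_{n,d} for an (n, d) sample y, assumed already standardized."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    norms = np.linalg.norm(y, axis=1)
    mean_ez = np.mean([e_dist_gauss(r, d) for r in norms])
    e_zz = 2.0 * math.gamma((d + 1) / 2) / math.gamma(d / 2)  # E||Z - Z'||_d
    mean_pair = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2).mean()
    return n * (2.0 * mean_ez - e_zz - mean_pair)
```

In practice the sample is first standardized with the sample mean vector and covariance, as described above; this sketch uses the O(n²) pairwise sum rather than any faster linearization.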


The energy goodness-of-fit test could alternately be implemented by evaluating the constants λi, but that is a difficult problem except in special cases. Some analytical results for univariate normality are given in Ref 3, and a numerical approach is investigated in Ref 14. See also Ref 8 for tabulated critical values of nÊn,d for several (n, d) obtained by large-scale simulation.

The energy test of multivariate normality is practical to apply via parametric bootstrap, as illustrated above, for arbitrary dimension, and sample size n < d is not a problem. Monte Carlo power comparisons9 suggest that the energy test is a powerful competitor to other tests of multivariate normality. Indeed, there are very few other tests in the literature for multivariate normality that, like energy, are consistent, powerful, omnibus tests with practical implementation; the BHEP tests,34,35 which also apply a characterization of equality between distributions, share these properties and have recently been implemented in an R package MVN. See Ref 9 for comparisons.

GENERALIZATIONS

One can further generalize energy distance to probability distributions on metric spaces. Consider a metric space (M, d) with Borel sigma algebra B(M), so that (M, B(M)) is a measurable space. Let P(M) denote the set of probability measures on (M, B(M)). Then for any pair of probability measures μ and ν in P(M), we can define the energy distance in terms of the associated random variables X and Y and the metric d as the square root of

D²(μ, ν) = 2E[d(X, Y)] − E[d(X, X′)] − E[d(Y, Y′)],

provided that D²(μ, ν) ≥ 0. However, in general D²(μ, ν) can be negative. In order that D be a metric, it is necessary and sufficient that (M, d) be strongly negative definite (see Lyons20). When (M, d) is strongly negative definite, the energy distance D(μ, ν) equals zero if and only if the distributions are equal. A commonly applied metric that is negative definite but not strongly negative definite is the taxicab metric in ℝ². Lyons20 showed that all separable Hilbert spaces (and in particular Euclidean spaces) have strong negative type.

One might wonder what makes the functions |·|^α, 0 < α < 2, special in the definition above. One can show that the key property is that |·|^α, 0 < α < 2, is strongly negative definite (see Lyons20). In this case, the generalized distance function remains a metric for measuring the distance between probability distributions F and G: it is nonnegative and equals zero if and only if F = G. What makes the distance function special in the class of strongly negative definite functions is that it is scale equivariant. If we change the scale by replacing x by cx and y by cy, then the squared distance D²(F, G) is multiplied by c, and therefore the ratio of two such functions does not depend on the constant c.

This invariance property also holds for all 0 < α < 2, because if we change the measurement units in the α-generalized distance D²(α)(F, G), replacing X by cX and Y by cY, then D²(α)(F, G) is multiplied by c^α, and the ratio of two statistics of this type is again invariant with respect to c. Hence statistical decisions based on these ratios do not depend on the choice of measurement units. This invariance property is essential.

CONCLUSION

Energy distance is a powerful tool for multivariate analysis. It applies to random vectors in arbitrary dimensions, and the methodology requires only the mild assumption of finite first moments (or finite moments of some positive order α). Computing formulas are simple, and the tests have been implemented by nonparametric methods using resampling or Monte Carlo methods. We have illustrated the use of the functions in the energy package24 for several of the methods. The package is open source and distributed under a general public license. To scale up to big data problems, such as cluster analysis, one could apply a 'divide and recombine' (D&R) analysis.

Readers may also refer to several interesting applications in a variety of disciplines, under Further Reading. Our review of the background and applications of energy distance is not an exhaustive bibliography, but is intended as a starting point.
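The sample version of the formulas above replaces each expectation with a mean of pairwise distances. As a rough illustration (a minimal Python sketch for univariate samples; the function name is ours, not the authors' energy package in R), the following computes the V-statistic estimate of the squared energy distance with the α-generalized distance |x − y|^α, and checks the scale-equivariance property discussed above:

```python
# Sketch (not the energy package): sample version of the squared energy
# distance D^2(F, G) = 2 E d(X, Y) - E d(X, X') - E d(Y, Y'), here for
# univariate samples with the alpha-generalized metric d(s, t) = |s - t|**alpha.

def energy_distance_sq(x, y, alpha=1.0):
    """V-statistic estimate of D^2(F, G) from samples x and y."""
    def mean_dist(a, b):
        return sum(abs(s - t) ** alpha for s in a for t in b) / (len(a) * len(b))
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

x = [0.1, 0.5, 0.9, 1.3]
y = [2.0, 2.4, 2.8]
d2 = energy_distance_sq(x, y)

# Scale equivariance: replacing x -> cx, y -> cy multiplies the statistic by
# c**alpha, so ratios of such statistics do not depend on measurement units.
c = 10.0
d2_scaled = energy_distance_sq([c * s for s in x], [c * t for t in y])
assert abs(d2_scaled - c ** 1.0 * d2) < 1e-9
```

The statistic is zero when the two samples coincide, consistent with D(F, G) = 0 if and only if F = G in the population version.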

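The parametric bootstrap implementation of the energy goodness-of-fit test can likewise be sketched in a few lines. This is a hedged Python illustration under simplifying assumptions (univariate standard normal null, simple hypothesis, our own function names; the authors' implementation is the energy package in R). It uses the closed-form expectations E|t − Z| = 2φ(t) + t(2Φ(t) − 1) and E|Z − Z′| = 2/√π for Z standard normal:

```python
# Hedged sketch of an energy goodness-of-fit test of standard normality
# with a parametric-bootstrap p-value (not the authors' implementation).
import math
import random

def phi(t):   # standard normal density
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def energy_gof_stat(x):
    """n * E_n = n * (2/n sum_i E|x_i - Z| - E|Z - Z'| - 1/n^2 sum_{i,j} |x_i - x_j|)."""
    n = len(x)
    a = sum(2.0 * phi(t) + t * (2.0 * Phi(t) - 1.0) for t in x) / n
    b = 2.0 / math.sqrt(math.pi)
    c = sum(abs(s - t) for s in x for t in x) / (n * n)
    return n * (2.0 * a - b - c)

def energy_gof_pvalue(x, reps=200, seed=1):
    """Parametric bootstrap: generate samples from the null distribution and
    compare their statistics with the observed statistic."""
    rng = random.Random(seed)
    observed = energy_gof_stat(x)
    hits = sum(
        energy_gof_stat([rng.gauss(0.0, 1.0) for _ in x]) >= observed
        for _ in range(reps)
    )
    return (1 + hits) / (1 + reps)
```

In the composite case with estimated parameters, one would standardize the sample and re-estimate the parameters in each bootstrap replicate; the multivariate test follows the same bootstrap pattern with Euclidean norms.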
FURTHER READING

Dueck J, Edelmann D, Gneiting T, Richards D. The affinely invariant distance correlation. Bernoulli 2014, 20:2305–2330.
Feuerverger A. A consistent test for bivariate dependence. Int Stat Rev 1993, 61:419–433.

36 © 2015 Wiley Periodicals, Inc. Volume 8, January/February 2016

Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007, 102:359–378. doi:10.1198/016214506000001437.
Gretton A. A simpler condition for consistency of a kernel independence test, 2015. Available at: http://arxiv.org/abs/1501.06103v1.
Gretton A, Györfi L. Consistent nonparametric tests of independence. J Mach Learn Res 2010, 11:1391–1423.
Kim AY, Marzban C, Percival DB, Stuetzle W. Using labeled data to evaluate change detectors in a multivariate streaming environment. Signal Process 2009, 89/12:2529–2536. doi:10.1016/j.sigpro.2009.04.011.
Kong J, Klein BEK, Klein R, Lee KE, Wahba G. Using distance correlation and SS-ANOVA to assess associations of familial relationships, lifestyle factors, diseases, and mortality. Proc Natl Acad Sci USA 2012, 109:20352–20357. doi:10.1073/pnas.1217269109.
Kong J, Wang S, Wahba G. Using distance covariance for improved variable selection with application to learning genetic risk models. Stat Med 2015, 34/10:1097–1258. doi:10.1002/sim.6441.
Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc 2012, 107:1129–1139. doi:10.1080/01621459.2012.695654.
Martinez-Gomez E, Richards MT, Richards DSP. Distance correlation methods for discovering associations in large astrophysical databases. Astrophys J 2014, 781:39.
Menshenin DD, Zubkov AM. Properties of the Szekely-Mori symmetry criterion statistics in the case of binary vectors. Math Notes 2012, 91:62–72.
Székely GJ, Móri TF. A characteristic measure of asymmetry and its application for testing diagonal symmetry. Commun Stat Theory Methods 2001, 30:1633–1639.
Székely GJ, Rizzo ML. On the uniqueness of distance covariance. Stat Probab Lett 2012, 82:2278–2282. doi:10.1016/j.spl.2012.08.007.
Wahba G. Positive definite functions, reproducing kernel Hilbert spaces and all that. The Fisher Lecture at JSM 2014, 2014. Available at: http://www.stat.wisc.edu/wahba/talks1/fisher.14/wahba.fisher.7.11.pdf.
Székely GJ, Bakirov NK. Extremal probabilities for Gaussian quadratic forms. Probab Theory Relat Fields 2003, 126:184–202.
Varin T, Bureau R, Mueller C, Willett P. Clustering files of chemical structures using the Szekely-Rizzo generalization of Ward's method. J Mol Graphics Modell 2009, 28/2:187–195. doi:10.1016/j.jmgm.2009.06.006.
Zhou Z. Measuring nonlinear dependence in time-series, a distance correlation approach. J Time Ser Anal 2012, 33:438–457.

REFERENCES

1. Székely GJ. Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University), 1989.
2. Székely GJ. E-statistics: The Energy of Statistical Samples, Technical Report, Bowling Green State University, Department of Mathematics and Statistics No. 02–16, and Technical Reports by the same title from 2000–2003, e.g. No. 03-05, and NSA grant # MDA 904-02-1-s0091 (2000–2002), 2002.
3. Székely GJ, Rizzo ML. Energy statistics: statistics based on distances. J Stat Plann Infer 2013, 143:1249–1272. doi:10.1016/j.jspi.2013.03.018.
4. Cramér H. On the composition of elementary errors. Skand Aktuar 1928, 11:141–180.
5. Zinger AA, Kakosyan AV, Klebanov LB. Characterization of distributions by means of mean values of some statistics in connection with some probability metrics. In: Stability Problems for Stochastic Models. Moscow: VNIISI, 1989, 47–55 (in Russian). English translation: A characterization of distributions by mean values of statistics and certain probabilistic metrics. J Soviet Math (1992).
6. Mattner L. Strict negative definiteness of integrals via complete monotonicity of derivatives. Trans Am Math Soc 1997, 349:3321–3342.
7. Morgenstern D. Proof of a conjecture by Walter Deubner concerning the distance between points of two types in Rd. Discrete Math 2001, 226:347–349.
8. Rizzo ML. A new rotation invariant goodness-of-fit test, PhD dissertation, Bowling Green State University, 2002.


9. Székely GJ, Rizzo ML. A new test for multivariate normality. J Multivar Anal 2005, 93:58–80.
10. Baringhaus L, Franz C. On a new multivariate two-sample test. J Multivar Anal 2004, 88:190–206.
11. Rizzo ML. A test of homogeneity for two multivariate populations. In: 2002 Proceedings of the American Statistical Association, Physical and Engineering Sciences Section. Alexandria, VA: American Statistical Association, 2003.
12. Székely GJ, Rizzo ML. Testing for equal distributions in high dimension. InterStat 2004, Nov.
13. Rizzo ML, Székely GJ. DISCO analysis: a nonparametric extension of analysis of variance. Ann Appl Stat 2010, 4:1034–1055.
14. Yang G. The energy goodness-of-fit test for univariate stable distributions. PhD Thesis, Bowling Green State University, 2012.
15. Rizzo ML. New goodness-of-fit tests for Pareto distributions. ASTIN Bull 2009, 39:691–715.
16. Li Y. Goodness-of-fit tests for Dirichlet distributions with applications. PhD Thesis, Bowling Green State University, 2015.
17. Székely GJ, Rizzo ML. Hierarchical clustering via joint between-within distances: extending Ward's minimum variance method. J Classif 2005, 22:151–183.
18. Li S. k-groups: a generalization of k-means by energy distance. PhD Thesis, Bowling Green State University, 2015.
19. Baringhaus L, Franz C. Rigid motion invariant two-sample tests. Stat Sin 2010, 20:1333–1361.
20. Lyons R. Distance covariance in metric spaces. Ann Probab 2013, 41:3284–3305.
21. Sejdinovic D, Gretton A, Sriperumbudur B, Fukumizu K. Hypothesis testing using pairwise distances and associated kernels. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Available at: http://arxiv.org/abs/1205.0411v2.
22. Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 2013, 41:2263–2291.
23. Klebanov LB. N-distances and Their Applications. Charles University, Prague: Karolinum Press; 2005.
24. Rizzo ML, Székely GJ. Energy: E-statistics (energy statistics). R package version 1.6.2, 2014. Available at: http://CRAN.R-project.org/package=energy.
25. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. Available at: http://www.R-project.org/.
26. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC; 1993.
27. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997.
28. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing independence by correlation of distances. Ann Stat 2007, 35:2769–2794. doi:10.1214/009053607000000505.
29. Székely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat 2009, 3:1236–1265. doi:10.1214/09-AOAS312.
30. Székely GJ, Rizzo ML. The distance correlation t-test of independence in high dimension. J Multivar Anal 2013, 117:193–213. doi:10.1016/j.jmva.2013.02.012.
31. Székely GJ, Rizzo ML. Partial distance correlation with methods for dissimilarities. Ann Stat 2014, 42:2382–2412.
32. Huo X, Székely G. Fast computing for distance covariance. Technometrics 2015. doi:10.1080/00401706.2015.1054435.
33. Rizzo ML, Székely GJ. pdcor: Partial distance correlation. R package version 1.0.0, 2014.
34. Henze N, Zirkler B. A class of invariant consistent tests for multivariate normality. Commun Stat Theory Methods 1990, 19:3595–3618.
35. Henze N, Wagner T. A new approach to the BHEP tests for multivariate normality. J Multivar Anal 1997, 62:1–23.
