Arxiv:2002.01615V3 [Stat.ML] 10 Feb 2021 Yt University Kyoto Eli Aiu Xeietlstig Ta at Settings Experimental Various in Well Itself

Total Page:16

File Type:pdf, Size:1020Kb

Arxiv:2002.01615V3 [Stat.ML] 10 Feb 2021 Yt University Kyoto Eli Aiu Xeietlstig Ta at Settings Experimental Various in Well Itself Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces Ryoma Sato Marco Cuturi Makoto Yamada Hisashi Kashima Kyoto University CREST-ENSAE Kyoto University Kyoto University RIKEN AIP Google Brain RIKEN AIP RIKEN AIP Abstract 1 Introduction Comparing two probability measures sup- Wasserstein distances have proved useful to compare ported on heterogeneous spaces is an two probability distributions when they are both increasingly important problem in machine supported in the same metric space. This is ex- learning. Such problems arise when compar- emplified by its several applications, notably in im- ing for instance two populations of biological age (Ni et al., 2009; Rabin et al., 2011; De Goes et al., cells, each described with its own set of 2012; Schmitz et al., 2018) and natural language pro- features, or when looking at families of cessing (Kusner et al., 2015; Rolet et al., 2016), neu- word embeddings trained across different roimaging (Janati et al., 2020), single-cell biology corpora/languages. For such settings, the (Schiebinger et al., 2019; Yang et al., 2020), and when Gromov Wasserstein (GW) distance is training generative models (Arjovsky et al., 2017; often presented as the gold standard. GW Salimans et al., 2018; Genevay et al., 2018). Yet, is intuitive, as it quantifies whether one when the two distributions live in two different and measure can be isomorphically mapped to seemingly unrelated spaces, simple Wasserstein dis- the other. However, its exact computation tances fail and one must to resort to the more in- is intractable, and most algorithms that volved framework of Gromov-Wasserstein (GW) dis- claim to approximate it remain expensive. tances (Mémoli, 2011). Building on Mémoli (2011), who proposed Generality of GW. The GW distance replaces the to represent each point in each distribution linear objective appearing in optimal transport (OT) as the 1D distribution of its distances to all with a quadratic function of the transportation plan other points, we introduce in this paper the that quantifies some form of metric distortion when Anchor Energy (AE) and Anchor Wasser- transporting points from one space to another. GW stein (AW) distances, which are respectively is not only an elegant answer to this problem, but the energy and Wasserstein distances in- it is also well grounded in theory (Sturm, 2012) arXiv:2002.01615v3 [stat.ML] 10 Feb 2021 stantiated on such representations. Our and has been successfully applied to shape matching main contribution is to propose a sweep (Mémoli, 2007; Solomon et al., 2016), machine trans- line algorithm to compute AE exactly in lation (Alvarez-Melis and Jaakkola, 2018; Grave et al., log-quadratic time, where a naive implemen- 2019), and graph matching (Xu et al., 2019b,a). The tation would be cubic. This is quasi-linear GW distance compares two distributions supported on w.r.t. the description of the problem itself. spaces that are endowed with a metric, or more gen- Our second contribution is the proposal of erally, a cost structure. This “(distribution, metric)” robust variants of AE and AW that uses pair is called a measured metric spaces (MMS), which rank statistics rather than the original dis- reduces, in a discrete setting to a probability vector of tances. We show that AE and AW perform size n paired with an n × n cost matrix. well in various experimental settings at a fraction of the computational cost of popular GW approximations. Although well grounded in GW approximations. Code is available at theory, the GW geometry is not tractable computa- https://github.com/joisino/anchor-energy. tionally: since its exact computation requires solv- ing an NP-hard quadratic assignment problem (QAP) (Mémoli, 2007), practical applications rely on approx- imation schemes to approach GW. Aflalo et al. (2015) Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces relaxed the problem into a convex quadratic program, We also consider the entropic regularized Wasserstein while Kezurer et al. (2015) relaxed it to a semidefinite distance of the distributions of anchor 1D distributions, program. Although tractable for small n, these relax- an call this variant the Anchor Wasserstein (AW) dis- ations have scalability issues that prevent their use be- tance. Our second contribution is to introduce a rank- yond a few hundred points. Solomon et al. (2016) and based approach in which raw distance values in a MMS Peyré et al. (2016) proposed to modify the QAP using are replaced by their ranks, relative to all pairwise entropic regularization. When comparing two MMSs distances described in the MMS. We show that the supported on n points, this results in an algorithm AE and AW distances improve on entropic GW in 3D that alternates between a local linearization of the GW shape comparison and node assignment tasks, while be- objective (requiring two matrix multiplications at a ing order of magnitudes faster and applicable to larger 2n3 cost) followed by Sinkhorn iterations (for a O(n2) scales. cost). This combination of outer/inner loops can prove too costly for large n, and can suffer from numerical 2 Optimal Transport and Anchors instability due to the fact that the scale of the cost matrix can change at each linearization, making the Sinkhorn kernel application numerically unstable. To We call any measure µ on a measurable space X R bring down computational costs further, Vayer et al. endowed with a cost function X×X → a mea- (2019) proposed to restrict the GW problem to mea- sured metric space (MMS). Equivalently, a discrete sures supported on Euclidean spaces, and to general- measured metric set (MMSet) of size n is a pair Rn×n ize the slicing approach (Rabin et al., 2011) to obtain S = (a, C) ∈ Σn × , where, for a given n ≥ 1, a Rn a a cheaper O(n log n) computational price, pending ad- Σn = { ∈ + | i i = 1} is the n-probability sim- plex. Intuitively an MMSet is a probability vector a, ditional efforts to rotate/register point clouds. This P lower complexity if obtained by placing restrictions on putting mass on points seen exclusively from the lens the data, since it cannot handle non-Euclidean data or of an n × n pairwise cost matrix C. Such pairs of costs (graphs, geodesic distances of shapes, and more weights/cost matrices typically arise when describing general arbitrary kernels and costs). a weighted point cloud, along with the shortest path distance matrix induced from a graph on that cloud, or From Points to Distributions of Costs. An ef- more generally any other cost function (Solomon et al., fective way to compare two points living in different 2016; Peyré et al., 2016). We assume no knowledge on MMSs could be achieved by giving them a common the points that constitute an MMSet, and only work feature representation. Such a representation can be from C (the SGW framework assumes that C is a obtained by mapping each point from an MMS onto squared-Euclidean distance matrix). Before describ- the 1D distribution of all its distances/costs to all ing the GW distance between two MMSets, we review other points in its MMS (hence the characterization the standard Wasserstein distance and quadratic as- of such points as “anchors”). This approach was in- signments. troduced in computer graphics and vision to compare 3D shapes (Osada et al., 2002; Hamza and Krim, 2003; Wasserstein Distance. Given two measurable R Gelfand et al., 2005; Manay et al., 2006) and linked spaces X and Y, a cost function c: X×Y→ , and with the GW problem by Mémoli (2011). As a re- two measures µ ∈ P(X ) and ν ∈ P(Y), the optimal sult, two points from homogeneous sources can be transport problem between µ and ν can be written as compared through their respective cost distributions E R OT(µ,ν) = min (X,Y )∼γ[c(X, Y )], in P( ). Consequently, MMSs can be represented as γ∈Π(µ,ν) elements of P(P(R)) and compared with a suitable metric on that space. where Π(µ,ν) is the set of joint couplings on X × Y Contributions. We focus first on the energy distance that have marginals µ,ν. When X = Y and the cost c (Székely and Rizzo, 2013; Sejdinovic et al., 2013) com- is a metric on X raised to the power p (with p ≥ 1), the 1/p puted between measures in P(P(R)), using the 1- optimum OT(µ,ν) is called the p-Wasserstein met- Wasserstein distance as a negative definite kernel to ric between µ and ν. We consider next two subcases compare atoms in P(R) (Wasserstein distances are that are relevant for the remainder of this paper. R 1 2 indeed n.d. in P( ), as remarked by Kolouri et al. When a , a ∈ Σn (i.e., discrete), a cost matrix C ∈ (2016)). We call this approach the Anchor Energy Rn×n suffices to instantiate the OT problem: (AE) distance, and our first contribution is to propose an efficient approach to compute it in O(n2 log n) time, 1 2 OT(a , a ) = min Pij Cij , where P∈U(a1,a2) resulting in quasi-linear complexity w.r.t. the input ij 2 X size (an MMS is described with an n cost matrix). 1 2 n×m 1 ⊤ 2 ½ R ½ U(a , a )= {P ∈ + | P m = a , P n = a }, Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima is the transportation polytope and ½m is the m-vector distances as SGW, but without relying on a projection of ones. When µ,ν ∈ P(R) and c(x, y) = |x − y|p, step. p ≥ 1, one has that (Santambrogio, 2015, §2) Points as Anchors. To compare two MMSets, we 1 map each point from each MMSet to the distribution −1 −1 p OTp(µ,ν)= |Hµ (u) − Hν (u)| du, of its (weighted) ground cost or distance to all other Z0 points within that MMSet.
Recommended publications
  • A Modified Energy Statistic for Unsupervised Anomaly Detection
    A Modified Energy Statistic for Unsupervised Anomaly Detection Rupam Mukherjee GE Research, Bangalore, Karnataka, 560066, India [email protected], [email protected] ABSTRACT value. Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sud- For prognostics in industrial applications, the degree of an- den unplanned failures. Although as is captured in (Goldstein omaly of a test point from a baseline cluster is estimated using & Uchida, 2016), anomaly may be both supervised and un- a statistical distance metric. Among different statistical dis- supervised in nature depending on the availability of labelled tance metrics, energy distance is an interesting concept based ground-truth data, in this paper anomaly detection is consid- on Newton’s Law of Gravitation, promising simpler comput- ered to be the unsupervised version which is more convenient ation than classical distance metrics. In this paper, we re- and more practical in many industrial systems. view the state of the art formulations of energy distance and point out several reasons why they are not directly applica- Choice of the statistical distance formulation has an important ble to the anomaly-detection problem. Thereby, we propose impact on the effectiveness of RUL estimation. In literature, a new energy-based metric called the -statistic which ad- statistical distances are described to be of two types as fol- P dresses these issues, is applicable to anomaly detection and lows. retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set. 1. Divergence measures: These estimate the distance (or, similarity) between probability distributions.
    [Show full text]
  • Package 'Energy'
    Package ‘energy’ May 27, 2018 Title E-Statistics: Multivariate Inference via the Energy of Data Version 1.7-4 Date 2018-05-27 Author Maria L. Rizzo and Gabor J. Szekely Description E-statistics (energy) tests and statistics for multivariate and univariate inference, including distance correlation, one-sample, two-sample, and multi-sample tests for comparing multivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, clustering based on energy distance, testing for multivariate normality, distance components (disco) for non-parametric analysis of structured data, and other energy statistics/methods are implemented. Maintainer Maria Rizzo <[email protected]> Imports Rcpp (>= 0.12.6), stats, boot LinkingTo Rcpp Suggests MASS URL https://github.com/mariarizzo/energy License GPL (>= 2) NeedsCompilation yes Repository CRAN Date/Publication 2018-05-27 21:03:00 UTC R topics documented: energy-package . .2 centering distance matrices . .3 dcor.ttest . .4 dcov.test . .6 dcovU_stats . .8 disco . .9 distance correlation . 12 edist . 14 1 2 energy-package energy.hclust . 16 eqdist.etest . 19 indep.etest . 21 indep.test . 23 mvI.test . 25 mvnorm.etest . 27 pdcor . 28 poisson.mtest . 30 Unbiased distance covariance . 31 U_product . 32 Index 34 energy-package E-statistics: Multivariate Inference via the Energy of Data Description Description: E-statistics (energy) tests and statistics for multivariate and univariate inference, in- cluding distance correlation, one-sample, two-sample, and multi-sample tests for comparing mul- tivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, clustering based on energy distance, testing for multivariate normality, distance components (disco) for non- parametric analysis of structured data, and other energy statistics/methods are implemented.
    [Show full text]
  • Independence Measures
    INDEPENDENCE MEASURES Beatriz Bueno Larraz M´asteren Investigaci´one Innovaci´onen Tecnolog´ıasde la Informaci´ony las Comunicaciones. Escuela Polit´ecnica Superior. M´asteren Matem´aticasy Aplicaciones. Facultad de Ciencias. UNIVERSIDAD AUTONOMA´ DE MADRID 09/03/2015 Advisors: Alberto Su´arezGonz´alez Jo´seRam´onBerrendero D´ıaz ii Acknowledgements This work would not have been possible without the knowledge acquired during both Master degrees. Their subjects have provided me essential notions to carry out this study. The fulfilment of this Master's thesis is the result of the guidances, suggestions and encour- agement of professors D. Jos´eRam´onBerrendero D´ıazand D. Alberto Su´arezGonzalez. They have guided me throughout these months with an open and generous spirit. They have showed an excellent willingness facing the doubts that arose me, and have provided valuable observa- tions for this research. I would like to thank them very much for the opportunity of collaborate with them in this project and for initiating me into research. I would also like to express my gratitude to the postgraduate studies committees of both faculties, specially to the Master's degrees' coordinators. All of them have concerned about my situation due to the change of the normative, and have answered all the questions that I have had. I shall not want to forget, of course, of my family and friends, who have supported and encouraged me during all this time. iii iv Contents 1 Introduction 1 2 Reproducing Kernel Hilbert Spaces (RKHS)3 2.1 Definitions and principal properties..........................3 2.2 Characterizing reproducing kernels..........................6 3 Maximum Mean Discrepancy (MMD) 11 3.1 Definition of MMD..................................
    [Show full text]
  • The Energy Goodness-Of-Fit Test and E-M Type Estimator for Asymmetric Laplace Distributions
    THE ENERGY GOODNESS-OF-FIT TEST AND E-M TYPE ESTIMATOR FOR ASYMMETRIC LAPLACE DISTRIBUTIONS John Haman A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY August 2018 Committee: Maria Rizzo, Advisor Joseph Chao, Graduate Faculty Representative Wei Ning Craig Zirbel Copyright c 2018 John Haman All rights reserved iii ABSTRACT Maria Rizzo, Advisor Recently the asymmetric Laplace distribution and its extensions have gained attention in the statistical literature. This may be due to its relatively simple form and its ability to model skew- ness and outliers. For these reasons, the asymmetric Laplace distribution is a reasonable candidate model for certain data that arise in finance, biology, engineering, and other disciplines. For a prac- titioner that wishes to use this distribution, it is very important to check the validity of the model before making inferences that depend on the model. These types of questions are traditionally addressed by goodness-of-fit tests in the statistical literature. In this dissertation, a new goodness-of-fit test is proposed based on energy statistics, a widely applicable class of statistics for which one application is goodness-of-fit testing. The energy goodness-of-fit test has a number of desirable properties. It is consistent against general alter- natives. If the null hypothesis is true, the distribution of the test statistic converges in distribution to an infinite, weighted sum of Chi-square random variables. In addition, we find through simula- tion that the energy test is among the most powerful tests for the asymmetric Laplace distribution in the scenarios considered.
    [Show full text]
  • From Distance Correlation to Multiscale Graph Correlation Arxiv
    From Distance Correlation to Multiscale Graph Correlation Cencheng Shen∗1, Carey E. Priebey2, and Joshua T. Vogelsteinz3 1Department of Applied Economics and Statistics, University of Delaware 2Department of Applied Mathematics and Statistics, Johns Hopkins University 3Department of Biomedical Engineering and Institute of Computational Medicine, Johns Hopkins University October 2, 2018 Abstract Understanding and developing a correlation measure that can detect general de- pendencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation — a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments — to the Multiscale Graph Correla- tion (MGC). By utilizing the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correla- tions, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound ∗[email protected] arXiv:1710.09768v3 [stat.ML] 30 Sep 2018 [email protected] [email protected] 1 Sample MGC and allows a number of desirable properties to be proved, includ- ing the universal consistency, convergence and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simula- tions with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better perfor- mance in general dependencies, compared to distance correlation and other popular methods.
    [Show full text]
  • Level Sets Based Distances for Probability Measures and Ensembles with Applications Arxiv:1504.01664V1 [Stat.ME] 7 Apr 2015
    Level Sets Based Distances for Probability Measures and Ensembles with Applications Alberto Mu~noz1, Gabriel Martos1 and Javier Gonz´alez2 1Department of Statistics, University Carlos III of Madrid Spain. C/ Madrid, 126 - 28903, Getafe (Madrid), Spain. [email protected], [email protected] 2Sheffield Institute for Translational Neuroscience, Department of Computer Science, University of Sheffield. Glossop Road S10 2HQ, Sheffield, UK. [email protected] ABSTRACT In this paper we study Probability Measures (PM) from a functional point of view: we show that PMs can be considered as functionals (generalized func- tions) that belong to some functional space endowed with an inner product. This approach allows us to introduce a new family of distances for PMs, based on the action of the PM functionals on `interesting' functions of the sample. We propose a specific (non parametric) metric for PMs belonging to this class, arXiv:1504.01664v1 [stat.ME] 7 Apr 2015 based on the estimation of density level sets. Some real and simulated data sets are used to measure the performance of the proposed distance against a battery of distances widely used in Statistics and related areas. 1 Introduction Probability metrics, also known as statistical distances, are of fundamental importance in Statistics. In essence, a probability metric it is a measure that quantifies how (dis)similar are two random quantities, in particular two probability measures (PM). Typical examples of the use of probability metrics in Statistics are homogeneity, independence and goodness of fit tests. For instance there are some goodness of fit tests based on the use of the χ2 distance and others that use the Kolmogorov-Smirnoff statistics, which corresponds to the choice of the supremum distance between two PMs.
    [Show full text]
  • The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance
    The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance Ravi Kashyap SolBridge International School of Business / City University of Hong Kong July 15, 2019 Keywords: Dimension Reduction; Johnson-Lindenstrauss; Bhattacharyya; Distance Measure; Covariance; Distribution; Uncertainty JEL Codes: C44 Statistical Decision Theory; C43 Index Numbers and Aggregation; G12 Asset Pricing Mathematics Subject Codes: 60E05 Distributions: general theory; 62-07 Data analysis; 62P20 Applications to economics Edited Version: Kashyap, R. (2019). The perfect marriage and much more: Combining dimension reduction, distance measures and covariance. Physica A: Statistical Mechanics and its Applications, XXX, XXX-XXX. Contents 1 Abstract 4 2 Introduction 4 3 Literature Review of Methodological Fundamentals 5 3.1 Bhattacharyya Distance . .5 3.2 Dimension Reduction . .9 4 Intuition for Dimension Reduction 10 arXiv:1603.09060v6 [q-fin.TR] 12 Jul 2019 4.1 Game of Darts . 10 4.2 The Merits and Limits of Four Physical Dimensions . 11 5 Methodological Innovations 11 5.1 Normal Log-Normal Mixture . 11 5.2 Normal Normal Product . 13 5.3 Truncated Normal / Multivariate Normal . 15 5.4 Discrete Multivariate Distribution . 19 5.5 Covariance and Distance . 20 1 6 Market-structure, Microstructure and Other Applications 24 6.1 Asset Pricing Application . 25 6.2 Biological Application . 26 7 Empirical Illustrations 27 7.1 Comparison of Security Prices across Markets . 27 7.2 Comparison of Security Trading Volumes, High-Low-Open-Close Prices and Volume-Price Volatilities . 30 7.3 Speaking Volumes Of: Comparison of Trading Volumes . 30 7.4 A Pricey Prescription: Comparison of Prices (Open, Close, High and Low) .
    [Show full text]
  • An Efficient and Distribution-Free Two-Sample Test Based on Energy Statistics and Random Projections
    An Efficient and Distribution-Free Two-Sample Test Based on Energy Statistics and Random Projections Cheng Huang and Xiaoming Huo July 14, 2017 Abstract A common disadvantage in existing distribution-free two-sample testing approaches is that the com- putational complexity could be high. Specifically, if the sample size is N, the computational complex- ity of those two-sample tests is at least O(N 2). In this paper, we develop an efficient algorithm with complexity O(N log N) for computing energy statistics in univariate cases. For multivariate cases, we introduce a two-sample test based on energy statistics and random projections, which enjoys the O(KN log N) computational complexity, where K is the number of random projections. We name our method for multivariate cases as Randomly Projected Energy Statistics (RPES). We can show RPES achieves nearly the same test power with energy statistics both theoretically and empirically. Numerical experiments also demonstrate the efficiency of the proposed method over the competitors. 1 Introduction Testing the equality of distributions is one of the most fundamental problems in statistics. Formally, let F and G denote two distribution function in Rp. Given independent and identically distributed samples X ,...,X and Y ,...,Y { 1 n} { 1 m} from two unknown distribution F and G, respectively, the two-sample testing problem is to test hypotheses : F = G v.s. : F = G. H0 H1 6 There are a few recent advances in two sample testing that attract attentions in statistics and machine learning communities. [17] propose a test statistic based on the optimal non-bipartite matching, and, [3] develop a test based on shortest Hamiltonian path, both of which are distribution-free.
    [Show full text]
  • A Generalization of K-Means by Energy Distance
    K-GROUPS: A GENERALIZATION OF K-MEANS BY ENERGY DISTANCE Songzi Li A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY May 2015 Committee: Maria L. Rizzo, Advisor Christopher M. Rump, Graduate Faculty Representative Hanfeng Chen Wei Ning ii ABSTRACT Maria L. Rizzo, Advisor We propose two distribution-based clustering algorithms called K-groups. Our algorithms group the observations in one cluster if they are from a common distribution. Energy distance is a non-negative measure of the distance between distributions that is based on Euclidean dis- tances between random observations, which is zero if and only if the distributions are identical. We use energy distance to measure the statistical distance between two clusters, and search for the best partition which maximizes the total between clusters energy distance. To implement our algorithms, we apply a version of Hartigan and Wong’s moving one point idea, and generalize this idea to moving any m points. We also prove that K-groups is a generalization of the K-means algo- rithm. K-means is a limiting case of the K-groups generalization, with common objective function and updating formula in that case. K-means is one of the well-known clustering algorithms. From previous research, it is known that K-means has several disadvantages. K-means performs poorly when clusters are skewed or overlapping. K-means can not handle categorical data. K-means can not be applied when dimen- sion exceeds sample size. Our K-groups methods provide a practical and effective solution to these problems.
    [Show full text]
  • Equivalence of Distance-Based and RKHS-Based Statistics In
    The Annals of Statistics 2013, Vol. 41, No. 5, 2263–2291 DOI: 10.1214/13-AOS1140 c Institute of Mathematical Statistics, 2013 EQUIVALENCE OF DISTANCE-BASED AND RKHS-BASED STATISTICS IN HYPOTHESIS TESTING By Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton and Kenji Fukumizu University College London, University of Cambridge, University College London and Max Planck Institute for Intelligent Systems, T¨ubingen, and Institute of Statistical Mathematics, Tokyo We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics litera- ture; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can inter- pret the MMD as energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of proba- bility distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
    [Show full text]
  • The Energy of Data
    The Energy of Data G´abor J. Sz´ekely,1 and Maria L. Rizzo2 1National Science Foundation, Arlington, VA 22230 and R´enyi Institute of Mathematics, Hungarian Academy of Sciences, Hungary; email: [email protected] 2Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403; email: [email protected] Xxxx. Xxx. Xxx. Xxx. YYYY. AA:1{36 Keywords This article's doi: energy distance, statistical energy, goodness-of-fit, multivariate 10.1146/((please add article doi)) independence, distance covariance, distance correlation, U-statistic, Copyright c YYYY by Annual Reviews. V-statistic, data distance, graph distance, statistical tests, clustering All rights reserved Abstract The energy of data is the value of a real function of distances between data in metric spaces. The name energy derives from Newton's gravi- tational potential energy which is also a function of distances between physical objects. One of the advantages of working with energy func- tions (energy statistics) is that even if the observations/data are com- plex objects, like functions or graphs, we can use their real valued dis- tances for inference. Other advantages will be illustrated and discussed in this paper. Concrete examples include energy testing for normal- ity, energy clustering, and distance correlation. Applications include genome, brain studies, and astrophysics. The direct connection between energy and mind/observations/data in this paper is a counterpart of the equivalence of energy and matter/mass in Einstein's E = mc2. 1 Contents 1. INTRODUCTION ............................................................................................ 2 1.1. Distances of observations { Rigid motion invariance ................................................... 5 1.2. Functions of distances { Jackknife invariance .........................................................
    [Show full text]
  • Package 'Energy'
    Package ‘energy’ February 22, 2021 Title E-Statistics: Multivariate Inference via the Energy of Data Version 1.7-8 Date 2021-02-21 Description E-statistics (energy) tests and statistics for multivariate and univariate inference, including distance correlation, one-sample, two-sample, and multi-sample tests for comparing multivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, k-groups and hierarchical clustering based on energy distance, testing for multivariate normality, distance components (disco) for non-parametric analysis of structured data, and other energy statistics/methods are implemented. Imports Rcpp (>= 0.12.6), stats, boot, gsl LinkingTo Rcpp Suggests MASS, CompQuadForm Depends R (>= 2.10) URL https://github.com/mariarizzo/energy License GPL (>= 2) LazyData true NeedsCompilation yes Repository CRAN Author Maria Rizzo [aut, cre], Gabor Szekely [aut] Maintainer Maria Rizzo <[email protected]> Date/Publication 2021-02-22 16:30:02 UTC R topics documented: energy-package . .2 centering distance matrices . .3 dcorT . .4 dcov.test . .6 1 2 energy-package dcov2d . .8 dcovU_stats . 10 disco . 11 distance correlation . 13 edist . 16 energy-deprecated . 18 energy.hclust . 18 eqdist.etest . 21 EVnormal . 23 indep.etest . 24 indep.test . 25 kgroups . 27 mvI.test . 29 mvnorm.test . 31 normal.test . 32 pdcor . 34 Poisson Tests . 35 sortrank . 38 Unbiased distance covariance . 39 U_product . 40 Index 42 energy-package E-statistics: Multivariate Inference via the Energy of Data Description Description: E-statistics (energy) tests and statistics for multivariate and univariate inference, in- cluding distance correlation, one-sample, two-sample, and multi-sample tests for comparing mul- tivariate distributions, are implemented.
    [Show full text]