On Wasserstein Two Sample Testing and Related Families Of
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Modified Energy Statistic for Unsupervised Anomaly Detection
A Modified Energy Statistic for Unsupervised Anomaly Detection Rupam Mukherjee GE Research, Bangalore, Karnataka, 560066, India [email protected], [email protected] ABSTRACT value. Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sud- For prognostics in industrial applications, the degree of an- den unplanned failures. Although as is captured in (Goldstein omaly of a test point from a baseline cluster is estimated using & Uchida, 2016), anomaly may be both supervised and un- a statistical distance metric. Among different statistical dis- supervised in nature depending on the availability of labelled tance metrics, energy distance is an interesting concept based ground-truth data, in this paper anomaly detection is consid- on Newton’s Law of Gravitation, promising simpler comput- ered to be the unsupervised version which is more convenient ation than classical distance metrics. In this paper, we re- and more practical in many industrial systems. view the state of the art formulations of energy distance and point out several reasons why they are not directly applica- Choice of the statistical distance formulation has an important ble to the anomaly-detection problem. Thereby, we propose impact on the effectiveness of RUL estimation. In literature, a new energy-based metric called the -statistic which ad- statistical distances are described to be of two types as fol- P dresses these issues, is applicable to anomaly detection and lows. retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set. 1. Divergence measures: These estimate the distance (or, similarity) between probability distributions. -
Package 'Energy'
Package ‘energy’ May 27, 2018 Title E-Statistics: Multivariate Inference via the Energy of Data Version 1.7-4 Date 2018-05-27 Author Maria L. Rizzo and Gabor J. Szekely Description E-statistics (energy) tests and statistics for multivariate and univariate inference, including distance correlation, one-sample, two-sample, and multi-sample tests for comparing multivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, clustering based on energy distance, testing for multivariate normality, distance components (disco) for non-parametric analysis of structured data, and other energy statistics/methods are implemented. Maintainer Maria Rizzo <[email protected]> Imports Rcpp (>= 0.12.6), stats, boot LinkingTo Rcpp Suggests MASS URL https://github.com/mariarizzo/energy License GPL (>= 2) NeedsCompilation yes Repository CRAN Date/Publication 2018-05-27 21:03:00 UTC R topics documented: energy-package . .2 centering distance matrices . .3 dcor.ttest . .4 dcov.test . .6 dcovU_stats . .8 disco . .9 distance correlation . 12 edist . 14 1 2 energy-package energy.hclust . 16 eqdist.etest . 19 indep.etest . 21 indep.test . 23 mvI.test . 25 mvnorm.etest . 27 pdcor . 28 poisson.mtest . 30 Unbiased distance covariance . 31 U_product . 32 Index 34 energy-package E-statistics: Multivariate Inference via the Energy of Data Description Description: E-statistics (energy) tests and statistics for multivariate and univariate inference, in- cluding distance correlation, one-sample, two-sample, and multi-sample tests for comparing mul- tivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, clustering based on energy distance, testing for multivariate normality, distance components (disco) for non- parametric analysis of structured data, and other energy statistics/methods are implemented. -
Strength in Numbers: the Rising of Academic Statistics Departments In
Agresti · Meng Agresti Eds. Alan Agresti · Xiao-Li Meng Editors Strength in Numbers: The Rising of Academic Statistics DepartmentsStatistics in the U.S. Rising of Academic The in Numbers: Strength Statistics Departments in the U.S. Strength in Numbers: The Rising of Academic Statistics Departments in the U.S. Alan Agresti • Xiao-Li Meng Editors Strength in Numbers: The Rising of Academic Statistics Departments in the U.S. 123 Editors Alan Agresti Xiao-Li Meng Department of Statistics Department of Statistics University of Florida Harvard University Gainesville, FL Cambridge, MA USA USA ISBN 978-1-4614-3648-5 ISBN 978-1-4614-3649-2 (eBook) DOI 10.1007/978-1-4614-3649-2 Springer New York Heidelberg Dordrecht London Library of Congress Control Number: 2012942702 Ó Springer Science+Business Media New York 2013 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. -
Erich Leo Lehmann, 19172009
J. R. Statist. Soc. A (2010) 173, Part 3, pp. 683–689 Obituaries Erich Leo Lehmann, 1917–2009 Erich L. Lehmann, Professor Emeritus at the University of California, Berkeley, passed away on September 12th, 2009, aged 91 years. Erich was one of the engines that drove much of the development of theoretical and mainstream statistics during the second half of the 20th century. At the same time he kept himself aware of developments in applied statistics and prob- ability. He knew both the subject matter and the individuals developing our subject. He was a member of the powerful team of individuals that Jerzy Neyman built up at Berkeley in the 1950s. These included David Blackwell, Joe Hodges, Lucien Le Cam, Michel Loève, Henry Scheffé and Elizabeth Scott. Erich co-authored articles with each, except for the probabilist Loève. Erich was born in Strasbourg, France, on November 20th, 1917. His family moved to Frank- furt where they lived until 1933. When the Nazis came into power the family fled to Switzerland where Erich went to high school. In 1938, following his father’s advice, he went to study math- ematics at Trinity College, Cambridge. Erich remarked that he was ‘always the best student in mathematics’, but he did not enjoy the accompanying astronomy and physics. Of the latter he said ‘I hated it’ (in ‘A conversation with Erich L. Lehmann’, with Morris DeGroot, in volume 1 of Statistical Science (1986)). In 1940, again influenced by his father, Erich went to New York and then following a suggestion by R. Courant he crossed the country to study at Berkeley. -
Peeking at A/B Tests Why It MaErs, and What to Do About It
KDD 2017 Applied Data Science Paper KDD’17, August 13–17, 2017, Halifax, NS, Canada Peeking at A/B Tests Why it maers, and what to do about it Ramesh Johari∗ Pete Koomen† Stanford University Optimizely, Inc. [email protected] Leonid Pekelis‡ David Walsh§ Optimizely, Inc. Stanford University [email protected] [email protected] ABSTRACT obtain a very simple “user interface”, because these measures isolate is paper reports on novel statistical methodology, which has been the task of analyzing experiments from the details of their design deployed by the commercial A/B testing platform Optimizely to and implementation. communicate experimental results to their customers. Our method- Crucially, the inferential validity of these p-values and con- ology addresses the issue that traditional p-values and condence dence intervals requires the separation between the design and intervals give unreliable inference. is is because users of A/B analysis of experiments to be strictly maintained. In particular, the testing soware are known to continuously monitor these measures sample size must be xed in advance. Compare this to A/B testing as the experiment is running. We provide always valid p-values practice, where users oen continuously monitor the p-values and and condence intervals that are provably robust to this eect. condence intervals reported in order to re-adjust the sample size Not only does this make it safe for a user to continuously monitor, of an experiment dynamically [14]. Figure 1 shows a typical A/B but it empowers her to detect true eects more eciently. is testing dashboard that enables such behavior. -
Independence Measures
INDEPENDENCE MEASURES Beatriz Bueno Larraz M´asteren Investigaci´one Innovaci´onen Tecnolog´ıasde la Informaci´ony las Comunicaciones. Escuela Polit´ecnica Superior. M´asteren Matem´aticasy Aplicaciones. Facultad de Ciencias. UNIVERSIDAD AUTONOMA´ DE MADRID 09/03/2015 Advisors: Alberto Su´arezGonz´alez Jo´seRam´onBerrendero D´ıaz ii Acknowledgements This work would not have been possible without the knowledge acquired during both Master degrees. Their subjects have provided me essential notions to carry out this study. The fulfilment of this Master's thesis is the result of the guidances, suggestions and encour- agement of professors D. Jos´eRam´onBerrendero D´ıazand D. Alberto Su´arezGonzalez. They have guided me throughout these months with an open and generous spirit. They have showed an excellent willingness facing the doubts that arose me, and have provided valuable observa- tions for this research. I would like to thank them very much for the opportunity of collaborate with them in this project and for initiating me into research. I would also like to express my gratitude to the postgraduate studies committees of both faculties, specially to the Master's degrees' coordinators. All of them have concerned about my situation due to the change of the normative, and have answered all the questions that I have had. I shall not want to forget, of course, of my family and friends, who have supported and encouraged me during all this time. iii iv Contents 1 Introduction 1 2 Reproducing Kernel Hilbert Spaces (RKHS)3 2.1 Definitions and principal properties..........................3 2.2 Characterizing reproducing kernels..........................6 3 Maximum Mean Discrepancy (MMD) 11 3.1 Definition of MMD.................................. -
Erich L. Lehmann, the Lehmann Symposia, and November 20Th 1917 3
IMS Collections Vol. 0 (2009) 1{7 c Institute of Mathematical Statistics, 2009 1 1 arXiv: math.PR/0000021 2 2 3 3 4 Erich L. Lehmann, The Lehmann 4 5 5 th 6 Symposia, and November 20 1917 6 7 7 8 Javier Rojo?? 8 9 9 Rice University 10 10 11 11 12 12 13 The Lehmann Symposia originated as a result of a conversation I had in the 13 14 year 2001 with the, then, Director of the Centro de Investigaci´onen Matem´aticas 14 15 (CIMAT), Victor P´erez-Abreu.We both felt that there was an urgent need to 15 16 bring back into focus theoretical statistics and our proposed solution was a series 16 17 of Symposia that could serve as a forum for some of the exciting theoretical work 17 18 being done in statistics. The First Lehmann Symposium took place at CIMAT in 18 19 May of 2002. Most of the participants were Mexican colleagues. The program can 19 20 be seen at the site http://www.stat.rice.edu/lehmann/1st-Lehmann.html. The 20 21 second Lehmann Symposium { http://www.stat.rice.edu/lehmann/ { was held 21 22 in May of 2004 at the School of Engineering at Rice University. Initially, the venues 22 23 for the Symposia would alternate between CIMAT and Rice University. However, 23 24 for various reasons, some being financial, it was decided to hold the 3rd Lehmann 24 25 Symposium in the United States. 25 26 The original plans for the Third Lehmann Symposium were to hold the sym- 26 27 posium at the Mathematical Sciences Research Institute (MSRI) in Berkeley dur- 27 28 ing the month of November of 2007. -
The Energy Goodness-Of-Fit Test and E-M Type Estimator for Asymmetric Laplace Distributions
THE ENERGY GOODNESS-OF-FIT TEST AND E-M TYPE ESTIMATOR FOR ASYMMETRIC LAPLACE DISTRIBUTIONS John Haman A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY August 2018 Committee: Maria Rizzo, Advisor Joseph Chao, Graduate Faculty Representative Wei Ning Craig Zirbel Copyright c 2018 John Haman All rights reserved iii ABSTRACT Maria Rizzo, Advisor Recently the asymmetric Laplace distribution and its extensions have gained attention in the statistical literature. This may be due to its relatively simple form and its ability to model skew- ness and outliers. For these reasons, the asymmetric Laplace distribution is a reasonable candidate model for certain data that arise in finance, biology, engineering, and other disciplines. For a prac- titioner that wishes to use this distribution, it is very important to check the validity of the model before making inferences that depend on the model. These types of questions are traditionally addressed by goodness-of-fit tests in the statistical literature. In this dissertation, a new goodness-of-fit test is proposed based on energy statistics, a widely applicable class of statistics for which one application is goodness-of-fit testing. The energy goodness-of-fit test has a number of desirable properties. It is consistent against general alter- natives. If the null hypothesis is true, the distribution of the test statistic converges in distribution to an infinite, weighted sum of Chi-square random variables. In addition, we find through simula- tion that the energy test is among the most powerful tests for the asymmetric Laplace distribution in the scenarios considered. -
From Distance Correlation to Multiscale Graph Correlation Arxiv
From Distance Correlation to Multiscale Graph Correlation Cencheng Shen∗1, Carey E. Priebey2, and Joshua T. Vogelsteinz3 1Department of Applied Economics and Statistics, University of Delaware 2Department of Applied Mathematics and Statistics, Johns Hopkins University 3Department of Biomedical Engineering and Institute of Computational Medicine, Johns Hopkins University October 2, 2018 Abstract Understanding and developing a correlation measure that can detect general de- pendencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation — a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments — to the Multiscale Graph Correla- tion (MGC). By utilizing the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correla- tions, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound ∗[email protected] arXiv:1710.09768v3 [stat.ML] 30 Sep 2018 [email protected] [email protected] 1 Sample MGC and allows a number of desirable properties to be proved, includ- ing the universal consistency, convergence and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simula- tions with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better perfor- mance in general dependencies, compared to distance correlation and other popular methods. -
Level Sets Based Distances for Probability Measures and Ensembles with Applications Arxiv:1504.01664V1 [Stat.ME] 7 Apr 2015
Level Sets Based Distances for Probability Measures and Ensembles with Applications Alberto Mu~noz1, Gabriel Martos1 and Javier Gonz´alez2 1Department of Statistics, University Carlos III of Madrid Spain. C/ Madrid, 126 - 28903, Getafe (Madrid), Spain. [email protected], [email protected] 2Sheffield Institute for Translational Neuroscience, Department of Computer Science, University of Sheffield. Glossop Road S10 2HQ, Sheffield, UK. [email protected] ABSTRACT In this paper we study Probability Measures (PM) from a functional point of view: we show that PMs can be considered as functionals (generalized func- tions) that belong to some functional space endowed with an inner product. This approach allows us to introduce a new family of distances for PMs, based on the action of the PM functionals on `interesting' functions of the sample. We propose a specific (non parametric) metric for PMs belonging to this class, arXiv:1504.01664v1 [stat.ME] 7 Apr 2015 based on the estimation of density level sets. Some real and simulated data sets are used to measure the performance of the proposed distance against a battery of distances widely used in Statistics and related areas. 1 Introduction Probability metrics, also known as statistical distances, are of fundamental importance in Statistics. In essence, a probability metric it is a measure that quantifies how (dis)similar are two random quantities, in particular two probability measures (PM). Typical examples of the use of probability metrics in Statistics are homogeneity, independence and goodness of fit tests. For instance there are some goodness of fit tests based on the use of the χ2 distance and others that use the Kolmogorov-Smirnoff statistics, which corresponds to the choice of the supremum distance between two PMs. -
The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance
The Perfect Marriage and Much More: Combining Dimension Reduction, Distance Measures and Covariance Ravi Kashyap SolBridge International School of Business / City University of Hong Kong July 15, 2019 Keywords: Dimension Reduction; Johnson-Lindenstrauss; Bhattacharyya; Distance Measure; Covariance; Distribution; Uncertainty JEL Codes: C44 Statistical Decision Theory; C43 Index Numbers and Aggregation; G12 Asset Pricing Mathematics Subject Codes: 60E05 Distributions: general theory; 62-07 Data analysis; 62P20 Applications to economics Edited Version: Kashyap, R. (2019). The perfect marriage and much more: Combining dimension reduction, distance measures and covariance. Physica A: Statistical Mechanics and its Applications, XXX, XXX-XXX. Contents 1 Abstract 4 2 Introduction 4 3 Literature Review of Methodological Fundamentals 5 3.1 Bhattacharyya Distance . .5 3.2 Dimension Reduction . .9 4 Intuition for Dimension Reduction 10 arXiv:1603.09060v6 [q-fin.TR] 12 Jul 2019 4.1 Game of Darts . 10 4.2 The Merits and Limits of Four Physical Dimensions . 11 5 Methodological Innovations 11 5.1 Normal Log-Normal Mixture . 11 5.2 Normal Normal Product . 13 5.3 Truncated Normal / Multivariate Normal . 15 5.4 Discrete Multivariate Distribution . 19 5.5 Covariance and Distance . 20 1 6 Market-structure, Microstructure and Other Applications 24 6.1 Asset Pricing Application . 25 6.2 Biological Application . 26 7 Empirical Illustrations 27 7.1 Comparison of Security Prices across Markets . 27 7.2 Comparison of Security Trading Volumes, High-Low-Open-Close Prices and Volume-Price Volatilities . 30 7.3 Speaking Volumes Of: Comparison of Trading Volumes . 30 7.4 A Pricey Prescription: Comparison of Prices (Open, Close, High and Low) . -
An Efficient and Distribution-Free Two-Sample Test Based on Energy Statistics and Random Projections
An Efficient and Distribution-Free Two-Sample Test Based on Energy Statistics and Random Projections Cheng Huang and Xiaoming Huo July 14, 2017 Abstract A common disadvantage in existing distribution-free two-sample testing approaches is that the com- putational complexity could be high. Specifically, if the sample size is N, the computational complex- ity of those two-sample tests is at least O(N 2). In this paper, we develop an efficient algorithm with complexity O(N log N) for computing energy statistics in univariate cases. For multivariate cases, we introduce a two-sample test based on energy statistics and random projections, which enjoys the O(KN log N) computational complexity, where K is the number of random projections. We name our method for multivariate cases as Randomly Projected Energy Statistics (RPES). We can show RPES achieves nearly the same test power with energy statistics both theoretically and empirically. Numerical experiments also demonstrate the efficiency of the proposed method over the competitors. 1 Introduction Testing the equality of distributions is one of the most fundamental problems in statistics. Formally, let F and G denote two distribution function in Rp. Given independent and identically distributed samples X ,...,X and Y ,...,Y { 1 n} { 1 m} from two unknown distribution F and G, respectively, the two-sample testing problem is to test hypotheses : F = G v.s. : F = G. H0 H1 6 There are a few recent advances in two sample testing that attract attentions in statistics and machine learning communities. [17] propose a test statistic based on the optimal non-bipartite matching, and, [3] develop a test based on shortest Hamiltonian path, both of which are distribution-free.