Statistical Distances and Probability Metrics for Multivariate Data, Ensembles and Probability Distributions

Pages: 16 · File type: PDF · Size: 1,020 KB

Ph.D. Thesis
Author: Gabriel Martos Venturini
Advisor: Alberto Muñoz García
Department of Statistics, Universidad Carlos III de Madrid
Leganés (Madrid, Spain), June 2015

© 2015 Gabriel Martos Venturini. All rights reserved.

Acknowledgements

I would like to thank my family and friends: without their encouragement and support it would have been impossible to reach this goal. I also thank Alberto Muñoz, my PhD advisor, for giving me the opportunity to learn and discover many things at his side; throughout these years we have shared a great deal, and he has been a great teacher and friend. Thanks also to Javier González for his support and advice, and to Laura Sangalli, my advisor in Milan, for the time and effort she dedicated to my work.

Abstract

Distance measures are of fundamental importance in solving practical statistical problems, such as hypothesis testing, independence testing, goodness-of-fit testing, classification, outlier detection and density estimation, to name just a few.

The Mahalanobis distance was originally developed to compute the distance from a point to the center of a distribution while taking the distribution of the data, in that case the normal distribution, into account. It is the only distance measure in the statistical literature that takes the probabilistic information of the data into account. In this thesis we study different distance measures that share this fundamental characteristic: all the proposed distances incorporate probabilistic information.

The thesis is organized as follows. In Chapter 1 we motivate the problems addressed in this thesis. In Chapter 2 we present the usual definitions and properties of the distance measures for multivariate data and for probability distributions treated in the statistical literature.

In Chapter 3 we propose a distance that generalizes the Mahalanobis distance to the case where the distribution of the data is not Gaussian. To this end, we introduce a Mercer kernel based on the distribution of the data at hand; this kernel induces distances from a point to the center of a distribution. In this chapter we also present a plug-in estimator of the distance that allows us to solve classification and outlier detection problems efficiently.
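As a rough illustration of the idea of a kernel-induced distance from a point to the "center" of a distribution, the sketch below measures the distance from a point to the kernel mean of a sample in the feature space of a Gaussian RBF kernel. This is a generic stand-in only: the RBF kernel, the bandwidth gamma and the test points are arbitrary choices, not the distribution-based Mercer kernel constructed in Chapter 3.

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        # Gaussian (RBF) Mercer kernel between the rows of a and b
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def kernel_distance_to_center(x, sample, gamma=1.0):
        # Distance from x to the kernel mean of `sample` in feature space:
        # d^2 = k(x,x) - 2*mean_i k(x,X_i) + mean_{ij} k(X_i,X_j)
        x = np.atleast_2d(x)
        kxx = rbf_kernel(x, x, gamma)[0, 0]
        kxs = rbf_kernel(x, sample, gamma).mean()
        kss = rbf_kernel(sample, sample, gamma).mean()
        return np.sqrt(kxx - 2.0 * kxs + kss)

    rng = np.random.default_rng(0)
    sample = rng.standard_normal((500, 2))                 # reference sample
    print(kernel_distance_to_center([0.0, 0.0], sample))   # near the center: smaller
    print(kernel_distance_to_center([4.0, 4.0], sample))   # far out: larger

With a linear kernel the same expression reduces to the Euclidean distance to the sample mean, and applying it to whitened data gives the Mahalanobis distance, which is one way to see the generalization.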
In Chapter 4 we present two new distance measures for multivariate data that incorporate the probabilistic information contained in the sample. We also introduce two estimation methods for the proposed distances and study their convergence empirically. In the experimental section of Chapter 4 we solve classification problems, obtaining better results than several standard methods from the discriminant analysis literature.

In Chapter 5 we propose a new family of probability metrics and study its theoretical properties. We introduce an estimation method for the proposed distances that is based on the estimation of level sets, avoiding in this way the difficult task of density estimation. We show that the proposed distance is able to solve hypothesis tests and classification problems in general contexts, obtaining better results than other standard methods in statistics.

In Chapter 6 we introduce a new distance for sets of points. To this end, we define a dissimilarity measure for points by using a Mercer kernel, which is then extended to a Mercer kernel for sets of points; a sketch of such a set kernel appears below. In this way we induce a dissimilarity index for sets of points, which is used as input for an adaptive k-means clustering algorithm. The proposed clustering algorithm aligns the sets of points by considering a wide range of possible warping functions. This chapter presents an application to clustering neuronal spike trains, a relevant problem in neural coding.

Finally, in Chapter 7, we present the general conclusions of this thesis and future lines of research.
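One standard way to turn a point kernel into a Mercer kernel for sets of points is to average it over all pairs, and then use the induced feature-space distance as the dissimilarity index. The sketch below does exactly that for spike-train-like sets of event times; the RBF point kernel, its bandwidth, the toy trains and the omission of the alignment step are all simplifications relative to the construction in Chapter 6.

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        # Gaussian point kernel between two sets of d-dimensional points
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def set_kernel(A, B, gamma=1.0):
        # Average point kernel over all pairs: a valid Mercer kernel on sets
        return rbf_kernel(A, B, gamma).mean()

    def set_dissimilarity(A, B, gamma=1.0):
        # Squared feature-space distance induced by the set kernel
        return set_kernel(A, A, gamma) + set_kernel(B, B, gamma) - 2.0 * set_kernel(A, B, gamma)

    # Toy "spike trains": sets of event times (seconds), one per neuron
    train_a = np.array([[0.10], [0.32], [0.58], [0.81]])
    train_b = np.array([[0.11], [0.30], [0.60], [0.83]])    # similar firing pattern
    train_c = np.array([[0.05], [0.06], [0.90], [0.95]])    # different pattern
    print(set_dissimilarity(train_a, train_b, gamma=50.0))  # small
    print(set_dissimilarity(train_a, train_c, gamma=50.0))  # larger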
Contents

List of Figures
List of Tables
1 Introduction
  1.1 Overview of the Thesis and Contributions
2 Background: Distances and Geometry in Statistics
  2.1 Distances and Similarities between Points
    2.1.1 Bregman divergences
  2.2 Probability Metrics
    2.2.1 Statistical divergences
    2.2.2 Dissimilarity measures in the …
Recommended publications
  • Conditional Central Limit Theorems for Gaussian Projections
    Conditional Central Limit Theorems for Gaussian Projections
    Galen Reeves

    Abstract—This paper addresses the question of when projections of a high-dimensional random vector are approximately Gaussian. This problem has been studied previously in the context of high-dimensional data analysis, where the focus is on low-dimensional projections of high-dimensional point clouds. The focus of this paper is on the typical behavior when the projections are generated by an i.i.d. Gaussian projection matrix. The main results are bounds on the deviation between the conditional distribution of the projections and a Gaussian approximation, where the conditioning is on the projection matrix. The bounds are given in terms of the quadratic Wasserstein distance and relative entropy and are stated explicitly as a function of the number of projections and certain key properties of the random vector. The proof uses Talagrand's transportation inequality and a general integral-moment inequality for mutual information.

    The focus of this paper is on the typical behavior when the projections are generated randomly and independently of the random variables. Given an n-dimensional random vector X, the k-dimensional linear projection Z is defined according to

        Z = ΘX,    (1)

    where Θ is a k × n random matrix that is independent of X. Throughout this paper it is assumed that X has finite second moment and that the entries of Θ are i.i.d. Gaussian random variables with mean zero and variance 1/n. Our main results are bounds on the deviation between the conditional distribution of Z given Θ and a Gaussian approximation.
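    A quick simulation of the setup in (1); the dimensions, the Rademacher choice for X and the KS check are illustrative assumptions. With X non-Gaussian and Θ drawn once with N(0, 1/n) entries, the projected sample is typically indistinguishable from N(0, 1).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, m = 2000, 2000                       # ambient dimension, sample size

    # X: non-Gaussian coordinates (Rademacher +/-1), finite second moment
    X = rng.choice([-1.0, 1.0], size=(m, n))

    # Theta: 1 x n matrix of i.i.d. N(0, 1/n) entries, drawn once and held fixed
    Theta = rng.standard_normal((1, n)) / np.sqrt(n)

    Z = X @ Theta.T                         # conditional law of Z given Theta
    print(stats.kstest(Z[:, 0], "norm").pvalue)   # typically well above 0.05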
  • A Modified Energy Statistic for Unsupervised Anomaly Detection
    A Modified Energy Statistic for Unsupervised Anomaly Detection
    Rupam Mukherjee
    GE Research, Bangalore, Karnataka, 560066, India
    [email protected], [email protected]

    ABSTRACT

    For prognostics in industrial applications, the degree of anomaly of a test point from a baseline cluster is estimated using a statistical distance metric. Among different statistical distance metrics, energy distance is an interesting concept based on Newton's Law of Gravitation, promising simpler computation than classical distance metrics. In this paper, we review the state-of-the-art formulations of energy distance and point out several reasons why they are not directly applicable to the anomaly-detection problem. Thereby, we propose a new energy-based metric called the P-statistic, which addresses these issues, is applicable to anomaly detection, and retains the computational simplicity of the energy distance. We also demonstrate its effectiveness on a real-life data-set.

    Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sudden unplanned failures. Although, as is captured in (Goldstein & Uchida, 2016), anomaly detection may be both supervised and unsupervised in nature depending on the availability of labelled ground-truth data, in this paper anomaly detection is considered to be the unsupervised version, which is more convenient and more practical in many industrial systems. The choice of the statistical distance formulation has an important impact on the effectiveness of RUL estimation. In the literature, statistical distances are described to be of two types, as follows.

    1. Divergence measures: These estimate the distance (or, similarity) between probability distributions.
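    The paper's P-statistic is not reproduced here, but the plain energy distance it modifies already yields a point-to-baseline anomaly score: the energy distance between the point mass at x and the empirical baseline distribution. The threshold and the toy data below are assumptions for illustration.

    import numpy as np
    from scipy.spatial.distance import cdist, pdist

    def energy_anomaly_score(x, baseline):
        # Energy distance between the point mass at x and the empirical
        # distribution of `baseline`: 2*E||x - X|| - E||X - X'||
        a = cdist(np.atleast_2d(x), baseline).mean()
        b = pdist(baseline).mean()
        return 2.0 * a - b

    rng = np.random.default_rng(0)
    baseline = rng.standard_normal((300, 4))   # "healthy" reference cluster
    threshold = 3.0                            # assumed, e.g. calibrated on held-out healthy data
    for x in (np.zeros(4), 6 * np.ones(4)):
        s = energy_anomaly_score(x, baseline)
        print(s, "ALERT" if s > threshold else "ok")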
  • Package 'energy'
    Package 'energy'                                    May 27, 2018

    Title: E-Statistics: Multivariate Inference via the Energy of Data
    Version: 1.7-4
    Date: 2018-05-27
    Authors: Maria L. Rizzo and Gabor J. Szekely
    Description: E-statistics (energy) tests and statistics for multivariate and univariate inference, including distance correlation, one-sample, two-sample, and multi-sample tests for comparing multivariate distributions, are implemented. Measuring and testing multivariate independence based on distance correlation, partial distance correlation, multivariate goodness-of-fit tests, clustering based on energy distance, testing for multivariate normality, distance components (disco) for non-parametric analysis of structured data, and other energy statistics/methods are implemented.
    Maintainer: Maria Rizzo <[email protected]>
    Imports: Rcpp (>= 0.12.6), stats, boot
    LinkingTo: Rcpp
    Suggests: MASS
    URL: https://github.com/mariarizzo/energy
    License: GPL (>= 2)
    NeedsCompilation: yes
    Repository: CRAN
    Date/Publication: 2018-05-27 21:03:00 UTC

    R topics documented: energy-package, centering distance matrices, dcor.ttest, dcov.test, dcovU_stats, disco, distance correlation, edist, energy.hclust, eqdist.etest, indep.etest, indep.test, mvI.test, mvnorm.etest, pdcor, poisson.mtest, unbiased distance covariance, U_product.
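    The package implements these tests in R with compiled code. As a rough cross-language illustration, here is a minimal NumPy sketch of the two-sample energy statistic underlying eqdist.etest, with a naive permutation p-value; the sample sizes, shift, permutation count and seeds are arbitrary choices.

    import numpy as np
    from scipy.spatial.distance import cdist

    def energy_statistic(x, y):
        # Szekely-Rizzo two-sample E-statistic:
        # (nm/(n+m)) * (2*mean||x-y|| - mean||x-x'|| - mean||y-y'||)
        n, m = len(x), len(y)
        e = 2 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()
        return n * m / (n + m) * e

    def energy_test(x, y, n_perm=199, seed=0):
        # Naive permutation test: reshuffle the pooled sample
        rng = np.random.default_rng(seed)
        pooled, n = np.vstack([x, y]), len(x)
        obs = energy_statistic(x, y)
        hits = sum(energy_statistic(*np.split(rng.permutation(pooled), [n])) >= obs
                   for _ in range(n_perm))
        return obs, (hits + 1) / (n_perm + 1)

    rng = np.random.default_rng(1)
    x = rng.standard_normal((80, 3))
    y = rng.standard_normal((80, 3)) + 0.8     # shifted distribution
    print(energy_test(x, y))                   # small p-value expected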
  • Regression Assumptions and Diagnostics
    Newsom, Psy 522/622 Multiple Regression and Multivariate Quantitative Methods, Winter 2021

    Overview of Regression Assumptions and Diagnostics

    Assumptions. Statistical assumptions are determined by the mathematical implications for each statistic, and they set the guideposts within which we might expect our sample estimates to be unbiased or our significance tests to be accurate. Violations of assumptions therefore should be taken seriously and investigated, but they do not necessarily always indicate that the statistical test will be inaccurate. The complication is that it is almost never possible to know for certain if an assumption has been violated, and it is often a judgement call by the researcher on whether or not a violation has occurred or is serious. You will likely find that the wording and lists of regression assumptions provided in regression texts tend to vary, but here is my summary.

    Linearity. Regression is a summary of the relationship between X and Y that uses a straight line. Therefore, the estimate of that relationship holds only to the extent that there is a consistent increase or decrease in Y as X increases. There might be a relationship (even a perfect one) between the two variables that is not linear, or some of the relationship may be of a linear form and some of it may be of a nonlinear form (e.g., quadratic shape). The correlation coefficient and the slope can only be accurate about the linear portion of the relationship.

    Normal distribution of residuals. For t-tests and ANOVA, we discussed that there is an assumption that the dependent variable is normally distributed in the population.
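    Two of these checks are easy to run numerically. The sketch below fits a simple regression on simulated data (all choices of data, tests and cutoffs are illustrative, not part of the handout) and probes residual normality and linearity.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 200)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)   # simulated: truly linear

    slope, intercept, r, p, se = stats.linregress(x, y)
    fitted = intercept + slope * x
    resid = y - fitted

    # Normality of residuals: Shapiro-Wilk (null hypothesis = normal)
    print("Shapiro-Wilk p =", stats.shapiro(resid).pvalue)

    # Crude linearity check: residuals correlating with fitted^2 would
    # suggest curvature the straight line is missing
    print("corr(resid, fitted^2) =", np.corrcoef(resid, fitted ** 2)[0, 1])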
  • Independence Measures
    INDEPENDENCE MEASURES
    Beatriz Bueno Larraz
    Máster en Investigación e Innovación en Tecnologías de la Información y las Comunicaciones, Escuela Politécnica Superior.
    Máster en Matemáticas y Aplicaciones, Facultad de Ciencias.
    UNIVERSIDAD AUTÓNOMA DE MADRID, 09/03/2015
    Advisors: Alberto Suárez González, José Ramón Berrendero Díaz

    Acknowledgements

    This work would not have been possible without the knowledge acquired during both Master's degrees, whose subjects provided essential notions for carrying out this study. The fulfilment of this Master's thesis is the result of the guidance, suggestions and encouragement of professors D. José Ramón Berrendero Díaz and D. Alberto Suárez González. They have guided me throughout these months with an open and generous spirit, have shown excellent willingness in facing the doubts that arose, and have provided valuable observations for this research. I would like to thank them very much for the opportunity to collaborate with them on this project and for initiating me into research. I would also like to express my gratitude to the postgraduate studies committees of both faculties, especially to the Master's degrees' coordinators, all of whom took an interest in my situation due to the change in regulations and answered all the questions I had. I must not forget, of course, my family and friends, who have supported and encouraged me during all this time.

    Contents

    1 Introduction
    2 Reproducing Kernel Hilbert Spaces (RKHS)
      2.1 Definitions and principal properties
      2.2 Characterizing reproducing kernels
    3 Maximum Mean Discrepancy (MMD)
      3.1 Definition of MMD …
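    Where the excerpt breaks off, the MMD it is about to define has a standard unbiased sample estimator (Gretton et al.). A minimal NumPy sketch follows; the RBF kernel, its bandwidth and the toy samples are assumptions for illustration.

    import numpy as np

    def rbf(a, b, gamma=1.0):
        # Gaussian (RBF) kernel matrix between rows of a and b
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def mmd2_unbiased(x, y, gamma=1.0):
        # Unbiased estimate of MMD^2: diagonal terms are excluded
        n, m = len(x), len(y)
        kxx, kyy, kxy = rbf(x, x, gamma), rbf(y, y, gamma), rbf(x, y, gamma)
        term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
        term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
        return term_x + term_y - 2.0 * kxy.mean()

    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 2))
    x2 = rng.standard_normal((200, 2))
    y = rng.standard_normal((200, 2)) * 1.5    # different scale
    print(mmd2_unbiased(x, x2))                # same distribution: near 0
    print(mmd2_unbiased(x, y))                 # different: clearly positive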
  • IR-IITBHU at TREC 2016 Open Search Track: Retrieving Documents Using Divergence from Randomness Model in Terrier
    IR-IITBHU at TREC 2016 Open Search Track: Retrieving documents using the Divergence From Randomness model in Terrier
    Mitodru Niyogi (1) and Sukomal Pal (2)
    (1) Department of Information Technology, Government College of Engineering & Ceramic Technology, Kolkata
    (2) Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi

    Abstract

    In our participation at the TREC 2016 Open Search Track, which focuses on ad-hoc scientific literature search, we used Terrier, a modular and scalable Information Retrieval framework, as a tool to rank documents. The organizers provided live data as documents, queries and user interactions from a real search engine, available through the Living Labs API framework. The data was then converted into TREC format to be used in Terrier. We used the Divergence from Randomness (DFR) model to rank the available documents in response to a set of queries; specifically, the In_expC2 model: the inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for the first normalisation, and normalisation 2 for term frequency normalisation with natural logarithm. Altogether we submitted 391 runs for the sites CiteSeerX and SSOAR to the Open Search Track via the Living Labs API framework. We received an 'outcome' of 0.72 for test queries and 0.62 for train queries of site CiteSeerX at the end of the Round 3 Test Period, where the outcome is computed as #wins / (#wins + #losses). A 'win' occurs when the participant achieves more clicks on his documents than those of the site, and a 'loss' otherwise. Complete relevance judgments are awaited at the moment. We look forward to getting the users' feedback and working further with them.
  • Subnational Inequality Divergence
    Subnational Inequality Divergence
    Tom VanHeuvelen (1), University of Minnesota, Department of Sociology

    Abstract

    How have inequality levels across local labor markets in the subnational United States changed over the past eight decades? In this study, I examine inequality divergence, or the inequality of inequalities. While divergence trends of central tendencies such as per capita income have been well documented, less is known about the descriptive trends or contributing mechanisms for inequality. In this study, I construct wage inequality measures in 722 local labor markets covering the entire contiguous United States across 22 waves of Census and American Community Survey data from 1940-2019 to assess the historical trends of inequality divergence. I apply variance decomposition and counterfactual techniques to develop main conclusions. Inequality divergence follows a u-shaped pattern, declining through 1990 but with contemporary divergence at as high a level as any time in the past 80 years. Early era convergence occurred broadly and primarily worked to reduce interregional differences, whereas modern inequality divergence operates through a combination of novel mechanisms, most notably through highly unequal urban areas separating from other labor markets. Overall, results show geographical fragmentation of inequality underneath overall inequality growth in recent years, highlighting the fundamental importance of spatial trends for broader stratification outcomes.

    (1) Correspondence: [email protected]. A previous version of this manuscript was presented at the 2021 Population Association of America annual conference. Thank you to Jane VanHeuvelen and Peter Catron for their helpful comments.

    Recent changes in the United States have situated geographical residence as a key pillar of the unequal distribution of economic resources (Austin et al.
  • Mahalanobis Distance
    Mahalanobis Distance

    Here is a scatterplot of some multivariate data (in two dimensions). What can we make of it when the axes are left out?

    Introduce coordinates that are suggested by the data themselves. The origin will be at the centroid of the points (the point of their averages). The first coordinate axis (blue in the next figure) will extend along the "spine" of the points, which (by definition) is any direction in which the variance is the greatest. The second coordinate axis (red in the figure) will extend perpendicularly to the first one. (In more than two dimensions, it will be chosen in that perpendicular direction in which the variance is as large as possible, and so on.)

    We need a scale. The standard deviation along each axis will do nicely to establish the units along the axes. Remember the 68-95-99.7 rule: about two-thirds (68%) of the points should be within one unit of the origin (along the axis); about 95% should be within two units. That makes it easy to eyeball the correct units. For reference, this figure includes the unit circle in these units.

    That doesn't really look like a circle, does it? That's because this picture is distorted (as evidenced by the different spacings among the numbers on the two axes). Let's redraw it with the axes in their proper orientations (left to right and bottom to top) and with a unit aspect ratio, so that one unit horizontally really does equal one unit vertically. You measure the Mahalanobis distance in this picture rather than in the original.
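    The construction above translates almost line-for-line into code: center at the centroid, rotate to the principal axes, rescale each axis by its standard deviation, then measure ordinary Euclidean distance. A sketch with simulated data (the covariance and the test point are arbitrary choices), checked against the usual matrix formula:

    import numpy as np

    rng = np.random.default_rng(0)
    # A correlated 2-D cloud, like the scatterplot described above
    data = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)

    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)

    # "Coordinates suggested by the data": rotate to the principal axes
    # (eigenvectors of the covariance) and rescale by the std along each axis
    evals, evecs = np.linalg.eigh(cov)
    def to_data_coords(x):
        return (x - mu) @ evecs / np.sqrt(evals)

    x = np.array([2.0, 3.0])
    d_geometric = np.sqrt(to_data_coords(x) @ to_data_coords(x))

    # Same number via the familiar formula sqrt((x-mu)^T Sigma^{-1} (x-mu))
    d_formula = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))
    print(d_geometric, d_formula)   # the two agree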
  • Multivariate Statistics Chapter 5: Multidimensional Scaling
    Multivariate Statistics. Chapter 5: Multidimensional scaling
    Pedro Galeano
    Departamento de Estadística, Universidad Carlos III de Madrid
    [email protected]
    Course 2017/2018, Master in Mathematical Engineering

    Outline: 1 Introduction. 2 Statistical distances. 3 Metric MDS. 4 Non-metric MDS.

    Introduction

    As we have seen in previous chapters, principal components and factor analysis are important dimension reduction tools. However, in many applied sciences, data is recorded as ranked information. For example, in marketing, one may record "product A is better than product B". Multivariate observations therefore often have mixed data characteristics and contain information that would not enable us to employ one of the multivariate techniques presented so far.

    Multidimensional scaling (MDS) is a method based on proximities between objects, subjects, or stimuli, used to produce a spatial representation of these items. MDS is a dimension reduction technique, since the aim is to find a set of points in low dimension (typically two dimensions) that reflect the relative configuration of the high-dimensional data objects.

    The proximities between objects are defined as any set of numbers that express the amount of similarity or dissimilarity between pairs of objects. In contrast to the techniques considered so far, MDS does not start from an n × p dimensional data matrix, but from an n × n dimensional dissimilarity or distance matrix, D, with elements δ_{ii′} or d_{ii′}, respectively, for i, i′ = 1, …, n.
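    Classical (metric) MDS reduces to a double centering of the squared distance matrix followed by an eigendecomposition. A minimal sketch on a toy distance matrix (the points-on-a-line example is an assumption for illustration; the recovered coordinates match the original spacing up to rotation and reflection):

    import numpy as np

    def classical_mds(D, k=2):
        # Embed n objects in R^k from an n x n distance matrix D
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
        B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
        evals, evecs = np.linalg.eigh(B)
        idx = np.argsort(evals)[::-1][:k]        # top-k eigenpairs
        return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

    # Pairwise distances among four points on a line: 0, 1, 3, 6
    pts = np.array([0.0, 1.0, 3.0, 6.0])
    D = np.abs(np.subtract.outer(pts, pts))
    print(classical_mds(D, k=1).ravel())         # original spacing, recentered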
  • The Energy Goodness-Of-Fit Test and E-M Type Estimator for Asymmetric Laplace Distributions
    THE ENERGY GOODNESS-OF-FIT TEST AND E-M TYPE ESTIMATOR FOR ASYMMETRIC LAPLACE DISTRIBUTIONS
    John Haman

    A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, August 2018. Committee: Maria Rizzo (Advisor), Joseph Chao (Graduate Faculty Representative), Wei Ning, Craig Zirbel. © 2018 John Haman. All rights reserved.

    ABSTRACT

    Recently the asymmetric Laplace distribution and its extensions have gained attention in the statistical literature. This may be due to its relatively simple form and its ability to model skewness and outliers. For these reasons, the asymmetric Laplace distribution is a reasonable candidate model for certain data that arise in finance, biology, engineering, and other disciplines. For a practitioner that wishes to use this distribution, it is very important to check the validity of the model before making inferences that depend on the model. These types of questions are traditionally addressed by goodness-of-fit tests in the statistical literature. In this dissertation, a new goodness-of-fit test is proposed based on energy statistics, a widely applicable class of statistics for which one application is goodness-of-fit testing. The energy goodness-of-fit test has a number of desirable properties. It is consistent against general alternatives. If the null hypothesis is true, the distribution of the test statistic converges in distribution to an infinite, weighted sum of chi-square random variables. In addition, we find through simulation that the energy test is among the most powerful tests for the asymmetric Laplace distribution in the scenarios considered.
  • From Distance Correlation to Multiscale Graph Correlation
    From Distance Correlation to Multiscale Graph Correlation
    Cencheng Shen (1), Carey E. Priebe (2), and Joshua T. Vogelstein (3)
    (1) Department of Applied Economics and Statistics, University of Delaware
    (2) Department of Applied Mathematics and Statistics, Johns Hopkins University
    (3) Department of Biomedical Engineering and Institute of Computational Medicine, Johns Hopkins University
    arXiv:1710.09768v3 [stat.ML] 30 Sep 2018. October 2, 2018

    Abstract

    Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation — a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments — to the Multiscale Graph Correlation (MGC). By utilizing the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound Sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to distance correlation and other popular methods.
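    The distance correlation that MGC generalizes takes only a few lines of NumPy: double-center the pairwise distance matrices of each sample and correlate them. The sketch below (sample size and the quadratic example are arbitrary) shows it detecting a nonlinear dependence that Pearson correlation misses.

    import numpy as np
    from scipy.spatial.distance import cdist

    def _centered(a):
        # Doubly centered pairwise distance matrix: A_ij = d_ij - row - col + grand mean
        d = cdist(a, a)
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()

    def dcor(x, y):
        # Sample distance correlation (Szekely, Rizzo & Bakirov, 2007)
        A, B = _centered(x), _centered(y)
        dcov2 = (A * B).mean()
        return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

    rng = np.random.default_rng(0)
    t = rng.uniform(-1, 1, (500, 1))
    print(dcor(t, t ** 2))                          # nonlinear dependence: clearly > 0
    print(dcor(t, rng.uniform(-1, 1, (500, 1))))    # independent: near 0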
  • A Generalized Divergence for Statistical Inference
    A GENERALIZED DIVERGENCE FOR STATISTICAL INFERENCE
    Abhik Ghosh, Ian R. Harris, Avijit Maji, Ayanendranath Basu and Leandro Pardo
    TECHNICAL REPORT NO. BIRU/2013/3, 2013
    Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700 108, India

    Abhik Ghosh, Indian Statistical Institute, Kolkata, India. Ian R. Harris, Southern Methodist University, Dallas, USA. Avijit Maji, Indian Statistical Institute, Kolkata, India. Ayanendranath Basu, Indian Statistical Institute, Kolkata, India. Leandro Pardo, Complutense University, Madrid, Spain.

    Summary. The power divergence (PD) and the density power divergence (DPD) families have proved to be useful tools in the area of robust inference. The families have striking similarities, but also have fundamental differences; yet both families are extremely useful in their own ways. In this paper we provide a comprehensive description of the two families and tie in their role in statistical theory and practice. In the end, the families are seen to be part of a superfamily which contains both of these families as special cases. In the process, the limitation of the influence function as an effective descriptor of the robustness of the estimator is also demonstrated.

    Keywords: Robust Estimation, Divergence, Influence Function

    1. Introduction

    The density-based minimum divergence approach is a useful technique in parametric inference. Here the closeness of the data and the model is quantified by a suitable measure of density-based divergence between the data density and the model density. Many of these methods have been particularly useful because of the strong robustness properties that they inherently possess.
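    The DPD side of this superfamily lends itself to a compact illustration. Below is a minimal sketch of the minimum DPD estimator of Basu et al. (1998) for a normal model, where the only model-specific ingredient is the closed form of the integral term, here (2πσ²)^(−α/2)/√(1+α); the tuning value α = 0.5 and the 5% contamination are assumptions chosen to show the robustness the summary alludes to.

    import numpy as np
    from scipy import stats, optimize

    def dpd_objective(params, x, alpha=0.5):
        # DPD objective for a normal model (Basu et al., 1998):
        # integral(f^(1+alpha)) - (1 + 1/alpha) * mean(f(x)^alpha)
        mu, log_sigma = params
        sigma = np.exp(log_sigma)                 # log-parametrized to stay positive
        integral = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
        fx = stats.norm.pdf(x, mu, sigma)
        return integral - (1 + 1 / alpha) * np.mean(fx ** alpha)

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # 5% outliers

    res = optimize.minimize(dpd_objective, x0=[np.median(x), 0.0], args=(x, 0.5))
    print("MDPDE:", res.x[0], np.exp(res.x[1]))   # close to (0, 1) despite outliers
    print("MLE:  ", x.mean(), x.std())            # dragged toward the outliers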