Robust Statistics: a Functional Approach



Journées de la robustesse, UAPV, 2011. Submitted to the Annals of ISUP.

By Anne Ruiz-Gazen, Toulouse School of Economics

Abstract. For a given statistical method, good properties are usually obtained under strong hypotheses that are likely not to be verified in practice. In this context, the robustness of a method refers to its capability of keeping good properties even when the hypotheses are not verified. Statistics involves the treatment of data sets that are assumed to be realizations of random variables. The assumptions usually concern the probability distribution from which the data set is generated. The field of Statistics called "robust statistics" consists in defining measures of robustness and proposing methods that are robust in the sense of these measures. After giving some basic notions of statistics, we present different well-known measures of robustness from the "robust statistics" field. In order to illustrate the ideas, we consider the simple case of estimating a "mean" using the arithmetic mean, the median, the trimmed mean and the winsorized mean. The paper is an introductory presentation of robust statistics for readers who are not necessarily statisticians, but it contains some technical details. The reader not interested in such details can read the introduction of each section and subsection, where the general ideas are given in simple words.
1. Introduction. Statistics is a branch of mathematics which deals with the collection and the analysis of data sets. Statistical analysis helps in making decisions in scientific, industrial or societal problems and is either descriptive or inferential. Descriptive statistics consists of summaries and descriptions of the collected data sample using graphical or numerical measures. Inferential statistics uses probability theory in order to deduce properties of a population from the properties of a sample drawn from the population. The data are assumed to be generated according to a Data Generating Process (DGP). Assumptions are made on the probability distributions that underlie the DGP, leading to the definition of models.
For example, data are assumed to be generated by a gaussian, or normal, distribution. Statistical methods are defined and studied under these assumptions, which are more or less stringent according to the model. In practice these assumptions are usually not verified by the DGP, and statisticians face some robustness questions: what is the behavior of a given statistical method when the model is not completely valid? How can we measure such behavior in order to compare different methods? If the existing methods are not reliable under small violations of the assumptions, is it possible to propose more reliable methods? The Data Generating Process may deviate from the assumed model in different ways. In this paper, the deviation comes from the fact that not all of the observed data follow the assumed probability distribution. The model is well adapted to a majority of the data but not to all of them. The statistician is still interested in the model fitted by the majority of the observations but has to cope with a fraction of observations that follow a distribution different from the assumed one. This small fraction of observations is usually called outliers or outlying observations. In practice such a deviation is not unusual: it is frequent that data sets contain gross errors, because of measurement or copying mistakes for instance. In this context, a robust statistical procedure is one which is not too much influenced by potential outlying observations. What is precisely meant by "not too much influenced" will be detailed in Section 4.

AMS 2000 subject classifications: Primary 6202, 62F35; secondary 62F10.
Keywords and phrases: breakdown point, influence function, maxbias, median, trimmed mean, winsorized mean.
The type of deviation we focus on may not seem general enough, but it already leads to many statistical complications that are not easily overcome and which make the field of robust statistics challenging and complex from both a theoretical and an applied point of view. Moreover, it is important to point out that the robustness of a procedure is always relative to a given model; consequently, robustness measures and methods are to be proposed for each particular statistical model. And there exist many different models depending on the type of data and the objectives of the analysis! In the present paper, only a simple problem and model are tackled, which consist in estimating the mean of a gaussian distribution with simple and intuitive estimators. Note that the whole robustness theory for this "simple" problem is already much more intricate than the material we propose below. The problem of the sensitivity of usual statistical methods in the presence of outliers was considered many years ago (see for instance [15], [16] and also [6] and its discussion for a historical perspective). The theory discussed in the present paper is the one introduced by Huber and Hampel in the sixties and seventies of the last century, in the continuity of Tukey's work ([17]). Their work with some coauthors is described in detail in the books [11] and [8]. There exist thousands of references on robust statistics; some references will be given throughout the present paper, but many essential ones will be missing. For example, the recent books by [12] and [9] will not be quoted again in what follows. In order not to give too superficial an insight into robust statistics, some notions of statistics are needed. Some basic concepts are given in Section 2, while general principles of robust statistics are detailed in Section 3. Only a few formulas are given and all of them are explained in simple words.
The interested reader should refer to the literature for details and a more mathematically rigorous presentation. 2. Some basic statistical concepts. Statistics encompasses many different "domains" for which adapted methods have been developed. Concerning the collection of data, there exist the domains of survey sampling and experimental design. Concerning data analysis, it is not easy to give an exhaustive list, but there are, among others, linear and generalized linear models, time series and stochastic processes, spatial statistics, survival analysis and quality control. For each of these domains, most of the statistical inferential procedures can be divided into three different types: estimation, prediction and hypothesis testing. In the following we will only consider estimation problems. In particular, as a simple but illustrative example, we will consider the problem of estimating a "mean" parameter for a simple "parametric model". In statistics, parametric models are to be distinguished from nonparametric models. The former are precisely defined up to a finite number of parameters that need to be estimated from the data, while the latter are more flexible but usually do not lead to a complete description of the structure of the data set. The advantage of nonparametric statistics over parametric statistics is that no precise assumption on the distribution of the data is necessary; on the other hand, parametric models describe the data completely and are easier to interpret. Robust statistics has been mostly devoted to parametric models, but the theory can be applied as soon as the quantities to estimate are defined as functionals of the probability distribution of the DGP. The work in [4] gives an example of a robustness study in a nonparametric context for estimating production frontiers.
Incidentally, it is sometimes wrongly believed that nonparametric statistics are robust simply because they do not rest on precise distributional assumptions, but this is not true in general. Nonparametric procedures may be very sensitive to a few outliers present in the data, as exemplified in [1] and [4]. Note that statistics can also be divided into two main streams, the frequentist and the Bayesian approaches. In the following, only frequentist methods are considered, but the interested reader can refer to [8] (p. 55) or [11] (chapter 15) for a discussion of robust Bayesian methods. Let us now focus on the problem of estimating parameters in a parametric model. A very simple example consists in assuming that the data follow a normal distribution, denoted by N(µ, σ²). It is well known from probability theory that such a distribution is characterized by its first two moments, namely the mean µ, which can be any real number, and the variance σ², which can be any positive number.
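As a concrete illustration of these four estimators of a "mean", here is a minimal Python sketch (the data and the 10% trimming proportion are made up for the example; this is not code from the paper):

```python
def trimmed_mean(xs, prop):
    """Mean after discarding the prop smallest and prop largest fraction."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def winsorized_mean(xs, prop):
    """Mean after replacing extreme values by the nearest kept order statistic."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    n = len(xs)
    wins = [xs[k]] * k + xs[k:n - k] + [xs[n - k - 1]] * k
    return sum(wins) / n

def median(xs):
    """Middle value (or the average of the two middle values)."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

data = [1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1, 50.0]  # one gross outlier
print(sum(data) / len(data))       # arithmetic mean, dragged toward the outlier
print(median(data))                # robust
print(trimmed_mean(data, 0.10))    # drops one point at each end
print(winsorized_mean(data, 0.10)) # pulls the extremes in instead of dropping them
```

With a single gross outlier among ten points, the arithmetic mean is dragged far from the bulk of the data, while the median, the trimmed mean and the winsorized mean all stay close to 1.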
Recommended publications
  • Lecture 12 Robust Estimation
    Lecture 12: Robust Estimation. Prof. Dr. Svetlozar Rachev, Institute for Statistics and Mathematical Economics, University of Karlsruhe. Financial Econometrics, Summer Semester 2007. Copyright: these lecture notes cannot be copied and/or distributed without permission. The material is based on the textbook Financial Econometrics: From Basics to Advanced Modeling Techniques (Wiley-Finance, Frank J. Fabozzi Series) by Svetlozar T. Rachev, Stefan Mittnik, Frank Fabozzi, Sergio M. Focardi, Teo Jašić. Outline: robust statistics; robust estimators of regressions; illustration: robustness of the corporate bond yield spread model. Robust statistics addresses the problem of making estimates that are insensitive to small changes in the basic assumptions of the statistical models employed. The concepts and methods of robust statistics originated in the 1950s; however, the concepts of robust statistics had been used much earlier. Robust statistics: 1. assesses the changes in estimates due to small changes in the basic assumptions; 2. creates new estimates that are insensitive to small changes in some of the assumptions. Robust statistics is also useful to separate the contribution of the tails from the contribution of the body of the data. Peter Huber observed that robust, distribution-free, and nonparametric actually are not closely related properties.
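The robust regression estimators the lecture refers to can be illustrated with a from-scratch Huber M-estimator for a straight line, fitted by iteratively reweighted least squares. This is a generic textbook sketch on made-up data, not code from the lecture notes; delta = 1.345 is the usual tuning constant of the Huber function.

```python
import statistics

def huber_line(xs, ys, delta=1.345, n_iter=100):
    """Fit y = a + b*x by iteratively reweighted least squares with
    Huber weights: points with large residuals get weight delta*s/|r|."""
    n = len(xs)
    w = [1.0] * n
    a = b = 0.0
    for _ in range(n_iter):
        # Weighted least squares for the current weights
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        b = sxy / sxx
        a = my - b * mx
        # Robust residual scale from the MAD (falls back to 1.0 if the MAD is 0)
        r = [y - a - b * x for x, y in zip(xs, ys)]
        med = statistics.median(r)
        s = statistics.median(abs(ri - med) for ri in r) / 0.6745 or 1.0
        w = [1.0 if abs(ri) <= delta * s else delta * s / abs(ri) for ri in r]
    return a, b

xs = [float(i) for i in range(20)]
ys = [1.0 + 0.5 * x for x in xs]
ys[0] = ys[1] = 25.0          # two gross outliers
a, b = huber_line(xs, ys)
print(a, b)                   # close to the uncontaminated line y = 1 + 0.5x
```

The two gross outliers barely move the fit, whereas an ordinary least squares line would be pulled toward them.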
  • Outlier Identification
    STATGRAPHICS – Rev. 7/6/2009. Outlier Identification. Summary: The Outlier Identification procedure is designed to help determine whether or not a sample of n numeric observations contains outliers. By "outlier", we mean an observation that does not come from the same distribution as the rest of the sample. Both graphical methods and formal statistical tests are included. The procedure will also save a column back to the datasheet identifying the outliers in a form that can be used in the Select field on other data input dialog boxes. Sample StatFolio: outlier.sgp. Sample data: the file bodytemp.sgd contains data describing the body temperature of a sample of n = 130 people. It was obtained from the Journal of Statistical Education Data Archive (www.amstat.org/publications/jse/jse_data_archive.html) and originally appeared in the Journal of the American Medical Association. The first 20 rows of the file are shown below.

    Temperature  Gender  Heart Rate
    98.4         Male    84
    98.4         Male    82
    98.2         Female  65
    97.8         Female  71
    98           Male    78
    97.9         Male    72
    99           Female  79
    98.5         Male    68
    98.8         Female  64
    98           Male    67
    97.4         Male    78
    98.8         Male    78
    99.5         Male    75
    98           Female  73
    100.8        Female  77
    97.1         Male    75
    98           Male    71
    98.7         Female  72
    98.9         Male    80
    99           Male    75

    Data input: the data to be analyzed consist of a single numeric column containing n = 2 or more observations.
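The kind of screening this procedure performs can be sketched with a robust rule based on the median and the median absolute deviation, here the Iglewicz-Hoaglin modified z-score rule with the customary 3.5 cutoff. This is a generic stand-in, not the actual STATGRAPHICS tests, applied to the 20 temperatures shown above:

```python
import statistics

def mad_outliers(xs, cutoff=3.5):
    """Flag values whose modified z-score, based on the median and the
    median absolute deviation (MAD), exceeds the cutoff."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return [x for x in xs if abs(0.6745 * (x - med) / mad) > cutoff]

# The 20 body temperatures from the excerpt above:
temps = [98.4, 98.4, 98.2, 97.8, 98.0, 97.9, 99.0, 98.5, 98.8, 98.0,
         97.4, 98.8, 99.5, 98.0, 100.8, 97.1, 98.0, 98.7, 98.9, 99.0]
print(mad_outliers(temps))  # [100.8]
```

Only the 100.8 reading is flagged: its modified z-score is about 4.0, while every other value stays well below 3.5.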
  • Robustness of Parametric and Nonparametric Tests Under Non-Normality for Two Independent Sample
    International Journal of Management and Applied Science, ISSN: 2394-7926, Volume-4, Issue-4, Apr.-2018. http://iraj.in ROBUSTNESS OF PARAMETRIC AND NONPARAMETRIC TESTS UNDER NON-NORMALITY FOR TWO INDEPENDENT SAMPLE. 1. Usman, M., 2. Ibrahim, N. 1. Dept of Statistics, Fed. Polytechnic Bali, Taraba State; 2. Dept of Maths and Statistics, Fed. Polytechnic Mubi, Nigeria. E-mail: [email protected] Abstract - Robust statistical methods have been developed for many common problems, such as estimating location, scale and regression parameters. One motivation is to provide methods with good performance when there are small departures from parametric distributions. This study aimed to investigate the performance of the t-test, Mann-Whitney U test and Kolmogorov-Smirnov test procedures on independent samples from unrelated populations, under situations where the basic assumptions of parametric tests are not met, for different sample sizes. Testing hypotheses on equality of means requires assumptions to be made about the format of the data to be employed. Sometimes the test may depend on the assumption that a sample comes from a distribution in a particular family; if there is a doubt, then a non-parametric test like the Mann-Whitney U test or the Kolmogorov-Smirnov test is employed. Random samples were simulated from Normal, Uniform, Exponential, Beta and Gamma distributions. The three test procedures were applied on the simulated data sets at various sample sizes (small and moderate) and their Type I error and power were studied in both situations under study. Keywords - Non-normal, independent sample, t-test, Mann-Whitney U test, Kolmogorov-Smirnov test. I. INTRODUCTION. … assumptions about your data, but it may also require the data to be an independent random sample.
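The study's simulation design can be sketched in a few lines: draw two samples from the same skewed population, apply the two-sample t-test, and record how often the null hypothesis is falsely rejected. A hedged sketch with made-up settings (n = 15 per group, exponential data), not the study's own code:

```python
import math
import random
import statistics

def two_sample_t(x, y):
    """Classical pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Monte Carlo estimate of the Type I error of the t-test when both samples
# come from the same (skewed) exponential population; 2.048 is the 97.5%
# quantile of the t distribution with 28 degrees of freedom.
random.seed(1)
n, reps, rejections = 15, 2000, 0
for _ in range(reps):
    x = [random.expovariate(1.0) for _ in range(n)]
    y = [random.expovariate(1.0) for _ in range(n)]
    if abs(two_sample_t(x, y)) > 2.048:
        rejections += 1
print(rejections / reps)  # empirical rejection rate, to compare with the nominal 5%
```

Repeating the loop with other distributions (uniform, beta, gamma) and sample sizes reproduces the kind of Type I error comparison the abstract describes.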
  • Robust Methods in Biostatistics
    Robust Methods in Biostatistics. Stephane Heritier, The George Institute for International Health, University of Sydney, Australia; Eva Cantoni, Department of Econometrics, University of Geneva, Switzerland; Samuel Copt, Merck Serono International, Geneva, Switzerland; Maria-Pia Victoria-Feser, HEC Section, University of Geneva, Switzerland. A John Wiley and Sons, Ltd, Publication. Wiley Series in Probability and Statistics, established by Walter A. Shewhart and Samuel S. Wilks. Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg, Harvey Goldstein. Editors emeriti: Vic Barnett, J. Stuart Hunter, Jozef L. Teugels. A complete list of the titles in this series appears at the end of this volume. This edition first published 2009, © 2009 John Wiley & Sons Ltd. Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom. For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
  • A Guide to Robust Statistical Methods in Neuroscience
    A GUIDE TO ROBUST STATISTICAL METHODS IN NEUROSCIENCE. Authors: Rand R. Wilcox (1), Guillaume A. Rousselet (2). 1. Dept. of Psychology, University of Southern California, Los Angeles, CA 90089-1061, USA. 2. Institute of Neuroscience and Psychology, College of Medical, Veterinary and Life Sciences, University of Glasgow, 58 Hillhead Street, G12 8QB, Glasgow, UK. Corresponding author: [email protected] ABSTRACT: There is a vast array of new and improved methods for comparing groups and studying associations that offer the potential for substantially increasing power, providing improved control over the probability of a Type I error, and yielding a deeper and more nuanced understanding of data. These new techniques effectively deal with four insights into when and why conventional methods can be unsatisfactory. But for the non-statistician, the vast array of new and improved techniques for comparing groups and studying associations can seem daunting, simply because there are so many new methods that are now available. The paper briefly reviews when and why conventional methods can have relatively low power and yield misleading results. The main goal is to suggest some general guidelines regarding when, how and why certain modern techniques might be used. Keywords: Non-normality, heteroscedasticity, skewed distributions, outliers, curvature. 1. Introduction. The typical introductory statistics course covers classic methods for comparing groups (e.g., Student's t-test, the ANOVA F test and the Wilcoxon–Mann–Whitney test) and studying associations (e.g., Pearson's correlation and least squares regression). The two-sample Student's t-test and the ANOVA F test assume that sampling is from normal distributions and that the population variances are identical, which is generally known as the homoscedasticity assumption.
  • Robust Statistical Methods in R Using the WRS2 Package
    JSS Journal of Statistical Software, MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.i00. Robust Statistical Methods in R Using the WRS2 Package. Patrick Mair, Harvard University; Rand Wilcox, University of Southern California. Abstract: In this manuscript we present various robust statistical methods popular in the social sciences, and show how to apply them in R using the WRS2 package available on CRAN. We elaborate on robust location measures, and present robust t-test and ANOVA versions for independent and dependent samples, including quantile ANOVA. Furthermore, we present running interval smoothers as used in robust ANCOVA, strategies for comparing discrete distributions, robust correlation measures and tests, and robust mediator models. Keywords: robust statistics, robust location measures, robust ANOVA, robust ANCOVA, robust mediation, robust correlation. 1. Introduction. Data are rarely normal. Yet many classical approaches in inferential statistics assume normally distributed data, especially when it comes to small samples. For large samples the central limit theorem basically tells us that we do not have to worry too much. Unfortunately, things are much more complex than that, especially in the case of prominent, "dangerous" normality deviations such as skewed distributions, data with outliers, or heavy-tailed distributions. Before elaborating on the consequences of these violations within the context of statistical testing and estimation, let us look at the impact of normality deviations from a purely descriptive angle. It is trivial that the mean can be heavily affected by outliers or highly skewed distributional shapes. Computing the mean on such data would not give us the "typical" participant; it is just not a good location measure to characterize the sample.
  • Robustbase: Basic Robust Statistics
    Package 'robustbase', June 2, 2021. Version: 0.93-8 (VersionNote: 0.93-7 released on 2021-01-04 to CRAN). Date: 2021-06-01. Title: Basic Robust Statistics. URL: http://robustbase.r-forge.r-project.org/ Description: "Essential" Robust Statistics. Tools allowing to analyze data with robust methods. This includes regression methodology including model selections and multivariate statistics where we strive to cover the book "Robust Statistics, Theory and Methods" by Maronna, Martin and Yohai; Wiley 2006. Depends: R (>= 3.5.0). Imports: stats, graphics, utils, methods, DEoptimR. Suggests: grid, MASS, lattice, boot, cluster, Matrix, robust, fit.models, MPV, xtable, ggplot2, GGally, RColorBrewer, reshape2, sfsmisc, catdata, doParallel, foreach, skewt (SuggestsNote: mostly only because of vignette graphics and simulation). Enhances: robustX, rrcov, matrixStats, quantreg, Hmisc (EnhancesNote: linked to in man/*.Rd). LazyData: yes. NeedsCompilation: yes. License: GPL (>= 2). Author: Martin Maechler [aut, cre] (<https://orcid.org/0000-0002-8685-9910>), Peter Rousseeuw [ctb] (Qn and Sn), Christophe Croux [ctb] (Qn and Sn), Valentin Todorov [aut] (most robust Cov), Andreas Ruckstuhl [aut] (nlrob, anova, glmrob), Matias Salibian-Barrera [aut] (lmrob orig.), Tobias Verbeke [ctb, fnd] (mc, adjbox), Manuel Koller [aut] (mc, lmrob, psi-func.), Eduardo L. T. Conceicao [aut] (MM-, tau-, CM-, and MTL- nlrob), Maria Anna di Palma [ctb] (initial version of Comedian). Maintainer: Martin Maechler <[email protected]>. Repository: CRAN. Date/Publication: 2021-06-02 10:20:02 UTC. R topics documented (excerpt): adjbox, adjboxStats, adjOutlyingness, aircraft, airmay, alcohol, ambientNOxCH, Animals2, anova.glmrob, anova.lmrob.
  • Robust Statistics in BD Facsdiva™ Software
    BD Biosciences Tech Note, June 2012. Robust Statistics in BD FACSDiva™ Software. Robust statistics defined: Robust statistics provide an alternative approach to classical statistical estimators such as mean, standard deviation (SD), and percent coefficient of variation (%CV). These alternative procedures are more resistant to the statistical influences of outlying events in a sample population, hence the term "robust." Real data sets often contain gross outliers, and it is impractical to systematically attempt to remove all outliers by gating procedures or other rule sets. The robust equivalent of the mean statistic is the median. The robust SD is designated rSD and the percent robust CV is designated %rCV. For perfectly normal distributions, classical and robust statistics give the same results. How robust statistics are calculated in BD FACSDiva™ software: Median: The mean, or average, is the sum of all the values divided by the number of values. For example, the mean of the values [13, 10, 11, 12, 114] is 160 ÷ 5 = 32. If the outlier value of 114 is excluded, the mean is 11.5. The median is defined as the midpoint, or the 50th percentile, of the values. It is the statistical center of the population. Half the values should be above the median and half below. In the previous example, the median is 12 (13 and 114 are above; 10 and 11 are below). Note that this is close to the mean with the outlier excluded. Robust standard deviation (rSD): The classical SD is a function of the deviation of individual data points to the mean of the population.
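The worked example in the tech note is easy to reproduce. The rescaling constant 1.4826 below is the usual normal-consistency factor for a MAD-based robust SD; the exact rSD formula used by BD FACSDiva software is not given in the excerpt, so this is a generic stand-in, not BD's implementation:

```python
import statistics

values = [13, 10, 11, 12, 114]

mean = sum(values) / len(values)    # 160 / 5 = 32, pulled up by the outlier 114
med = statistics.median(values)     # 12, the statistical center of the data

# A common robust SD estimate rescales the median absolute deviation (MAD)
# so that it agrees with the classical SD for normally distributed data.
mad = statistics.median(abs(v - med) for v in values)
rsd = 1.4826 * mad
print(mean, med, rsd)
```

On this sample the MAD is 1, so the robust SD is about 1.48, in line with the tight cluster 10-13; the classical SD would be inflated by the single value 114.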
  • Application of Robust Regression and Bootstrap in Productivity Analysis of GERD Variable in EU27
    METHODOLOGY. Application of Robust Regression and Bootstrap in Productivity Analysis of GERD Variable in EU27. Dagmar Blatná, University of Economics, Prague, Czech Republic. Abstract: The GERD is one of the Europe 2020 headline indicators being tracked within the Europe 2020 strategy. The headline indicator is the 3% target for the GERD to be reached within the EU by 2020. Eurostat defines "GERD" as total gross domestic expenditure on research and experimental development as a percentage of GDP. GERD depends on numerous factors of a general economic background, namely of employment, innovation and research, science and technology. The values of these indicators vary among the European countries, and consequently the occurrence of outliers can be anticipated in corresponding analyses. In such a case, a classical statistical approach, the least squares method, can be highly unreliable, the robust regression methods representing an acceptable and useful tool. The aim of the present paper is to demonstrate the advantages of robust regression and the applicability of the bootstrap approach in regression based on both classical and robust methods. Keywords: LTS regression, MM regression, outliers, leverage points, bootstrap, GERD. JEL codes: C19, C49, O11, C13. INTRODUCTION: GERD represents total gross domestic expenditure on research and experimental development (R&D) as a percentage of GDP (Eurostat), R&D expenditure capacity being regarded as an important factor of economic growth. GERD is one of the Europe 2020 indicator sets used by the European Commission to monitor headline strategy targets for the next decade, A Strategy for Smart, Sustainable and Inclusive Growth (every country should invest 3% of GDP in R&D by 2020).
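The pairing of regression with the bootstrap that the paper advocates can be sketched with a nonparametric (pairs) bootstrap of a least-squares slope. The data below are made up for illustration; the paper's GERD/EU27 data are not reproduced:

```python
import random

def slope(pairs):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Made-up data with a roughly linear relation: true slope 0.3, gaussian noise.
random.seed(7)
data = [(float(x), 1.0 + 0.3 * x + random.gauss(0, 0.2)) for x in range(20)]

# Nonparametric (pairs) bootstrap: resample the (x, y) pairs with replacement
# and refit, collecting the distribution of the slope estimate.
boot = sorted(
    slope([random.choice(data) for _ in data])
    for _ in range(1000)
)
print(boot[25], boot[974])  # a rough 95% percentile confidence interval
```

The same resampling scheme works unchanged with a robust fitter (e.g., an LTS or MM estimator) in place of `slope`, which is the combination the paper studies.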
  • Robust Statistics
    A Brief Overview of Robust Statistics. Olfa Nasraoui, Department of Computer Engineering & Computer Science, University of Louisville, olfa.nasraoui_AT_louisville.edu. Robust Statistical Estimators: Robust statistics have recently emerged as a family of theories and techniques for estimating the parameters of a parametric model while dealing with deviations from idealized assumptions [Goo83, Hub81, HRRS86, RL87]. Examples of deviations include the contamination of data by gross errors, rounding and grouping errors, and departure from an assumed sample distribution. Gross errors, or outliers, are data severely deviating from the pattern set by the majority of the data. This type of error usually occurs due to mistakes in copying or computation. They can also be due to part of the data not fitting the same model, as in the case of data with multiple clusters. Gross errors are often the most dangerous type of errors. In fact, a single outlier can completely spoil the Least Squares estimate, causing it to break down. Rounding and grouping errors result from the inherent inaccuracy in the collection and recording of data, which is usually rounded, grouped, or even coarsely classified. The departure from an assumed model means that real data can deviate from the often assumed normal distribution. The departure from the normal distribution can manifest itself in many ways, such as in the form of skewed (asymmetric) or longer-tailed distributions. M-estimators: The ordinary Least Squares (LS) method to estimate parameters is not robust because its objective function, the sum of the squared residuals r_i^2, i = 1, ..., n, increases indefinitely with the residuals r_i between the data points and the estimated fit, with n being the total number of data points in a data set.
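The claim that a single outlier can completely spoil the Least Squares estimate is easy to demonstrate numerically (an illustrative sketch on made-up data):

```python
def ls_slope(xs, ys):
    """Ordinary least-squares slope of y on x (intercept included)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

xs = list(range(10))
ys = [2.0 * x for x in xs]   # a perfect line with slope 2
print(ls_slope(xs, ys))      # 2.0

ys[9] = -100.0               # a single gross error
print(ls_slope(xs, ys))      # far from 2: the slope even changes sign
```

Corrupting one of ten points drags the fitted slope from 2 to a negative value; bounding this unlimited influence of large residuals is precisely what M-estimators are designed to do.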
  • Winsorized and Smoothed Estimation of the Location Model in Mixed Variables Discrimination
    Appl. Math. Inf. Sci. 12, No. 1, 133-138 (2018). Applied Mathematics & Information Sciences, An International Journal. http://dx.doi.org/10.18576/amis/120112 Winsorized and Smoothed Estimation of the Location Model in Mixed Variables Discrimination. Hashibah Hamid, School of Quantitative Sciences, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia. Received: 27 Nov. 2017, Revised: 5 Dec. 2017, Accepted: 11 Dec. 2017, Published online: 1 Jan. 2018. Abstract: The location model is a familiar basis and excellent tool for discriminant analysis of mixtures of categorical and continuous variables compared to other existing discrimination methods. However, the presence of outliers affects the estimation of population parameters, and hence prevents the location model from providing an accurate statistical model and interpretation. In this paper, we construct a new location model through the integration of Winsorization and a smoothing approach, taking into account mixed variables in the presence of outliers. The newly constructed model successfully enhanced the model performance compared to the earlier developed location models. The results of the analysis proved that this new location model can be used as an alternative method for discrimination tasks by academicians and practitioners in future applications, especially when they encounter an outliers problem and have some empty cells in the data sample. Keywords: Discriminant analysis, classification, Winsorization, smoothing approach, mixed variables, error rate. 1. Introduction. Discriminant analysis has been used for classification not only on a single type of variables but also on mixture type [1, … The … approach is more appropriate for nonlinear discrimination problems [12]. Logistic discrimination proposed by [13] is a semi-parametric discrimination approach.
  • Robust Analysis of the Central Tendency, ISSN Impresa (Printed) 2011-2084 Simple and Multiple Regression and ANOVA: a Step by Step Tutorial
    International Journal of Psychological Research, 2010, Vol. 3, No. 1. ISSN impresa (printed) 2011-2084; ISSN electrónica (electronic) 2011-2079. Courvoisier, D.S., Renaud, O. (2010). Robust analysis of the central tendency, simple and multiple regression and ANOVA: A step by step tutorial. International Journal of Psychological Research, 3 (1), 78-87. Robust analysis of the central tendency, simple and multiple regression and ANOVA: A step by step tutorial. Análisis robusto de la tendencia central, regresión simple y múltiple, y ANOVA: un tutorial paso a paso. Delphine S. Courvoisier and Olivier Renaud, University of Geneva. ABSTRACT: After much exertion and care to run an experiment in social science, the analysis of data should not be ruined by an improper analysis. Often, classical methods, like the mean, the usual simple and multiple linear regressions, and the ANOVA require normality and absence of outliers, which rarely occurs in data coming from experiments. To palliate this problem, researchers often use some ad-hoc methods like the detection and deletion of outliers. In this tutorial, we will show the shortcomings of such an approach. In particular, we will show that outliers can sometimes be very difficult to detect and that the full inferential procedure is somewhat distorted by such a procedure. A more appropriate and modern approach is to use a robust procedure that provides estimation, inference and testing that are not influenced by outlying observations but correctly describes the structure of the bulk of the data. It can also give a diagnostic of the distance of any point or subject relative to the central tendency.
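The tutorial's point that ad-hoc detection and deletion of outliers can fail is easy to reproduce: clustered outliers inflate the classical mean and SD enough to mask each other, while a median/MAD rule still finds them (an illustrative sketch on made-up data, with an arbitrary 2.5 cutoff):

```python
import statistics

def zscore_outliers(xs, cutoff=2.5):
    """Classical detection rule: flag points more than cutoff SDs from the mean."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return [x for x in xs if abs(x - m) > cutoff * s]

# Two clustered outliers inflate the mean and SD enough to mask each other:
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 30.0, 31.0]
print(zscore_outliers(data))  # [] -- the classical rule finds nothing

# A median/MAD rule, of the kind robust procedures rely on, flags both:
med = statistics.median(data)
mad = statistics.median(abs(x - med) for x in data)
flagged = [x for x in data if abs(x - med) / (1.4826 * mad) > 2.5]
print(flagged)  # [30.0, 31.0]
```

This is the masking effect: deleting outliers requires detecting them first, and the classical detectors are themselves broken by the outliers they are supposed to find.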