
Open Journal of Statistics, 2012, 2, 281-296 http://dx.doi.org/10.4236/ojs.2012.23034 Published Online July 2012 (http://www.SciRP.org/journal/ojs)

Subsampling Method for Robust Estimation of Regression Models

Min Tsao, Xiao Ling
Department of Mathematics and Statistics, University of Victoria, Victoria, Canada
Email: [email protected]

Received March 29, 2012; revised April 30, 2012; accepted May 10, 2012

ABSTRACT

We propose a subsampling method for robust estimation of regression models which is built on classical methods such as the least squares method. It makes use of the non-robust nature of the underlying classical method to find a good sample from regression data contaminated with outliers, and then applies the classical method to the good sample to produce robust estimates of the regression model parameters. The subsampling method is a computational method rooted in the bootstrap methodology which trades analytical treatment for intensive computation; it finds the good sample through repeated fitting of the regression model to many random subsamples of the contaminated data instead of through an analytical treatment of the outliers. The subsampling method can be applied to all regression models for which non-robust classical methods are available. In the present paper, we focus on the basic formulation and robustness property of the subsampling method that are valid for all regression models. We also discuss variations of the method and apply it to three examples involving three different regression models.

Keywords: Subsampling Algorithm; Robust Regression; Outliers; Bootstrap; Goodness-of-Fit

1. Introduction

Robust estimation and inference for regression models is an important problem with a long history in statistics. Earlier work on this problem is discussed in [1] and [2]. The first book focusing on robust regression is [3], which gives a thorough coverage of robust regression methods developed prior to 1987. There have been many new developments in the last two decades. Reference [4] provides a good coverage of many recent robust regression methods. Although there are now different robust methods for various regression models, most existing methods involve a quantitative measure of the outlyingness of individual observations which is used to formulate robust estimators. That is, contributions from individual observations to the estimators are weighted depending on their degrees of outlyingness. This weighting by outlyingness is done either explicitly as in, for example, the GM-estimators of [5], or implicitly as in the MM-estimators of [6] through the use of ψ functions.

In this paper, we introduce an alternative method for robust regression which does not involve any explicit or implicit notion of outlyingness of individual observations. Our alternative method focuses instead on the presence or absence of outliers in a subset (subsample) of a sample, which does not require a quantitative characterisation of outlyingness of individual observations, and attempts to identify the subsamples which are free of outliers. Our method makes use of standard non-robust classical regression methods for both identifying the outlier-free subsamples and then estimating the regression model with the outlier-free subsamples. Specifically, suppose we have a sample consisting of mostly "good data points" from an ideal regression model and some outliers which are not generated by the ideal model, and we wish to estimate the ideal model. The basic idea of our method is to consider subsamples taken without replacement from the contaminated sample and to identify, among possibly many subsamples, "good subsamples" which contain only good data points. Then estimate the ideal regression model using only the good subsamples through a simple classical method. The identification of good subsamples is accomplished through fitting the model to many subsamples with the classical method, and then using a criterion, typically a goodness-of-fit measure that is sensitive to the presence of outliers, to determine whether the subsamples contain outliers. We will refer to this method as the subsampling method. The subsampling method has three attractive aspects: 1) it is based on elements of classical methods, and as such it can be readily constructed to handle all regression models for which non-robust classical methods are available; 2) under certain conditions, it provides unbiased estimators for the ideal regression model parameters; and 3) it is

easy to implement as it does not involve the potentially difficult task of formulating the outlyingness of individual observations and their weighting.

Point 3) above is particularly interesting as evaluating the outlyingness of individual observations is traditionally at the heart of robust methods, yet in the regression context this task can be particularly difficult. To further illustrate this point, denote by (X_i, Y_i) an observation where Y_i ∈ R^1 is the response and X_i ∈ R^q is the corresponding covariates vector. The outlyingness of observation (X_i, Y_i) here is with respect to the underlying regression model, not with respect to a fixed point in R^(q+1) as is the case in the location problem. It may be an outlier due to the outlyingness in either X_i or Y_i or both. In simple regression models where the underlying models have nice geometric representations, such as the linear or multiple linear regression models, the outlyingness of an (X_i, Y_i) may be characterized by extending measures of outlyingness for the location problem through, for example, the residuals. But in cases where Y_i is discrete, such as a binary, Poisson or multinomial response, the geometry of the underlying models is complicated and the outlyingness of (X_i, Y_i) may be difficult to formulate. With the subsampling method, we avoid the need to formulate the outlyingness of individual observations and instead focus on the consequence of outliers, that is, they typically lead to a poor fit. We take advantage of this observation to remove the outliers and hence achieve robust estimation of regression models.

It should be noted that traditionally the notion of an "outlier" is often associated with some underlying measure of outlyingness of individual observations. In the present paper, however, by "outliers" we simply mean data points that are not generated by the ideal model and will lead to a poor fit. Consequently, the removal of outliers is based on the quality of fit of subsamples, not a measure of outlyingness of individual points.

The rest of the paper is organized as follows. In Section 2, we set up notation and introduce the subsampling method. In Section 3, we discuss asymptotic and robustness properties of the subsampling estimator under general conditions not tied to a specific regression model. We then apply the subsampling method to three examples involving three different regression models in Section 4. In Section 5, we discuss variations of the subsampling method which may improve the efficiency and reliability of the method. We conclude with a few remarks in Section 6. Proofs are given in the Appendix.

2. The Subsampling Method

To set up notation, let z_i = (x_i, y_i) be a realization of a random vector Z_i = (X_i, Y_i) satisfying the regression model

E(Y_i) = g(x_i, β),  (1)

where Y_i ∈ R^1 is the response variable, X_i ∈ R^q is the corresponding covariates vector, g(x_i, β): R^q → R^1 is the regression function and β ∈ R^p is the regression parameter vector. To accommodate different regression models, the distributions of Y_i and X_i are left unspecified here. They are also not needed in our subsequent discussions.

Denote by S_N = {z_1, z_2, ..., z_N} a contaminated sample of N observations containing n "good data" generated by model (1) and m "bad data" which are outliers not from the model. Here n and m are unknown integers that add up to N. Let S_n and S_m be the (unknown) partition of S_N such that S_n contains the n good data and S_m contains the m bad data. To achieve robust estimation of β with S_N, the subsampling method first constructs a sample S_g to estimate the unknown good data set S_n, and then applies a (non-robust) classical estimator to S_g. The resulting estimator for β will be referred to as the subsampling estimator or SUE for β. Clearly, a reliable and efficient S_g which captures a high percentage of the good data points in S_n but none of the bad points in S_m is the key to the robustness and the efficiency of the SUE. The subsampling algorithm that we develop below is aimed at generating a reliable and efficient S_g.

2.1. The Subsampling Algorithm

Let A be a random sample of size n_s taken without replacement from S_N, which we will refer to as a subsample of S_N. The key idea of the subsampling method is to construct the estimator S_g for S_n by using a sequence of subsamples from S_N.

To fix the idea, for some n_s ≤ n, let A_1, A_2, A_3, ... be an infinite sequence of independent random subsamples each of size n_s. Let A*_1, A*_2, A*_3, ... be the subsequence of good subsamples, that is, subsamples which do not contain any outliers. Each of these sequences contains only a finite number of distinct subsamples. We choose to work with a repeating but infinite sequence instead of a finite sequence of distinct subsamples as that finite number may be very large, and the infinite sequence set-up provides the most convenient theoretical framework, as we will see below. Consider using the partial union

B_j = ∪_{i=1}^{j} A*_i  (2)

to estimate the good data set S_n. Clearly, B_j is a subset of S_n. The following theorem gives the consistency of B_j as an estimator for S_n.

Theorem 1. With probability one, {A*_1, A*_2, A*_3, ...} has infinitely many elements and


PB ==1.Sn (3)   AA,,, A*  and forms Sg as follows. 12    Proof: See the Appendix.  * ALGORITHM SAL ( nrks ,,)—Subsampling al- To measure the efficiency of B j as an estimator for S , let W be the number of points in B . Since gorithm based on  and  n j j * For chosen ( , ) and nrks ,, where nns  : BSj  n , it is an efficient estimator of Sn if the ratio Step 1: Randomly draw a subsample A1 of size ns Wnj is close to one. But for any finite j , Wj is a . Hence we use the expectation of the from data set SN . Step 2: Using method  , fit the regression model (1) ratio, which we denote by EBF j , to measure the to the subsample A1 obtained in Step 1 and compute the efficiency of B j . We have corresponding goodness-of-fit score 1 . EWj Step 3: Repeat Steps 1 and 2 for k times. Each time EBFj=,j=1,2,, record A , , the subsample taken and the associated n  j j  goodness-of-fit score at the jth repeat, for where EWj is the expected number of good data jk=1,2, , . points picked up by Bj . The following theorem gives a Step 4: Sort the k subsamples by their associated  simple expression of EB in terms of n and n . F  j  s values; denote by  12,,,  k  the ordered values of Theorem 2 The efficiency of B in recovering good j  j where ii 1 , and denote by A12,,,AA   k  data points is the correspondingly ordered subsamples. This puts the EW j subsamples in the order of least likely to contain outliers j nn s EBFj==1 ,=1,2,j  (4) to most likely to contain outliers according to the  nn  criterion. * Proof: See the Appendix. Step 5: Form Sg using the r subsamples with the smallest  values, that is Theorem 2 indicates that EBF j converges to 1 as j goes to infinity. The convergence is very fast when r* nn n is not close to 1, that is, the subsample size s SAg =. ()i (5) i=1 ns is not too small relative to n . Hence for properly chosen ns and j , Bj is an excellent estimator for We will refer to Sg as the combined sample. It esti- Sn . Nevertheless, in practice while we can generate the mates S using the pseudo-good subsamples sequence AAA,,, easily, the subsequence of n 123    *** good subsamples AA,,,A and hence B are not AA12,,,    A*  which is an approximate version of 123 j r available as we do not have the to determine A**,,,AA *. Formally, we now define the subsam-  12 r*  whether a subsample Ai is a good subsample or a bad pling estimator SUE for β as the estimator generated subsample containing one or more outliers. To deal with by applying method  to the combined sample Sg . *  this, for a fixed r  , the subsampling algorithm Although the method applied to Sg to generate the SUE described below finds a sequence of r* pseudo-good sub- does not have to be the same as the method used in the

 algorithm to generate Sg , for convenience we will use samples AA,,, A* , and takes their union to 12  r the same method in both places. The SUE does not have an analytic expression. It is an implicit function of form the estimator SSg for n . * nrks ,,,,  and SN , and as such we may write SUE Denote by  a classical method for regression * analysis, for example, the method of least squares. De- ( nrs ,,,,,k SN ) when we need to highlight the in- note by  an associated quantitative goodness-of-fit gredients. It should be noted that while B , the union of criterion, such as the mean squared error, AIC or BIC, r* A**,,,AA *, may be viewed as a random sample taken which may be sensitive to the presence of outliers on a 12 r* relative basis; that is, given two samples of the same size, without replacement from Sn , the combined sample Sg may not be viewed as such even when it contains no one contains outliers and another does not, upon fitting * regression model (1) with method  to the two sam- outliers. This is because the r pseudo-good subsam- ples,  can be used to effectively identify the one that   ples AA,,, A*  are not independent due to the contains outliers. Denote by  the numerical score 12   r given by the criterion  upon fitting the model (1) to a dependence in their  -scores, which are the smallest r* subsample A using method  , and suppose that a order statistics of the k  -scores. Consequently, Sg small  value means a good fit of model (1) to A . The is not a simple random sample. This complicates the subsampling algorithm finds pseudo-good subsamples distributional theory of the SUE but fortunately it has
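Because the algorithm only requires a fitting routine and a goodness-of-fit score, it is easy to prototype. The following is a minimal R sketch of SAL(n_s, r*, k) for a linear model, assuming least squares (lm) as the method η and the MSE as the γ score; the function name sal_lm and its interface are our own illustration, not code from the paper.

```r
# Minimal sketch of SAL(n_s, r*, k) for a linear model.
# Assumes least squares (lm) as the classical method eta and the MSE as the
# goodness-of-fit score gamma; 'formula' and 'data' are supplied by the user.
sal_lm <- function(formula, data, ns, r_star, k) {
  N <- nrow(data)
  scores  <- numeric(k)
  indices <- vector("list", k)
  for (j in 1:k) {
    idx <- sample(N, ns)                       # Step 1: subsample without replacement
    fit <- lm(formula, data = data[idx, ])     # Step 2: fit the model ...
    p   <- length(coef(fit))
    scores[j]    <- sum(residuals(fit)^2) / (ns - p)  # ... and record the gamma score
    indices[[j]] <- idx
  }                                            # Step 3: repeated k times
  ord <- order(scores)                         # Step 4: order subsamples by gamma
  good_idx <- sort(unique(unlist(indices[ord[1:r_star]])))   # Step 5: combine best r*
  list(sue = lm(formula, data = data[good_idx, ]),  # SUE: refit on the combined sample
       combined_index = good_idx)
}
```

For the contaminated sample of Example 1 below, a call such as sal_lm(y ~ x, data, ns = 17, r_star = 2, k = 200) would typically recover at least 17 of the 18 good points.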

We conclude this subsection with an example typical of situations where we expect the algorithm to be successful in finding a good combined sample S_g for estimating S_n.

Example 1: Consider the simple linear model

y = 3 + 5x + ε,  ε ~ N(0, σ² = 4).  (6)

We generated n = 18 observations from model (6) and then added m = 2 outliers to form a data set of size N = n + m = 20. See Figure 1(a) for the scatter plot of this data set. To recover the 18 good data points, consider using the subsampling algorithm with the method of least squares as η and the MSE (SS_residual/(n_s − 2)) as the γ criterion. For n_s = 17, there are 1140 such subsamples, 18 of which contain no outliers. We fitted the simple linear model using the method of least squares to all 1140 subsamples and computed the MSE for each fit. Figure 1(b) shows the plot of the 1140 MSEs versus the corresponding subsample labels. The MSEs of the 18 good subsamples are at the bottom, and they are substantially smaller than those of the other subsamples. If we are to run SAL(n_s = 17, r* = 2, k) for a suitably large k, the resulting combined sample S_g is likely the union of two good subsamples, recovering at least 17 of the 18 good data points.

For this example, the total number of distinct subsamples of size n_s = 17 is only 1140, which is not large. Instead of randomly generating subsamples in Steps 1-3 of SAL(n_s, r*, k), we can modify it to go through all 1140 distinct subsamples. The combined sample S_g given by such a modified SAL(n_s = 17, r* = 2, k = 1140) will recover all 18 good data points in S_n. Consequently, the SUE reduces to the least squares estimator based on S_n and hence existing theory for the latter applies to the SUE.

Figure 1. (a) Scatter plot of a contaminated sample of 20 from model (6) and the true regression line; (b) MSE from a least squares fit of a simple linear model to a subsample versus its label.

2.2. Parameter Selection for the Subsampling Algorithm

In real applications, the number of distinct subsamples of size n_s may be too large for going through all of them to be feasible. Thus we need to use random sampling in Steps 1-3 of SAL(n_s, r*, k). The parameter selection discussed below is for SAL(n_s, r*, k) with random sampling only. Also, the discussion below is independent of the underlying regression model. Thus we will not deal with the selection of η and γ (which is tied to the model) here but will address this for some regression models in Section 4 when we discuss specific applications.

The objective of SAL(n_s, r*, k) is to generate a combined sample S_g to estimate the good data set S_n. To this end, we want to carefully select the parameter values to improve the chances that 1) the r* pseudo-good subsamples Ã*_1, Ã*_2, ..., Ã*_{r*} are indeed good subsamples from S_n, and 2) S_g is efficient, that is, it captures a high percentage of points in S_n. The parameter selection strategy below centres around meeting these conditions.

1) Selecting n_s—The subsample size

Proper selection of n_s is crucial for meeting condition 1). The first rule for selecting n_s is that it must satisfy m < n_s ≤ n. Under this condition, the A_1, A_2, ..., A_k generated by Steps 1-3 are either good subsamples containing no outliers or subsamples containing a mix of good data points and outliers. The γ-score is the most effective in identifying good subsamples from such a sequence, as they are the ones with small γ-scores. If n_s > n, there will not be any good subsamples in the sequence. If n_s < m, there could be bad subsamples consisting entirely of outliers. When the outliers follow the same model as the good data but with a different β, the γ-scores for bad subsamples of only outliers may also be very small. This could cause the subsampling algorithm to mistake such bad subsamples for good subsamples, resulting in violations of condition 1). In real applications, values of m and n may be unknown but one may have estimates of these which can be used to select n_s. In the absence of such estimates, we recommend a default value of "half plus one", i.e. n_s = int[0.5N] + 1, where int[x] is the integer part of x. This default choice is based on the assumption that m < n, which is standard in many robustness studies. Under this assumption and without further information, the only choice of n_s that is guaranteed to satisfy m < n_s ≤ n is the default value.


nNs = int 0.5  1 where int[x] is the integer part of x . for this probability. We now consider selecting k with * * This default choice is based on the assumption that given ns , r and p . m< n which is standard in many robustness studies. Let pg be the probability that A , a random sub-

Under this assumption and without further information, sample of size ns from SN , is a good subsample con- the only choice of n that is guaranteed to satisfy s taining no outliers. Under the condition that nns  , we mn< s  n is the default value. have Note that for each application, ns must be above a certain minimum, i.e., np1 . For example, if we let n m s  n =2 when model (1) is a simple linear model where n 0 s p =>0.s  (9) p =2, we will get trivially perfect fit for all subsamples. g nm In this case, the  -score is a constant and hence cannot  n be used to differentiate good subsamples from the bad s ones. Now let T be the total number of good subsamples * 2) Selecting r —The number of subsamples to be in A12,,,AA k  . Then T is a binomial random combined in Step 5 variable with TBinkp , g . Let pki,= PT  i . *   The selection of r is tied to condition (2). For Then simplicity, here we ignore the difference between the pki, = P at least i good subsamples in k      pseudo-good subsamples AA,,,A identified 12  * i1 (10) r k j kj =1ppgg 1 . by the subsampling algorithm and the real good sub- j=0j samples A**,,,AA *. This allows us to assume that * 12r * Since the value of r has been determined in step [b] above, pkr, * is only a function of k and the S is the same as a B * and use the efficiency measure   g r * of B for S . By (4), desired k value is determined by p through r* g r* kplrp=argmin ,** =0.99 . (11) nn s l     ESFg==EBF* 1  . (7) r n In real applications where m and hence n are un- known, we choose a working proportion   0.0.5 , For a given (desired) value of the efficiency ESF  g  , 0   we find the r* value needed to achieve this by solving representing either our estimate of the proportion of (7) for r* ; in view of that r* must be an integer, we outliers or the maximum proportion we will consider. have Then use mN=int 0  to determine k . We will use 0.1 as the default value for 0 which represents a  moderate level of contamination. log 1 ESFg  r* =int1. (8) * Finally, k is a function of r and pg , both of which log nns log n  are functions of ns . Thus k is ultimately a function of only n . We may link the selection of all three para- When n is the default value of int 0.5N  1 , the s s   meters together by looking for the optimal n which maximum r* required to achieve a 99% efficiency s will minimize k subject to the (estimated) constraint ( ESFg= 99% ) is 7. This maximum is reached when mn< s  n, where m and n are estimates based on nN= .  . Such an optimal n can be found by simply com- In simulation studies, when r* is chosen by (8), the 0 s puting the k values corresponding to all ns satisfying observed efficiency, as measured by the actual number of the constraint. We caution, however, that this optimal n points in S divided by n , tends to be lower than the s g may be smaller than 0.5N  1 . As such, it may not expected value used in (8) to find r* . This is likely the satisfy the real but unknown constraint of mn<  n. consequence of the dependence among the pseudo-good s subsamples. We will further comment on this depen- To give examples of parameter values determined dence in the next section. 
through the above three steps and to prepare for the 3) Selecting k —The total number of subsamples applications of the subsampling algorithm in subsequent * to be generated examples, we include in Table 1 the r values and k values required in order to achieve an efficiency of For a finite k , the sequence A12,,,AA k gene- * rated by Steps 1-3 may or may not contain r good ESFg  = 99% for various combinations of m and n * subsamples. But we can set k sufficiently large so that values. To compute this table, both p and ns are set the probability of having at least r* good subsamples, to their default values of 0.99 and 0.5N  1, respec- * * p , is high. We will use p =0.99 as the default value tively.
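The R program mentioned above is not reproduced in the paper; the following is our own sketch of the calculation implied by (8)-(11), with default arguments matching the stated defaults.

```r
# Sketch of the parameter calculation in (8)-(11): returns ns, r*, p_g and k
# for sample size N, working outlier proportion alpha0, target efficiency EF
# and probability p* of obtaining at least r* good subsamples.
sal_parameters <- function(N, alpha0 = 0.1, EF = 0.99, p_star = 0.99,
                           ns = floor(0.5 * N) + 1) {
  m <- floor(alpha0 * N)                  # assumed number of outliers
  n <- N - m                              # assumed number of good points
  stopifnot(m < ns, ns <= n)
  # Equation (8): smallest integer r* with 1 - ((n - ns)/n)^r* >= EF
  r_star <- floor(log(1 - EF) / (log(n - ns) - log(n))) + 1
  # Equation (9): probability that a random subsample of size ns is good
  p_g <- choose(n, ns) / choose(N, ns)
  # Equations (10)-(11): smallest k with P(at least r* good subsamples) >= p*
  k <- r_star
  while (pbinom(r_star - 1, size = k, prob = p_g) > 1 - p_star) k <- k + 1
  list(ns = ns, r_star = r_star, p_g = p_g, k = k)
}
# Example: sal_parameters(60, alpha0 = 0.2) gives ns = 31, r* = 5 and a value
# of k close to the n = 48, m = 12 row of Table 1.
```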


Table 1. The number of subsamples k and the number of good subsamples r* required to achieve an efficiency of EF(S_g) = 99%. Exact values of m and n and default values of p* and n_s were used for computing these k and r* values.

Sample size N    Number of good data n    Number of outliers m    Subsample size n_s    r*        Number of subsamples k
N = 20           n = 20                   m = 0                   n_s = 11              r* = 6    k = 6
                 n = 18                   m = 2                   n_s = 11              r* = 5    k = 58
                 n = 16                   m = 4                   n_s = 11              r* = 4    k = 383
N = 60           n = 60                   m = 0                   n_s = 31              r* = 7    k = 7
                 n = 54                   m = 6                   n_s = 31              r* = 6    k = 1378
                 n = 48                   m = 12                  n_s = 31              r* = 5    k = 312,912

The computationally intensive nature of SAL(n_s, r*, k) can be seen in the case given by (N, m, n) = (60, 12, 48), where to achieve an efficiency of 99% we need to fit the model in question to 312,912 subsamples of size 31 each. This could be a problem if it is time consuming to fit the model to each subsample. To visualize the relationship among various parameters of the algorithm, for the case of (m, n) = (12, 48), Figure 2 shows the theoretical efficiency of the algorithm EF(S_g) versus r*. The dashed black line in the plot represents the 99% efficiency line. With subsample size n_s = 31, we see that the efficiency curve rises rapidly as r* increases. At r* = 5, the efficiency reaches 99%. For this example, we also plotted the probability p_i = p(k, i) of having at least i good subsamples versus i when the total number of subsamples is set at k = 7000 in Figure 3(a). We see that at this k value, the probability of having at least r* = 5 (required for 99% efficiency) is only about 0.96. To ensure a high probability of having at least 5 good subsamples, we need to increase k. In Figure 3(b), we plotted the same curve but at k = 312,912 as was computed in Table 1. The probability of having at least 5 good subsamples is now at 0.99.

Figure 2. The efficiency of the subsampling algorithm SAL(n_s, r*, k) as a function of the number of good subsamples r* that form the combined sample S_g. The n_s is set at the default value. The dashed line is the 99% efficiency line.

Figure 3. (a) The probability (p_i) of having at least i good subsamples in k = 7000 subsamples; (b) The probability (p_i) of having at least i good subsamples in k = 312,912 subsamples. The horizontal black line is the p_i = 0.99 line.

To summarize, to select the parameters n_s, r* and k for SAL(n_s, r*, k), we need the following information: 1) a desired efficiency EF(S_g) (default = 0.99), 2) a working proportion of outliers α_0 (default = 0.1) and 3) a probability p* (default = 0.99) of having at least r* good subsamples in k random subsamples of size n_s. We have developed an R program which computes the values of r* and k for any combination of (N, n_s, EF(S_g), α_0, p*). The default value of this input vector is (N, int[0.5N] + 1, 99%, 0.1, 0.99). Note that the determination of the algorithm parameters does not depend on the actual model being fitted.

3. Asymptotic and Robustness Properties of the Subsampling Estimator

In this section, we discuss the asymptotic and robustness properties of the SUE under conditions not tied to a specific regression model.

3.1. The Asymptotic Distribution of the Subsampling Estimator

We first briefly discuss the asymptotic distribution of the SUE with respect to n_e, the size of the combined sample S_g. We will refer to n_e as the effective sample size. Although n_e is random, it is bounded between n_s and N. Under the default setting of n_s = int[0.5N] + 1, n_e will approach infinity if and only if N approaches infinity. Also, when the proportion of outliers and hence the ratio n/N are fixed, n_e will approach infinity if and only if n does. So asymptotics with respect to n_e is equivalent to that with respect to N or n.

Let β̂ be the SUE for β and consider its asymptotic distribution under the assumption that S_g may be viewed as a random sample from the good data set S_n. Since S_n is a random sample from the ideal model (1), under the assumption S_g is also a random sample from model (1). Hence β̂ is simply the (classical) estimator given by method η and the random sample S_g.


Its asymptotic distribution is then given by the existing asymptotic theory for method η. For example, for a linear model such as that in Example 1, the SUE β̂ generated by the least squares method will have the usual asymptotic normal distribution under the assumption. The asymptotic normal distribution can then be used to make inference about β. Thus in this case, there is no need for new asymptotic theory for the SUE.

In some cases, such as when it captures all the good data points, S_g may indeed be considered as a random sample from the good data. Nevertheless, as we have noted in Section 2.1, in general S_g is not a random sample due to a selection bias of the subsampling algorithm; the r* subsamples forming S_g are the subsamples which fit the model the best according to the γ criterion, and as such they tend to contain only those good data points which are close to the underlying regression curve. For the simple linear model, for example, good data points at the boundary of the good data band are more likely to be missed by S_g. Consequently, there may be less variation in S_g than in a typical random sample from the model. This may lead to an under-estimation of the model variance although the SUE for β may still be unbiased. The selection bias depends on the underlying model and the method η. Its impact on the asymptotic distribution of the SUE β̂ needs to be studied on a case by case basis.

3.2. The Breakdown Robustness and the Breakdown Probability Function of the Subsampling Estimator

While a unified asymptotic theory for the SUE is elusive due to its dependence on the underlying model (1), the breakdown properties of the SUE presented below do not depend on the model and thus apply to SUEs for all models.

Consider an SUE(n_s, r*, k, η, γ, S_N) where n_s, r*, k and N are fixed. Denote by α the proportion of outliers in S_N and hence α = m/N. Due to the non-robust nature of the classical method η, the SUE will break down if there are one or more outliers in S_g. Thus it may break down whenever α is not zero, as there is a chance that S_g contains one or more outliers. It follows that the traditional notion of a breakdown point, the threshold α value below which an estimator will not break down, cannot be applied to measure the robustness of the SUE. The breakdown robustness of the SUE is better characterized by the breakdown probability as a function of α. This breakdown probability function, denoted by BP(α), is just the probability of not having at least r* good subsamples in a random sequence of k subsamples of size n_s when the proportion of outliers is α. That is

BP(α) = P(fewer than r* good subsamples in k).  (12)

A trivial case arises when n_s > n. In this case, BP(α) = 1 regardless of the α value. For the case of interest where n_s ≤ n, we have from (9)

p_g(α) = C(N(1 − α), n_s) / C(N, n_s) > 0,  (13)

where m = Nα and n = N(1 − α). By (10), the breakdown probability function is

BP(α) = 1 − p(k, r*) = Σ_{j=0}^{r*−1} C(k, j) p_g(α)^j (1 − p_g(α))^{k−j}.  (14)

In deriving (14), we have assumed implicitly that the γ criterion will correctly identify good subsamples without outliers. This is reasonable in the context of discussing the breakdown of the SUE, as we can assume the outliers are arbitrarily large and hence any reasonable γ criterion would be accurate in separating good subsamples from bad subsamples.
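Equations (13) and (14) translate directly into a few lines of R; the function below is our own sketch, not code from the paper.

```r
# Breakdown probability BP(alpha) from (13)-(14): the probability of obtaining
# fewer than r* good subsamples among k random subsamples of size ns when the
# proportion of outliers is alpha.
breakdown_probability <- function(alpha, N, ns, r_star, k) {
  n <- round(N * (1 - alpha))               # number of good data points
  if (ns > n) return(1)                     # trivial case: no good subsample exists
  p_g <- choose(n, ns) / choose(N, ns)      # equation (13)
  pbinom(r_star - 1, size = k, prob = p_g)  # equation (14)
}
# Example: breakdown_probability(0.1, N = 60, ns = 31, r_star = 6, k = 1378)
# is about 0.01 by the construction of k in Table 1.
```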

The concept of the breakdown probability function can also be applied to traditional robust estimators. Let ε* be the breakdown point of some traditional robust estimator. Then its breakdown probability function is the following step function

BP(α) = 0 if α < ε*, and BP(α) = 1 if α ≥ ε*.  (15)

Figure 4 contains the breakdown probability functions of three SUEs for the case of N = 60. These SUEs are defined by the default values of EF(S_g), n_s and p* with working proportions of outliers of α_0 = 0.1, 0.2, 0.4, respectively. Their breakdown functions (in dotted lines) are identified by their working proportions. The breakdown function for the SUE with α_0 = 0.4 is uniformly smaller than the other two and hence this SUE is the most robust. This is not surprising as it is designed to handle 40% outliers. For comparison, the breakdown probability function of a hypothetical traditional robust estimator (non-SUE) with a breakdown point of 0.3 is also plotted (in solid line) in Figure 4. Here we see that the SUE and the traditional robust estimator are complementary to each other; whereas the latter will never break down so long as the proportion of outliers is less than its breakdown point but will for sure break down otherwise, an SUE has a small probability of breaking down even when the proportion is lower than what it is designed to handle, but this is compensated by the positive probability that it may not break down even when the proportion is higher. Incidentally, everything else being fixed, the k value associated with the SUE increases rapidly as the working proportion increases. The excellent robustness of the SUE for α_0 = 0.4, for example, comes at the price of a huge amount of computation.

Figure 4. Breakdown probability functions (BPs) of three SUEs and one non-SUE for a data set of size N = 60. BPs for the SUEs designed to handle α_0 = 10%, 20% and 40% outliers are indicated by their associated α_0 value.

The breakdown probability function may be applied to select parameter k for the subsampling algorithm. Recall that in Section 2.2, after n_s and r* are chosen and the working proportion of outliers α_0 is fixed, we find the value of k by using a predetermined p*, the probability of having at least r* good subsamples in k. In view of the breakdown probability function, this amounts to selecting k by requiring BP(α_0) = 1 − p*, which is a condition imposed on BP(α) at a single point α_0. An alternative way of selecting k is to impose a stronger condition on BP(α) over some interval of interest.

Note that everything else being equal, we can get a more robust SUE by using a larger k. For practical applications, however, we caution that a very large k will compromise both the computational efficiency of the subsampling algorithm and the efficiency of the combined sample S_g as an estimator of the good data set. The latter point is due to the fact that in practice, the subsamples forming S_g are not independent random samples from the good data set; in the extreme case where k goes to infinity, the subsample with the smallest γ-score will appear infinitely many times, and thus all r* subsamples in the union forming S_g are repeats of this same subsample. This leads to the lowest efficiency for S_g with EF(S_g) = n_s/n. Thus when selecting the k value, it is necessary to balance the robustness and the efficiency of the SUE.

To conclude Section 3, we note that although the selection bias problem associated with the combined sample S_g can make the asymptotic theory of the SUE difficult, it has little impact on the breakdown robustness of the SUE. This is due to the fact that to study the breakdown of the SUE, we are only concerned with whether S_g contains any outliers. As such, the size of S_g and the possible dependence structure of points within are irrelevant. Whereas in Section 3.1 we had to make the strong assumption that S_g is a random sample from the good data in order to borrow the asymptotic theory from the classical method η, here we do not need this assumption. Indeed, as we have pointed out after (14), the breakdown results in this section are valid under only a weak assumption, that is, the γ criterion employed is capable of separating good subsamples from subsamples containing one or more arbitrarily large outliers. Any reasonable γ should be able to do so.

4. Applications of the Subsampling Method

In this section, we apply the subsampling method to three real examples through which we demonstrate its usefulness and discuss issues related to its implementation. For the last example, we also include a small simulation study on the finite sample behaviour of the SUE.

An important issue concerning the implementation of the subsampling method which we have not considered in Section 2 is the selection of the classical method η and the goodness-of-fit criterion γ.

For linear regression and non-linear regression models, the least squares estimation (LSE) method and the mean squared error (MSE) are good choices for η and γ, respectively, as the LSE and MSE are very sensitive to outliers in the data. Outliers will lead to a poor fit by the LSE, resulting in a large MSE. Thus a small value of the MSE means a good fit. For logistic regression and other generalized linear models, the maximum likelihood estimation (MLE) method and the deviance (DV) can be used as η and γ, respectively. The MLE and DV are also sensitive to outliers. A good fit should have a small DV. If the ratio DV/(n_e − p) is much larger than 1, then it is not a good fit.

Another important issue is the proper selection of the working proportion of outliers or, equivalently, the (estimated) number of outliers m in the sample. This is needed to determine the r* and k to run the subsampling algorithm. Ideally, the selected m value should be slightly above the true number of outliers as this will lead to a robust and efficient SUE. If we have some information about the proportion of outliers in the sample, such as a tight upper bound, we can use this information to select m. In the absence of such information, we may use several values for m to compute the SUE and identify the most proper value for the data set in question. For m values above the true number of outliers, the SUE will give consistent estimates for the model parameters. Residual plots will also look consistent in terms of which points on the plots appear to be outliers. We now further illustrate these issues in the examples below.

Example 2: Linear model for stackloss data

The well-known stackloss data from Brownlee [7] has 21 observations on four variables concerning the operation of a plant for the oxidation of ammonia to nitric acid. The four variables are stackloss (y), air flow rate (x_1), cooling water inlet temperature (x_2) and acid concentration (x_3). We wish to fit a multiple linear regression model,

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε,

to this data. We use the LSE and MSE for η and γ, respectively, in the SUE. We also try three m values, m = 2, 4 and 6, which represent roughly 10%, 20% and 30% working proportions of outliers in the data. The subsample size is chosen to be the default size of n_s = 11. The corresponding values for r* and k in the SAL and the estimates for the regression parameters are given in Table 2. For comparison, Table 2 also includes the estimates given by the LSE and MME, a robust estimator introduced in [6]. The residual versus fitted value plots for the LSE and SUE are in Figure 5. Since the regression parameter estimates for the SUE with m = 4 and m = 6 are consistent and the corresponding residual plots identify the same 4 outliers, m = 4 is the most reasonable choice.

Figure 5. Residual versus fitted value plots for Example 2: (a) LSE; (b) SUE with m = 2; (c) SUE with m = 4; (d) SUE with m = 6. The dashed lines are ±3σ̂, which are used to identify outliers.
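Example 2 can be reproduced approximately with the stackloss data set that ships with R, assuming it matches the data analysed here, together with the sal_lm sketch given after the algorithm in Section 2.1; the call below illustrates the m = 4 setting and is our own, hedged reconstruction rather than the authors' code.

```r
# Hedged sketch of the m = 4 setting of Example 2 using R's built-in stackloss
# data and the sal_lm sketch from Section 2.1.
set.seed(1)
fit <- sal_lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
              data = stackloss, ns = 11, r_star = 5, k = 327)
summary(fit$sue)                       # SUE coefficients from the combined sample
setdiff(1:21, fit$combined_index)      # observations left out: candidate outliers
```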


Table 2. Regression parameter estimates with standard errors in brackets for Example 2.

Parameter      LSE               MME              SUE (m = 2)         SUE (m = 4)         SUE (m = 6)
                                                  r* = 6, k = 57      r* = 5, k = 327     r* = 4, k = 2593
β_0            −39.92 (11.90)    −41.52 (5.30)    −38.59 (10.93)      −37.65 (4.73)       −36.72 (3.65)
β_1            0.72 (0.13)       0.94 (0.12)      0.77 (0.13)         0.80 (0.07)         0.84 (0.05)
β_2            1.30 (0.37)       0.58 (0.26)      1.09 (0.35)         0.58 (0.17)         0.45 (0.13)
β_3            −0.15 (0.16)      −0.11 (0.07)     −0.16 (0.14)        −0.07 (0.06)        −0.08 (0.05)
σ̂              3.24              1.91             2.97                1.25                0.93
Sample size    n = 21            n = 21           n_e = 20            n_e = 17            n_e = 15

The effective sample size for S_g when m = 4 is n_e = 17, and hence this S_g includes all the good data points in the data and the SUE is the most efficient. It is clear from Table 2 and Figure 5 that the LSE and the SUE with m = 2 fail to identify any outliers and their estimates are influenced by the outliers. The robust MME identifies two outliers, and its estimates for β_1, β_2 and β_3 are slightly different from those given by the SUE with m = 4. Since the MME is usually biased in the estimation of the intercept β_0, the estimate of β_0 from the MME is quite different. This data set has been analysed by many authors, for example, Andrews [8], Rousseeuw and Leroy [3] and Montgomery et al. [9]. Most of these authors concluded that there are four outliers in the data (observations 1, 3, 4 and 21), which is consistent with the result of the SUE with m = 4.

Note that the SUE is based on the combined sample which is a trimmed sample. A large m value assumes more outliers and leads to heavier trimming and hence a smaller combined sample. This is seen from the SUE with m = 6 where the effective sample size n_e is 15 instead of 17 for m = 4. Consequently, the resulting estimate for the variance is lower than that for m = 4. However, the estimates for the regression parameters under m = 4 and m = 6 are comparable, reflecting the fact that under certain conditions the SUEs associated with different parameter settings of the SAL algorithm are all unbiased.

Example 3: Logistic regression for coal miners data

Ashford [10] gives a data set concerning incidents of severe pneumoconiosis among 371 coal miners. The 371 miners are divided into 8 groups according to the years of exposure at the coal mine. The values of three variables, "years" of exposure (denoted by x) for each group, "total number" of miners in each group, and the number of "severe cases" of pneumoconiosis in each group, are given in the data set. The response variable of interest is the proportion of miners who have symptoms of severe pneumoconiosis (denoted by Y). The 8 group proportions of severe pneumoconiosis are plotted in Figure 6(a) with each circle representing one group. Since it is reasonable to assume that the corresponding number of severe cases for each group is a binomial random variable, on page 432 of [9] the authors considered a logistic regression model for Y, i.e.,

E(Y) = exp(β_0 + β_1 x) / (1 + exp(β_0 + β_1 x)).

To apply the subsampling method for logistic regression, we choose the MLE method and the deviance DV as η and γ, respectively. With N = 8 groups, we set m = 1 and 2 in the computation, and set the subsample size to n_s = 5. The corresponding values for r* and k are (r*, k) = (4, 23) and (r*, k) = (3, 76) for m = 1 and 2, respectively. The original data set has no apparent outliers. In order to demonstrate the robustness of the SUE, we created one outlier group by changing the number of severe cases for the 27.5 years of exposure group from the original 8 to 18. Consequently, the sample proportion of severe pneumoconiosis cases for this group has been changed from the initial 8/48 to 18/48. Outliers such as this can be caused, for example, by a typo in practice. The sample proportions with this one outlier are plotted in Figure 6(b). The regression parameter estimates from the various estimators are given in Table 3, where the M method is the robust estimator from [11]. The fitted lines given by the MLE, the SUE and the M method are also plotted in Figure 6. For both data sets, the SUE with m = 1 and 2 gives the same result. The SUE does not find any outliers for the original data set, while it correctly identifies the one outlier for the modified data set. For the original data set, the three methods give almost the same estimates for the parameters, and this is reflected by their fitted lines (of proportions), which are nearly the same as can be seen in Figure 6(a). For the modified data set, the SUE and the M method are robust, and their fitted lines are in Figure 6(b). The outlier has little or no influence on these fitted lines.


Figure 6. Sample proportions and fitted proportions for Example 3: (a) The original data set with no outlier group; (b) Modified data with one outlier group. Note that in (a) the fitted lines are all the same for the MLE, SUE and M methods, while in (b) they are different.
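For the grouped logistic regression of Example 3, the deviance reported by glm serves as the γ score. The sketch below is our own illustration (the function name and the assumed column names years, severe and total are not from the paper); it shows the score for one subsample of groups, while the rest of the SAL algorithm is unchanged.

```r
# Deviance-based gamma score for a subsample of groups in a grouped logistic
# regression; assumes columns 'severe', 'total' and 'years' in the subsample.
gamma_logistic <- function(sub) {
  fit <- glm(cbind(severe, total - severe) ~ years,
             family = binomial, data = sub)
  deviance(fit)                        # a small deviance indicates a good fit
}
# In the SAL algorithm one would draw subsamples of n_s = 5 of the 8 groups,
# score them with gamma_logistic(), and refit glm() on the union of the r*
# best subsamples to obtain the SUE.
```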

Table 3. Regression parameter estimates with standard errors in brackets for Example 3.

               Original data     With one outlier group
Parameter      SUE / MLE         MLE              SUE              M
β_0            −4.80 (0.57)      −4.07 (0.46)     −4.80 (0.59)     −5.24 (0.70)
β_1            0.09 (0.02)       0.08 (0.01)      0.09 (0.02)      0.10 (0.02)

Example 4: Non-linear model for enzymatic reaction data

To analyse an enzymatic reaction data set, one of the models that Bates and Watts [12] considered is the well-known Michaelis-Menten model, a non-linear model given by

y = θ_0 x / (θ_1 + x) + ε,  (16)

where θ_0 and θ_1 are regression parameters, and the error ε has variance σ². The data set has N = 12 observations of treated cases. The response variable y is the velocity of an enzymatic reaction and x is the substrate concentration in parts per million. To compute the SUE, we use the LSE method and the MSE for η and γ, respectively, and set m = 2 and n_s = 7, which lead to r* = 4 and k = 63. Figure 7(a) shows the scatter plot of y versus x and the fitted lines for the LSE (dotted) and SUE (solid), and Figure 7(b) shows the residual plot from the SUE fit. Since there is only one mild outlier in the data, the estimates from the LSE and SUE are similar and they are reported in Table 4.

We also use model (16) to conduct a small simulation study to examine the finite sample distribution of the SUE. We generate 1000 samples of size N = 12 where, for each sample, 10 observations are generated from the model with θ_0 = 215, θ_1 = 0.07 and a normally distributed error with mean 0 and σ = 8. The other 2 observations are outliers generated from the same model but with a different error distribution: a normal error distribution with mean 30 and σ = 1. The two outliers are y outliers, and Figure 7(c) shows a typical sample with different fitted lines for the LSE and SUE. The estimated parameter values are also reported in Table 4. For each sample, the SUE estimates are computed with n_s = 7, r* = 4 and k = 63. Figure 8 shows the histograms for θ̂_0, θ̂_1, σ̂ and the sample size of S_g. The dotted vertical lines are the true values of θ_0, θ_1, σ and n = 10. Table 5 shows the biases and standard errors for the LSE and SUE based on the simulation study. The distributions of the SUE estimators θ̂_0 and θ̂_1 are approximately normal, and the biases are much smaller than those of the LSE. That of the estimated variance also looks like a χ² distribution. The average effective sample size of S_g is 9.93, which is very close to the number of good data points n = 10. There are a small number of cases where the effective sample size is 12. These are likely cases where the "outliers" generated were mild or benign outliers and are thus included in the combined sample.


Figure 7. Fitted regression lines for Example 4: (a) The LSE and SUE lines for the real data set; (b) SUE fit residual plot for the real data set; (c) The LSE and SUE lines for the simulated data with outliers; (d) SUE fit residual plot for the simulated data. Dotted lines in plots (b) and (d) are the ±3σ̂ lines.

Figure 8. Histograms for the SUE: (a) Estimator θ̂_0; (b) Estimator θ̂_1; (c) Estimator σ̂; (d) The sample size of S_g. The vertical dotted lines are the true values of θ_0, θ_1, σ and n.
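One replication of the simulation in Example 4 can be sketched as follows; the substrate values and starting values are our own assumptions, and nls plays the role of the least squares method η inside the SAL algorithm.

```r
# One simulated sample from the set-up of Example 4 and the nls fit whose MSE
# is used as the gamma score; the x values are an assumed range, not the
# original design points.
set.seed(2)
x <- runif(12, 0.02, 1.10)                               # substrate concentrations
y <- 215 * x / (0.07 + x) + rnorm(12, 0, 8)              # 10 good observations ...
y[11:12] <- 215 * x[11:12] / (0.07 + x[11:12]) + rnorm(2, 30, 1)  # ... and 2 outliers
dat <- data.frame(x = x, y = y)
fit <- nls(y ~ t0 * x / (t1 + x), data = dat, start = list(t0 = 200, t1 = 0.1))
sum(residuals(fit)^2) / (nrow(dat) - 2)                  # MSE used as the gamma score
```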


Table 4. Regression estimates with the standard errors in brackets for Example 4.

               Original data set                  Simulated data set
Parameter      LSE (s.e.)       SUE (s.e.)        LSE (s.e.)       SUE (s.e.)
θ_0            212.68 (6.95)    216.62 (4.79)     210.35 (9.03)    217.30 (4.55)
θ_1            0.064 (0.008)    0.072 (0.006)     0.073 (0.014)    0.069 (0.007)
σ̂              10.93            7.10              15.23            6.47

Table 5. Results for the simulation study.

Parameter      LSE bias (s.e.)      SUE bias (s.e.)
θ_0            −3.64 (8.84)         1.64 (8.93)
θ_1            0.0059 (0.0238)      0.0052 (0.0266)
σ              5.70 (1.51)          −0.20 (2.71)

5. Secondary Criterion and Other Variations of the Subsampling Algorithm

The 5-step subsampling algorithm SAL(n_s, r*, k) introduced in Section 2 is the basic version which is straightforward to implement. In this section, we discuss modifications and variations which can improve its efficiency and reliability.

5.1. Alternative Stopping Criteria for Improving the Efficiency of the Combined Subsample S_g

In Step 5 of SAL(n_s, r*, k), the first r* subsamples in the sequence A_(1), A_(2), ..., A_(k) are identified as good subsamples and their union is taken to form S_g. However, it is clear from the discussion on parameter selection in Section 2.2 that there are likely more than r* good subsamples among the k generated by SAL(n_s, r*, k). When there are more than r* good subsamples, we want to use them all to form a larger and thus more efficient S_g. We now discuss two alternatives to the original Step 5 (referred to as Step 5a and Step 5b, respectively) that can take advantage of the additional good subsamples.

Step 5a: Suppose there is a cut-off point for the γ_(i) scores, say γ_C, such that the jth subsample is good if and only if γ_(j) ≤ γ_C. Then we define the combined subsample as

S_g = ∪_{j: γ_(j) ≤ γ_C} A_(j).  (17)

Step 5b: Instead of a cut-off point, we can use γ_(1) as a reference point and take the union of all subsamples whose γ scores are comparable to γ_(1). That is, for a pre-determined constant δ > 1, we define the combined subsample as

S_g = ∪_{j: γ_(j) ≤ δγ_(1)} A_(j).  (18)

In both Steps 5a and 5b, the number of subsamples in the union is not fixed. The values of γ_C and δ depend on n, n_s, η, γ and the underlying model. Selection of γ_C and δ may be based on the distribution of γ-scores of good subsamples.

If either Step 5a or 5b is used instead of the original Step 5, we need to ensure that the number of subsamples whose union is taken in (17) or (18) is no less than r*. If it is less than r*, then the criterion based on γ_C or δ may be too restrictive and it is not having the desired effect of improving the efficiency of the original Step 5. It may also be that the number of good subsamples is less than r*, and in this case a re-run of SAL with a larger k is required.

Finally, noting that the r* subsamples making up S_g in Step 5 may not be distinct, another way to increase efficiency is to use r* distinct subsamples. The number of good subsamples in a sequence of k distinct subsamples follows a Hypergeometric(k, L_g, L_T) distribution, where L_g is the total number of good subsamples of size n_s and L_T the total number of subsamples of this size. Since L_T is usually much larger than L_g, the hypergeometric distribution is approximately a binomial distribution. Hence the required k for having r* good subsamples with probability p* is approximately the same as before.
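Step 5b is a one-line change to the combination step of the sal_lm sketch in Section 2.1; the fragment below is our own illustration, with δ = 1.5 chosen arbitrarily and the vectors scores and indices assumed to come from Steps 1-4.

```r
# Step 5b variant: combine all subsamples whose gamma score is within a factor
# delta of the smallest score gamma_(1); delta = 1.5 is an arbitrary choice.
combine_step5b <- function(scores, indices, data, delta = 1.5) {
  keep <- which(scores <= delta * min(scores))
  good_idx <- sort(unique(unlist(indices[keep])))
  data[good_idx, ]                     # the combined sample S_g
}
```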

5.2. Consistency of Subsamples and Secondary Criterion for Improved Reliability of the Subsampling Algorithm

Let β̃_i and β̃_j be the estimates given by (method η applied to) the ith and jth subsamples, respectively. Let d(β̃_i, β̃_j) be a distance measure. We say that these two subsamples are inconsistent if d(β̃_i, β̃_j) > d_c, where d_c is a fixed positive constant. Conversely, we say that the two subsamples are consistent if d(β̃_i, β̃_j) ≤ d_c. Inconsistent subsamples may not be pooled into the combined sample S_g to estimate the unknown β.

Step 5 of SAL(n_s, r*, k) relies only on the γ-ordering of the subsamples to construct the combined sample S_g. In this and the two variations above, S_g is the union of (r* or a random number of) subsamples with the smallest γ-scores. However, an S_g constructed in this manner may fail to be a union containing only good subsamples, as a small γ-score is not a sufficient condition for a good subsample. One case of bad subsamples with small γ-scores we have mentioned previously is that of a bad subsample consisting entirely of outliers that also follow the same model as the good data but with a different β. In this case, the γ-score of this bad subsample may be very small. When it is pooled into the S_g, outliers will be retained and the resulting SUE will not be robust. Another case is when a bad subsample consists of some good data points and some outliers but the model happens to fit this bad subsample well, resulting in a very small γ-score. In both cases, the γ criterion is unable to identify the bad subsamples, but the estimated β based on these bad subsamples can be expected to be inconsistent with that given by a good subsample.

To guard against such failures of the γ criterion, we use the consistency as a secondary criterion to increase the reliability of the SAL(n_s, r*, k). Specifically, we pool only subsamples that are consistent through a modified Step 5, Step 5c, given below.

Step 5c: Denote by β̃_(i) the estimated value of β based on A_(i), where i = 1, 2, ..., k, and let d(β̃_(i), β̃_(j)) be a distance measure between two estimated values. Take the union

S_g = ∪_{j=1}^{r*} A_(i_j),  (19)

where the A_(i_j) (i_j = i_1, i_2, ..., i_{r*}) are the first r* consistent subsamples in {A_(1), A_(2), ..., A_(k)} satisfying

max_{i_j, i_l} d(β̃_(i_j), β̃_(i_l)) ≤ d_c

for some predetermined constant d_c.

The γ criterion is the primary criterion of the subsampling algorithm as it provides the first ordering of the k subsamples. A secondary criterion divides the γ-ordered sequence into consistent subsequences and performs a grouping action (instead of an ordering action) on the subsamples. In principle, we can switch the roles of the primary and secondary criteria, but this may substantially increase the computational difficulties of the algorithm. Additional criteria based on the elements of β̃ may also be added.

With the secondary criterion, the first r* subsamples whose union is taken in Step 5c have an increased chance of all being good subsamples. While it is possible that they are actually all bad subsamples of the same kind (consistent bad subsamples with small γ-scores), this is improbable in most applications. Empirical experience suggests that for a suitably chosen metric d(β̃_i, β̃_j), threshold d_c and a reasonably large subsample size n_s (say n_s = int[0.5N] + 1), the first r* subsamples are usually consistent, making the consistency check redundant. However, when the first r* subsamples are not consistent, and in particular when i_{r*} > r*, it is important to look for an explanation. Besides a poorly chosen metric or threshold, it is also possible that the data set actually comes from a mixture model. Apart from the original Step 5, a secondary criterion can also be incorporated into Step 5a or Step 5b.

Finally, there are other variations of the algorithm which may be computationally more efficient. One such variation whose efficiency is difficult to measure but is nevertheless worth mentioning here requires only r* = 1 good subsample to start with. With r* = 1, the total number of subsamples required (k) is substantially reduced, leading to less computation. Once identified, the one good subsample is used to test points outside of it. All additional good points identified through the test are then combined with the good subsample to form a combined sample. The efficiency of such a combined sample is difficult to measure and it depends on the test. But computationally, this approach is in general more efficient since the testing step is computationally inexpensive. This approach is equivalent to a partial depth function based approach proposed in [13].
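A greedy version of the consistency screen in Step 5c is sketched below; it is our own illustration, using the Euclidean distance between coefficient vectors as d and assuming the subsample fits are already sorted by their γ scores.

```r
# Greedy sketch of Step 5c: scan the gamma-ordered coefficient vectors and keep
# the first r* fits that are mutually within distance d_c of each other.
combine_step5c <- function(beta_list, indices, data, r_star, d_c) {
  chosen <- 1                                      # start from the best-scoring subsample
  for (i in 2:length(beta_list)) {
    if (length(chosen) == r_star) break
    dists <- sapply(chosen, function(j)
      sqrt(sum((beta_list[[i]] - beta_list[[j]])^2)))
    if (max(dists) <= d_c) chosen <- c(chosen, i)  # consistent with all kept fits
  }
  data[sort(unique(unlist(indices[chosen]))), ]    # the combined sample S_g
}
```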

6. Concluding Remarks

It is of interest to note the connections among our subsampling method, the bootstrap method and the method of trimming outliers. With the subsampling method, we substitute analytical treatment of the outliers (such as the use of the ψ functions in the M-estimator) with computing power through the elimination of outliers by repeated fitting of the model to the subsamples. From this point of view, our subsampling method and the bootstrap method share a common spirit of trading analytical treatment for intensive computing. Nevertheless, our subsampling method is not a bootstrap method, as our objective is to identify and then make use of a single combined sample instead of making inference based on all bootstrap samples. The subsampling based elimination of outliers is also a generalization of the method of trimming outliers. Instead of relying on some measure of outlyingness of the data to decide which data points to trim, the subsampling method uses a model based trimming by identifying subsamples containing outliers through fitting the model to the subsamples. The outliers are in this case outliers with respect to the underlying regression model instead of some measure of central location, and they are only implicitly identified as being part of the data points not in the final combined sample.

The subsampling method has two main advantages: it is easy to use and it can be used for any regression model to produce robust estimation for the underlying ideal model. There are, however, important theoretical issues that remain to be investigated. These include the characterization of the dependence structure among observations in the combined sample, the finite sample distribution and the asymptotic distribution of the SUE. Unfortunately, there does not seem to be a unified approach which can be used to tackle these issues for SUEs for all regression models; the dependence structure and the distributions of the SUE will depend on the underlying regression model as well as the method η and criterion γ chosen. This is yet another similarity to the bootstrap method in that although the basic ideas of such computation intensive methods have wide applications, there is no unified theory covering all applications and one needs to investigate the validity of the methods on a case by case basis. We have made some progress on the simple case of a linear model. We continue to examine these issues for this and other models, and hope to report our findings once they become available.

7. Acknowledgements

The authors would like to thank Dr. Julie Zhou for her generous help and support throughout the development of this work. This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] P. J. Huber, "Robust Statistics," Wiley, New York, 1981.
[2] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, "Robust Statistics: The Approach Based on Influence Functions," Wiley, New York, 1986.
[3] P. J. Rousseeuw and A. M. Leroy, "Robust Regression and Outlier Detection," Wiley, New York, 1987.
[4] R. A. Maronna, R. D. Martin and V. J. Yohai, "Robust Statistics: Theory and Methods," Wiley, New York, 2006.
[5] D. G. Simpson, D. Ruppert and R. J. Carroll, "On One-Step GM-Estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, Vol. 87, No. 418, 1992, pp. 439-450.
[6] V. J. Yohai, "High Breakdown-Point and High Efficiency Robust Estimates for Regression," Annals of Statistics, Vol. 15, No. 2, 1987, pp. 642-656. doi:10.1214/aos/1176350366
[7] K. A. Brownlee, "Statistical Theory and Methodology in Science and Engineering," 2nd Edition, Wiley, New York, 1965.
[8] D. F. Andrews, "A Robust Method for Multiple Linear Regression," Technometrics, Vol. 16, No. 4, 1974, pp. 523-531.
[9] D. C. Montgomery, E. A. Peck and G. G. Vining, "Introduction to Linear Regression Analysis," 4th Edition, Wiley, New York, 2006.
[10] J. R. Ashford, "An Approach to the Analysis of Data for Semi-Quantal Responses in Biological Assay," Biometrics, Vol. 15, No. 4, 1959, pp. 573-581. doi:10.2307/2527655
[11] E. Cantoni and E. Ronchetti, "Robust Inference for Generalized Linear Models," Journal of the American Statistical Association, Vol. 96, No. 455, 2001, pp. 1022-1030. doi:10.1198/016214501753209004
[12] D. M. Bates and D. G. Watts, "Nonlinear Regression Analysis and Its Applications," Wiley, New York, 1988.
[13] M. Tsao, "Partial Depth Functions for Multivariate Data," Manuscript in Preparation, 2012.


Appendix: Proofs of Theorems 1 and 2

Proof of Theorem 1:

Let p_g be the probability that a random subsample of size n_s from S_N is a good subsample containing no outliers. Since n_s ≤ n, we have p_g > 0. With probability 1, the subsequence A*_1, A*_2, ... is an infinite sequence. This is so because the event that this subsequence contains only finitely many subsamples is equivalent to the event that there exists a finite l such that {A_l, A_{l+1}, ...} contains no good subsamples. Since p_g > 0, the probability of this latter event, and hence that of the former event, are both zero. Further, under the condition that A*_j ⊆ S_n, A*_j may be viewed as a random sample of size n_s taken directly without replacement from S_n. Hence with probability 1, A*_1, A*_2, ... is an infinite sequence of random samples from the finite population S_n. It follows that for any z_j ∈ S_n, the probability that it is in at least one of the A*_i is 1. Hence P(S_n ⊆ B_∞) = 1. This and the fact that B_∞ ⊆ S_n imply the theorem.

Proof of Theorem 2:

We prove (4) by induction. For j = 1, since B_1 = A*_1 which contains exactly n_s points, W_1 is a constant and

EF(B_1) = E(W_1)/n = n_s/n = 1 − (n − n_s)/n.

Hence (4) is true for j = 1.

To find EF(B_2), denote by Ā*_1 the complement of A*_1 with respect to S_n. Then Ā*_1 contains n − n_s points. Since A*_2 is a random sample of size n_s taken without replacement from S_n, we may assume it contains U_1 data points from Ā*_1 and n_s − U_1 data points from A*_1. It follows that U_1 has a hypergeometric distribution

U_1 ~ Hyperg(n, n − n_s, n_s)  (20)

with expected value

E(U_1) = n_s (n − n_s)/n.

Since B_2 = A*_1 ∪ A*_2, its number of data points is W_2 = n_s + U_1. Hence

E(W_2) = n_s + E(U_1) = n_s + n_s (n − n_s)/n = n[1 − ((n − n_s)/n)²].

It follows that

EF(B_2) = 1 − ((n − n_s)/n)²,

and formula (4) is also true for j = 2.

Now assume (4) holds for some j − 1 ≥ 2. We show that it also holds for j. Denote by #(A) the number of points in A. Then W_j = #(B_j) = #(B_{j−1} ∪ A*_j) = W_{j−1} + U_{j−1}, where U_{j−1} is the number of good data points in B_j but not in B_{j−1}. The distribution of U_{j−1} conditioning on W_{j−1} = w_{j−1} is

U_{j−1} | W_{j−1} = w_{j−1} ~ Hyperg(n, n − w_{j−1}, n_s).  (21)

By (21) and the assumption that (4) holds for j − 1, we have

E(W_j) = E(W_{j−1}) + E[E(U_{j−1} | W_{j−1})]
       = E(W_{j−1}) + E[n_s (n − W_{j−1})/n]
       = n_s + (1 − n_s/n) E(W_{j−1})
       = n_s + (1 − n_s/n) n [1 − ((n − n_s)/n)^{j−1}]
       = n [1 − ((n − n_s)/n)^j].

It follows that

EF(B_j) = E(W_j)/n = 1 − ((n − n_s)/n)^j.

Thus (4) also holds for j, which proves Theorem 2.
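The efficiency formula (4) is easy to verify numerically; the short R experiment below (our own) draws good subsamples directly from a set of n points and compares the observed coverage of B_j with the formula.

```r
# Monte Carlo check of the efficiency formula (4) for EF(B_j).
n <- 50; ns <- 26; j <- 3
cover <- replicate(10000, {
  Bj <- unique(unlist(replicate(j, sample(n, ns), simplify = FALSE)))
  length(Bj) / n
})
mean(cover)                  # observed efficiency of B_j
1 - ((n - ns) / n)^j         # efficiency predicted by (4)
```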
