
MEASUREMENT SCIENCE REVIEW, 20, (2020), No. 1, 6–14
Journal homepage: https://content.sciendo.com
DOI: 10.2478/msr-2020-0002

On Robust Estimation of Error Variance in (Highly) Robust Regression

Jan Kalina^{1,2}, Jan Tichavský^{1}

^{1} The Czech Academy of Sciences, Institute of Computer Science, Pod Vodárenskou věží 2, 182 07 Praha 8, Czech Republic, [email protected]
^{2} The Czech Academy of Sciences, Institute of Information Theory and Automation, Pod Vodárenskou věží 4, 182 00 Praha 8, Czech Republic

The linear regression model requires robust estimation of parameters if the measured data are contaminated by outlying measurements (outliers). While a number of robust estimators (i.e. resistant to outliers) have been proposed, this paper is focused on estimating the variance of the random regression errors. We particularly focus on the least weighted squares estimator, for which we review its properties and propose new weighting schemes together with corresponding estimates for the variance of disturbances. An illustrative example revealing the idea of the estimator to down-weight individual measurements is presented. Further, two numerical simulations presented here allow various estimators to be compared. They verify the theoretical results for the least weighted squares to be meaningful. MM-estimators turn out to yield the best results in the simulations in terms of both accuracy and precision. The least weighted squares (with suitable weights) remain only slightly behind in terms of the mean square error and are able to outperform the much more popular least trimmed squares estimator, especially for smaller sample sizes.

Keywords: High robustness, robust regression, outliers, variance of errors, least weighted squares, simulation.

1. INTRODUCTION

The simplicity and interpretability of the linear regression model has resulted in an enormous number of applications in modeling real data. The aim of linear regression is to model a continuous variable taking into account one or more independent variables (regressors). Regression can be used to predict values of the response based on such values of one or more (continuous and/or categorical) variables, which are not available. Carl Friedrich Gauss was already using regression for the equalization of networks constructed from geodesic measurements, and thus he connected two intertwined scientific disciplines, namely measurement science and statistical point estimation. Later, Ronald A. Fisher and Francis Galton, who both contributed to the development of regression analysis, were also passionate about measurement [1]. Regression analysis has become an important part of analyzing measurements by means of exploratory data analysis (EDA) or predictive data mining, with numerous applications not limited to calibration of measuring instruments [2], estimating missing values, or detecting mixtures of two (or more) populations or clusters of data [3].

Real measurements can be influenced by (random or systematic) errors coming from different sources (e.g. because some measurements are performed under different conditions), while there are established tools for estimating the uncertainty of measurements in different situations (see e.g. [4]).

We can say that real measurements often suffer from outlying values (outliers), which however remain only vaguely defined. If the linear model exists objectively, then any measurement for which the considered linear regression model does not objectively hold may be perceived as an outlier in the model. If the linear model represents only our approximation to reality, then any anomalous measurements not fitted well by the model play the role of outliers. In general, outliers typically appear in real data across various disciplines, e.g. in engineering applications [5] or image analysis based on markers measured within images [6]. Outliers appear practically always in measurements of molecular genetic and metabolomic biomarkers, for which severe measurement errors are immanent to the measurement technology [7], [8]. Thus, there emerges a need for robust estimation of regression parameters.

While the least squares estimator in linear regression is notoriously known to suffer from the presence of outlying values in the data, numerous robust regression estimators have been proposed as resistant alternatives [9], including various recent tools tailor-made for very specific tasks. The concept of robust estimation is traditionally tied to resistance with respect to outliers. However, there remains a gap in the area of estimating the variance of the random regression errors (i.e. error variance), as there seem to be no systematic comparisons among different robust estimates.

Within regression modeling, it is important to estimate the regression parameters (say $\beta$) together with the variance of errors, commonly denoted as $\sigma^2$ (if this is assumed to exist). If the measurements are contaminated by outliers, then it is necessary to estimate $\sigma^2$ in a highly robust way. It is possible to accompany each of the commonly used robust estimators of $\beta$ by a corresponding estimator of $\sigma^2$, which allows the precision (uncertainty) of the regression fit to be evaluated. Estimation of $\sigma^2$ within robust regression is crucial for robust hypothesis tests about the significance of regression parameters, for the corresponding confidence intervals for the parameters, for outlier detection, or for comparing the efficiency of various robust estimates. Outlier detection based on robust estimates of $\sigma^2$ has been popularized by the seminal book [10] and has been used several times in practical applications [11].

In this paper, we work with several important classes of (possibly highly) robust regression estimators, including S-estimators, MM-estimators, or the least trimmed squares estimator. Nevertheless, we are primarily interested in the least weighted squares estimator, which turns out to have appealing statistical properties [12]. As recalled in Section 2, their corresponding estimators of $\sigma^2$ are consistent (under specific assumptions). We are particularly interested in the least weighted squares estimator, which remains much less known compared to other available robust estimators. For the least weighted squares, we propose new weighting schemes together with corresponding estimates for the variance of the regression errors. We present an illustrative example in Section 3 and numerical simulations in Section 4 comparing the variance of the errors obtained by several highly robust estimators. Finally, Section 5 concludes the paper.

2. SUBJECT & METHODS

We consider the standard linear regression model

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + e_i, \quad i = 1, \dots, n, \tag{1}$$

where $n$ denotes the total number of measurements available for a continuous response $Y = (Y_1, \dots, Y_n)^T$ and for $p$ independent variables (regressors), which may be either random or deterministic. We will use the usual notation $X_i = (X_{i1}, \dots, X_{ip})^T$ and $X = (X_1^T, \dots, X_n^T)^T$. The random errors (disturbances) $e_1, \dots, e_n$ are assumed to be independent and identically distributed (i.i.d.) with a distribution function $F$, which is absolutely continuous with the corresponding probability density function $f$. Throughout the paper, we assume $\mathrm{E}\,e_1^2$ to exist and denote the common variance of the errors by $\sigma^2 := \mathrm{var}\,e_1$. The statistical concept of variability corresponds to the precision in measurement science, as it holds that $\mathrm{var}\,Y_1 = \mathrm{var}\,e_1 = \sigma^2$ for measurements with non-random regressors.

[...] robust estimators, which naturally generalize maximum likelihood estimation principles. In fact, studentized M-estimators were proposed, which had to rely on an initial estimate of $\sigma^2$, which plays the role of a nuisance parameter [9]. This reveals the importance of estimating $\sigma^2$, and we focus here only on such estimates which are able to estimate $\beta$ and $\sigma^2$ jointly. They are suitable for the so-called contaminated normal distribution corresponding to a mixture of normal errors with outliers [13], or for errors with a heavy-tailed distribution.

This section recalls the least squares and its weighted version together with several (possibly highly) robust estimators, namely S-estimators, MM-estimators and least trimmed squares. In addition, we present new ideas for the least weighted squares estimator. We also cite references where the consistency of these estimators of $\beta$ and corresponding estimators of $\sigma^2$ was proven; usually it is necessary to assume that the distribution of errors is continuous and symmetric around 0, apart from additional technical assumptions. Highly robust estimators are defined as those which may attain a high value of the breakdown point. We can say that the breakdown point, which represents a fundamental concept of robust statistics [9], is a measure of robustness of a statistical estimator of an unknown parameter. Formally, the finite-sample breakdown point evaluates the minimal fraction of data that can drive an estimator beyond all bounds when set to arbitrary values.

2.1. Least squares and weighted least squares

The least squares estimator with the analytical expression $b_{LS} = (X^T X)^{-1} X^T Y$ is vulnerable to the presence of outliers in the data. To avoid further confusion, let us also recall its weighted version with given non-negative weights $w_1, \dots, w_n$ fulfilling $\sum_{i=1}^n w_i = 1$. Let $u_i(b)$ denote the residual of the $i$-th measurement based on a given estimate $b = (b_0, b_1, \dots, b_p)^T$ of $\beta$, i.e.

$$u_i(b) = Y_i - b_0 - b_1 X_{i1} - \cdots - b_p X_{ip}, \quad i = 1, \dots, n. \tag{2}$$

While the least squares estimator is suitable for measurements with the same precision, the weighted least squares (WLS) estimator, also known as the Aitken estimator or generalized least squares, represents an analogy for differently precise measurements, however with a known precision. Formally, the WLS estimator is obtained as

$$b_{WLS} = \arg\min_{b \in \mathbb{R}^{p+1}} \sum_{i=1}^n w_i u_i^2(b), \tag{3}$$

i.e. by minimization of a weighted estimate of $\sigma^2$ over $b$. Equivalently, the WLS estimator is obtained as the solution of the set of normal equations

$$\sum_{i=1}^n w_i X_i (Y_i - X_i^T b) = 0. \tag{4}$$
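As an illustration of the weighted least squares estimator and its normal equations, the following is a minimal Python sketch rather than the authors' implementation. The function names `solve_linear_system`, `wls_fit` and `residuals` are hypothetical, the toy data are invented for this example, and each design row is assumed to start with 1.0 so that the intercept is covered by the same normal equations.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    # Augmented matrix [a | rhs]; copied so the inputs stay untouched.
    m = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are (numerically) singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * k
    for row in range(k - 1, -1, -1):
        tail = sum(m[row][c] * x[c] for c in range(row + 1, k))
        x[row] = (m[row][k] - tail) / m[row][row]
    return x


def wls_fit(x_rows: Sequence[Sequence[float]], y: Sequence[float],
            weights: Sequence[float]) -> List[float]:
    """Return b solving sum_i w_i X_i (Y_i - X_i^T b) = 0.

    Assumption: every row of x_rows already starts with 1.0 (intercept).
    """
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    w = [wi / total for wi in weights]  # rescale so the weights sum to 1
    k = len(x_rows[0])
    # Weighted cross-products X^T W X and X^T W Y.
    xtwx = [[sum(wi * xi[r] * xi[c] for wi, xi in zip(w, x_rows))
             for c in range(k)] for r in range(k)]
    xtwy = [sum(wi * xi[r] * yi for wi, xi, yi in zip(w, x_rows, y))
            for r in range(k)]
    return solve_linear_system(xtwx, xtwy)


def residuals(x_rows: Sequence[Sequence[float]], y: Sequence[float],
              b: Sequence[float]) -> List[float]:
    """Residuals u_i(b) = Y_i - X_i^T b."""
    return [yi - sum(bj * xij for bj, xij in zip(b, xi))
            for xi, yi in zip(x_rows, y)]


# Invented toy data: Y = 1 + 2 x exactly, except for one gross outlier at x = 5.
x_rows = [[1.0, float(x)] for x in range(6)]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 100.0]

b_ls = wls_fit(x_rows, y, [1.0] * 6)             # equal weights: ordinary least squares
b_wls = wls_fit(x_rows, y, [1, 1, 1, 1, 1, 0])   # outlier given zero weight
```

With equal weights the single outlier pulls the fitted slope far away from 2, whereas giving the outlier zero weight recovers intercept 1 and slope 2 up to rounding error. In this sketch the weights are supplied by hand, which corresponds to the known-precision setting of WLS; it does not choose the weights from the data.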