Robust Regression Estimators When There Are Tied Values
Rand R. Wilcox
Dept of Psychology
University of Southern California

Florence Clark
Division of Occupational Science & Occupational Therapy
University of Southern California

June 13, 2013

ABSTRACT

There is a vast literature on robust regression estimators, the bulk of which assumes that the dependent (outcome) variable is continuous. Evidently, there are no published results on the impact of tied values in terms of the Type I error probability. A minor goal in this paper is to comment generally on problems created by tied values when testing the hypothesis of a zero slope. Simulations indicate that when tied values are common, there are practical problems with Yohai's MM-estimator, the Coakley-Hettmansperger M-estimator, the least trimmed squares estimator, the Koenker-Bassett quantile regression estimator, Jaeckel's rank-based estimator as well as the Theil-Sen estimator. The main goal is to suggest a modification of the Theil-Sen estimator that avoids the problems associated with the estimators just listed. Results on the small-sample efficiency of the modified Theil-Sen estimator are reported as well. Data from the Well Elderly 2 Study are used to illustrate that the modified Theil-Sen estimator can make a practical difference.

Keywords: tied values, Harrell-Davis estimator, MM-estimator, Coakley-Hettmansperger estimator, rank-based regression, Theil-Sen estimator, Well Elderly 2 Study, perceived control

1 Introduction

It is well known that the ordinary least squares (OLS) regression estimator is not robust (e.g., Hampel et al., 1987; Huber & Ronchetti, 2009; Maronna et al., 2006; Staudte & Sheather, 1990; Wilcox, 2012a, b). A single outlier can result in a poor fit to the bulk of the data, and outliers in the dependent variable can result in relatively poor power.
Numerous robust regression estimators have been derived that are aimed at dealing with known concerns with the OLS estimator, which are described in the books just cited. But the bulk of the published results on robust regression estimators assume the dependent variable is continuous: the impact of tied values has, apparently, been ignored. As is evident, in many situations tied values occur by necessity. Consider, for example, a dependent variable that is the sum of scores associated with a collection of Likert scales. So the sample space consists of some finite set of integers having cardinality N. If the sample size is n > N, then of course tied values must occur, with the number of tied values increasing as n gets large.

Note that the mere fact that a sample space is discrete does not imply that concerns regarding outliers and skewed distributions are no longer relevant. In the Well Elderly 2 study (Jackson et al., 2009), for example, which motivated this paper, several dependent variables are based on the sum associated with a collection of Likert scales. Typically about 8% to 12% of the values were flagged as outliers using the so-called MAD-median rule (e.g., Wilcox, 2012a, section 2.4.2). This motivated consideration of robust regression estimators, but there was uncertainty regarding the ability of robust methods to control the probability of a Type I error when tied values occur due to a finite sample space. Simulations were attempted as a partial check on this issue for the situations described in section 3 of this paper. But it was found that, among the estimators that were considered, either computational problems arose in some instances, making it impossible to check on the actual level, or control over the Type I error probability was poor.
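As a rough illustration of the MAD-median rule mentioned above, the following sketch flags a value X as an outlier when |X - M| / (MAD / 0.6745) exceeds 2.24, where M is the sample median and MAD is the median absolute deviation from M. The function name and the choice to raise an error when MAD is zero are assumptions made for this sketch only; with heavily tied data MAD can be zero, in which case the rule as stated is undefined.

```python
from statistics import median


def mad_median_outliers(values, crit=2.24):
    """Return one boolean flag per value using the MAD-median rule.

    A value x is flagged when |x - M| / (MAD / 0.6745) > crit, where M is the
    sample median and MAD is the median of the absolute deviations from M.
    """
    m = median(values)
    mad = median([abs(v - m) for v in values])
    if mad == 0:
        # Assumption of this sketch: refuse to guess when the scale is zero,
        # which can happen when more than half of the values are tied.
        raise ValueError("MAD is zero, so the rule is undefined for these data")
    madn = mad / 0.6745  # rescaled so it estimates sigma under normality
    return [abs(v - m) / madn > crit for v in values]


if __name__ == "__main__":
    scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 20]  # made-up illustrative data
    print(mad_median_outliers(scores))
```

The data in the usage lines are invented for illustration and are not taken from the Well Elderly 2 study.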
The estimators that were considered included Yohai's (1987) MM-estimator, Rousseeuw's (1984) least trimmed squares (LTS) estimator, the Coakley and Hettmansperger (1993) M-estimator, the Theil (1950) and Sen (1964) estimator, the Koenker and Bassett (1978) quantile regression estimator and a rank-based estimator stemming from Jaeckel (1972). The MM-estimator and the LTS estimator were applied via the R package robustbase, the quantile regression estimator was applied via the R package quantreg, the rank-based estimator was applied using the R package Rfit, and the Coakley-Hettmansperger and Theil-Sen estimators were applied via the R package WRS. A percentile bootstrap method was used to test the hypothesis of a zero slope, which has been found to perform relatively well, in terms of controlling the probability of a Type I error, compared to other strategies that have been studied (Wilcox, 2012b). The MM-estimator and Coakley-Hettmansperger estimator routinely terminated due to some computational issue. The other estimators had estimated Type I error probabilities well below the nominal level. This problem actually got worse as the sample size increased when using Theil-Sen and rank-based methods for reasons explained in section 2 of this paper. The R package Rfit includes a non-bootstrap test of the hypothesis that the slope is zero. Again the actual level was found to be substantially less than the nominal level in various situations, and increasing n only made matters worse.

The goal in this paper is to suggest a simple modification of the Theil-Sen estimator that avoids the problems just indicated. Section 2 reviews the Theil-Sen estimator and illustrates why it can be highly unsatisfactory. Then the proposed modification is described. Section 3 summarizes simulation estimates of the actual Type I error probability when testing at the .05 level and it reports some results on its small-sample efficiency.
Section 4 uses data from the Well Elderly 2 study to illustrate that the modified Theil-Sen estimator can make a substantial practical difference.

2 The Theil-Sen Estimator and the Suggested Modification

When the dependent variable is continuous, the Theil-Sen estimator enjoys good theoretical properties and performs well in simulations in terms of power and Type I error probabilities when testing hypotheses about the slope (e.g., Wilcox, 2012b). Its mean squared error and small-sample efficiency compare well to the OLS estimator as well as other robust estimators that have been derived (Dietz, 1987; Wilcox, 1998). Dietz (1989) established that its asymptotic breakdown point is approximately .29. Roughly, about 29% of the points must be changed in order to make the estimate of the slope arbitrarily large or small. Other asymptotic properties have been studied by Wang (2005) and Peng et al. (2008). Akritas et al. (1995) applied it to astronomical data and Fernandes and Leblanc (2005) to remote sensing.

Although the bulk of the results on the Theil-Sen estimator deal with situations where the dependent variable is continuous, an exception is the paper by Peng et al. (2008), which includes results for a discontinuous error term. They show that when the distribution of the error term is discontinuous, the Theil-Sen estimator can be super efficient. They also establish that even in the continuous case, the slope estimator may or may not be asymptotically normal. Peng et al. also establish the strong consistency and the asymptotic distribution of the Theil-Sen estimator for a general error distribution. Currently, a basic percentile bootstrap seems best when testing hypotheses about the slope and intercept, which has been found to perform well even when the error term is heteroscedastic (e.g., Wilcox, 2012b).

The Theil-Sen estimate of the slope is the usual sample median based on all of the slopes associated with any two distinct points.
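The definition just given, together with the percentile bootstrap test of a zero slope, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WRS implementation: "two distinct points" is taken to mean two points with different X values, the function names and the default of 599 bootstrap samples are choices made for this sketch, and the p-value convention (bootstrap estimates exactly equal to zero are split evenly between the two tails) is one common choice rather than something specified in the text.

```python
import random
from statistics import median


def pairwise_slopes(x, y):
    """Slopes of the lines through every pair of points with distinct x."""
    n = len(x)
    return [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(n)
        for j in range(i + 1, n)
        if x[j] != x[i]
    ]


def theil_sen_slope(x, y):
    """Theil-Sen slope: the usual sample median of the pairwise slopes."""
    return median(pairwise_slopes(x, y))


def percentile_bootstrap_slope(x, y, n_boot=599, alpha=0.05, seed=0):
    """Percentile bootstrap for H0: slope = 0.

    Resamples (x, y) pairs with replacement and returns
    (ci_low, ci_high, p_value).
    """
    if len(set(x)) < 2:
        raise ValueError("need at least two distinct x values")
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        slopes = pairwise_slopes([x[i] for i in idx], [y[i] for i in idx])
        if slopes:  # skip the rare resample in which every x is identical
            estimates.append(median(slopes))
    estimates.sort()
    k = round(alpha / 2 * n_boot)
    ci_low, ci_high = estimates[k], estimates[n_boot - k - 1]
    # Assumed convention: estimates equal to zero count half toward each tail.
    below = sum(e < 0 for e in estimates) + 0.5 * sum(e == 0 for e in estimates)
    p_star = below / n_boot
    return ci_low, ci_high, 2 * min(p_star, 1 - p_star)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 4, 6, 8, 10, 12, 14, 100]  # made-up data with one outlier
    print(theil_sen_slope(x, y))
    print(percentile_bootstrap_slope(x, y))
```

Because the test rests entirely on the bootstrap distribution of a sample median, any discreteness in that distribution carries over directly to the confidence interval and p-value.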
Consequently, results dealing with inferential methods based on the sample median (Wilcox, 2012a, section 4.10.4) raise concerns about the more obvious methods for testing hypotheses when dealing with a discrete distribution, including the percentile bootstrap method that has been found to perform well when the dependent variable is continuous. More precisely, when dealing with a discrete distribution, the sampling distribution of the sample median can be highly discrete, and indeed it can become more discrete (the cardinality of the sample space can decrease) as the sample size increases. This same problem occurs when using the Theil-Sen estimator, which in turn has a negative impact on the probability of a Type I error, even as the sample size increases, as will be illustrated.

The main goal is to report simulation results indicating that a slight modification of the Theil-Sen estimator avoids the problems just described. The modification consists of replacing the usual sample median with the Harrell and Davis (1982) estimate of the median. The Harrell-Davis estimator uses a weighted average of all the order statistics, suggesting that it might avoid the problems associated with the Theil-Sen estimator that are described in the next section of this paper. Section 3 summarizes simulation results on the Type I error probability when using the modified estimator. The small-sample efficiency of the modified estimator is studied as well.

Figure 1: Plot of 5000 estimates of the slope using the Theil-Sen estimator, n = 20.

Figure 2: Plot of 5000 estimates of the slope using the Theil-Sen estimator, n = 60.

2.1 The Distribution of the Theil-Sen Estimator When the Dependent Variable is Discrete

Before continuing, a property of the Theil-Sen estimator is illustrated that helps elucidate why it can perform poorly in simulations. Suppose, for example, that Y has a beta-binomial distribution and that X and Y are independent.
For the moment, focus on the beta-binomial distribution B(10, 1, 9), which is described in detail in section 3. Figure 1 shows a plot of 5000 sample estimates of the slope based on the Theil-Sen estimator with n = 20. Among the 5000 estimates of the slope, 4683 are equal to zero. Figure 2 shows a plot of 5000 estimates when n = 60.
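A small simulation in the spirit of the example above shows how often the usual Theil-Sen estimate is exactly zero, and how replacing the sample median with a Harrell-Davis estimate of the median changes this. Several things here are assumptions of the sketch rather than details taken from the paper: X is drawn from a standard normal distribution, the beta-binomial variable is generated as a Binomial(10, p) count with p drawn from a Beta(1, 9) distribution (the paper's own parameterization is given in its section 3), and the Harrell-Davis weights are approximated by Simpson's rule and renormalized instead of being computed from an exact incomplete beta function. All function names are hypothetical.

```python
import math
import random
from functools import lru_cache
from statistics import median


@lru_cache(maxsize=None)
def hd_median_weights(n):
    """Approximate Harrell-Davis weights for the median of n values.

    Weight i is the Beta(a, a) probability of the interval ((i - 1)/n, i/n],
    with a = (n + 1)/2. Each probability is approximated by Simpson's rule on
    a rescaled density; the weights are then renormalized to sum to one.
    """
    a = (n + 1) / 2.0

    def dens(t):
        if t <= 0.0 or t >= 1.0:
            return 1.0 if a == 1.0 else 0.0
        # Rescaled so the value at t = 0.5 is 1, which avoids overflow.
        log_kernel = math.log(t) + math.log(1.0 - t) + 2.0 * math.log(2.0)
        return math.exp((a - 1.0) * log_kernel)

    raw = []
    for i in range(n):
        lo, hi = i / n, (i + 1) / n
        raw.append(dens(lo) + 4.0 * dens((lo + hi) / 2.0) + dens(hi))
    total = sum(raw)
    return tuple(w / total for w in raw)


def hd_median(values):
    """Weighted average of all order statistics (Harrell-Davis, q = 0.5)."""
    weights = hd_median_weights(len(values))
    return sum(w * v for w, v in zip(weights, sorted(values)))


def pairwise_slopes(x, y):
    """Slopes through every pair of points with distinct x values."""
    n = len(x)
    return [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(n)
        for j in range(i + 1, n)
        if x[j] != x[i]
    ]


def beta_binomial(rng, m=10, a=1.0, b=9.0):
    """One draw: p ~ Beta(a, b), then a Binomial(m, p) count (assumed form)."""
    p = rng.betavariate(a, b)
    return sum(rng.random() < p for _ in range(m))


def simulate(n=20, reps=500, seed=1):
    """Return (usual Theil-Sen estimates, Harrell-Davis based estimates)."""
    rng = random.Random(seed)
    ts, hd = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta_binomial(rng) for _ in range(n)]  # independent of x
        slopes = pairwise_slopes(x, y)
        ts.append(median(slopes))
        hd.append(hd_median(slopes))
    return ts, hd


if __name__ == "__main__":
    ts, hd = simulate()
    print("proportion of exact zeros, usual median:", sum(t == 0 for t in ts) / len(ts))
    print("distinct estimates, usual median:", len(set(ts)))
    print("distinct estimates, Harrell-Davis:", len(set(hd)))
```

Whenever two Y values are tied, the corresponding pairwise slope is exactly zero, so a large block of zeros sits in the middle of the ordered slopes and the usual median lands on it in most samples. The Harrell-Davis version averages over all order statistics, so its sampling distribution is far less discrete. The exact proportions printed depend on the seed and on the assumptions listed above.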