
Risks — Article

Linear Regression for Heavy Tails

Guus Balkema 1,* and Paul Embrechts 2

1 Department of Mathematics, Universiteit van Amsterdam, 1098 XH Amsterdam, The Netherlands
2 RiskLab, Department of Mathematics, ETH Zurich, 8092 Zurich, Switzerland; [email protected]
* Correspondence: [email protected]

Received: 29 June 2018; Accepted: 21 August 2018; Published: 10 September 2018

Abstract: There exist several estimators of the regression line in simple linear regression: Least Squares, Least Absolute Deviation, Right Median, Theil–Sen, Weighted Balance, and Least Trimmed Squares. Their performance for heavy tails is compared below on the basis of a quadratic loss function. The case where the explanatory variable is the inverse of a standard uniform variable and the error has a Cauchy distribution plays a central role, but heavier and lighter tails are also considered. Tables list the empirical sd and bias for ten batches of one hundred thousand simulations when the explanatory variable has a Pareto distribution and the error has a symmetric Student distribution or a one-sided Pareto distribution, for various tail indices. The results in the tables may be used as benchmarks. The sample size is n = 100, but results for n = ∞ are also presented. The error in the estimate of the slope need not be asymptotically normal. For symmetric errors, the symmetric generalized beta prime densities often give a good fit.

Keywords: exponential generalized beta prime; generalized beta prime; hyperbolic balance; least absolute deviation; least trimmed squares; Pareto distribution; right median; Theil–Sen; weighted balance

Contents

1 Introduction
2 Background
3 Three Simple Estimators: LS, LAD and RMP
4 Weighted Balance Estimators
  4.1 Three Examples
  4.2 The Monotonicity Lemma
  4.3 LAD as a Weighted Balance Estimator
  4.4 Variations on LAD: LADPC, LADGC, and LADHC
5 Theil's Estimator and Kendall's τ
6 Trimming
7 Tables
  7.1 The Empirical sd and Bias
  7.2 Parameter Values
  7.3 An Example
8 Conclusions
A Tails
  A.1 Tails of âLAD
  A.2 Tails of âRM
  A.3 Tails of âWB0
  A.4 Tails of âLADHC
  A.5 The Left Tail of Dn
B The Poisson Point Process Model
  B.1 Distributions and Densities of Poisson Point Processes
  B.2 Equivalence of the Distributions pa for x > 1/2
  B.3 Error Densities with Local Irregularities
  B.4 Two Plots
  B.5 Convergence for the LS Estimates
  B.6 Convergence for the RM Estimates
C The EGBP Distributions
  C.1 The Exponential Generalized Beta Prime Densities
  C.2 Basic Formulas
  C.3 The Closure of EGBP
  C.4 The Symmetric Generalized Beta Prime Distributions
  C.5 The Parameters
  C.6 Fitting EGBP Distributions to Frequency Plots of log |ân(E)|
  C.7 Variations in the Error Density at the Origin
References

Risks 2018, 6, 93; doi:10.3390/risks6030093  www.mdpi.com/journal/risks

1. Introduction

The paper treats the simple linear regression

    Y_i = b + a X_i + Y_i*,   i = 1, ..., n,   (1)

when the errors Y_i* are observations from a heavy-tailed distribution, and the explanatory variables X_i too. In linear regression, the explanatory variables are often assumed to be equidistant on an interval. If the values are random, they may be uniformly distributed over an interval, or normal, or have some other distribution. In this paper, the explanatory variables are random. The X_i are inverse powers of uniform variables U_i in (0, 1): X_i = 1/U_i^x. The variables X_i have a Pareto distribution with tail index x > 0. The tails become heavier as the index increases. For x ≥ 1, the expectation is infinite. We assume that the error variables Y_i* have heavy tails too, with tail index h > 0.

The aim of this paper is twofold:

• The paper compares a number of estimators E for the regression line in the case of heavy tails. The distribution of the error is Student or Pareto. The errors are scaled to have InterQuartile Distance IQD = 1. The tail index x of the Pareto distribution of the explanatory variable varies between zero and three; the tail index h of the error varies between zero and four. The performance of an estimator E is measured by the loss function L(u) = u² applied to the difference between the slope a of the regression line and its estimate âE. Our approach is unorthodox. For various values of the tail indices x and h, we compute the average loss for ten batches of a hundred thousand simulations of a sample of size one hundred. Theorems and proofs are replaced by tables and programs. If the error has a symmetric distribution, the square root of the average loss is the empirical sd (standard deviation). From the tables in Section 7, it may be seen that, for good estimators, this sd depends on the tail index of the explanatory variables rather than the tail index of the error. As a rule of thumb, the sd is of the order of

    1/10^(x+1),   0 ≤ x ≤ 3, 0 ≤ h ≤ 4, n = 100.   (2)

This crude approximation is also valid for errors with a Pareto distribution. It may be used to determine whether an estimator of the regression line performs well for heavy tails.

• The paper introduces a new class of non-linear estimators. A weighted balance estimator of the regression line is a bisector of the sample. For even sample size, half the points lie below the bisector and half above. There are many bisectors. A weight sequence is used to select a bisector which yields a good estimate of the regression line. Weighted balance estimators for linear regression may be likened to the median for univariate samples. The LAD (Least Absolute Deviation) estimator is a weighted balance estimator. However, there exist weighted balance estimators which perform better when the explanatory variable has heavy tails.

The results of our paper are exemplary rather than analytical. They describe the outcomes of an initial exploration of estimators for linear regression with heavy tails. The numerical results in the tables in Section 7 may be regarded as benchmarks. They may be used to measure the performance of alternative estimators. Insight into the performance of estimators of the regression line for samples of size one hundred, where the explanatory variable has a Pareto distribution and the error a Student or Pareto distribution, may help to select a good estimator in the case of heavy tails.

The literature on the LAD (Least Absolute Deviation) estimator is extensive (see Dielman (2005)). The theory for the TS (Theil–Sen) estimator is less well developed, even though TS is widely used for data which may have heavy tails, as is apparent from a search on the Internet. A comparison of the performance of these two estimators is overdue.

When the tail indices x and h are positive, outliers occur naturally. Their effect on estimates has been studied in many papers. A major concern is whether an outlier should be accepted as a sample point.
In simulations, contamination does not play a role. In this paper, outliers do not receive special attention. Robust statistics does not apply here. If a good fairy were to delete all outliers, that would incommode us. It is precisely the outliers which allow us to position the underlying distribution in the (x, h)-domain and select the appropriate estimator. Equation (2) makes no sense in robust regression. Our procedure for comparing estimators by computing the average loss over several batches of a large number of simulations relies on uncontaminated samples. This does not mean that we ignore the literature on robust regression. Robust regression estimates may serve as initial estimates. (This approach does not do justice to the special nature of robust regression, which aims at providing good estimates of the regression line when working with contaminated data.)

In our paper, we have chosen a small number of geometric estimators of the regression line, whose performance is then compared for a symmetric and an asymmetric error distribution at various points in the (x, h)-domain; see Figure 1a. In robust regression, one distinguishes M-, R- and L-estimators. We treat the M-estimators LS and LAD. These minimize the lp distance of the residuals for p = 2 and p = 1, respectively. We have not looked at other values of p ∈ [1, ∞). Tukey's biweight and Huber's Method are non-geometric M-estimators since the estimate depends on the scaling on the vertical axis. The R-estimators of Jaeckel and Jurečková are variations on the LAD estimator. They are less sensitive to the behaviour of the density at the median, as we show in Section 3. They are related to the weighted balance estimators WB, and are discussed in Section 4. Least Trimmed Squares (LTS) was introduced in Rousseeuw (1984). It is a robust version of least squares. It is a geometric L-estimator.
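The l1 criterion that LAD minimizes is piecewise linear in the slope and intercept, so an optimal LAD line can always be chosen to pass through two sample points. For samples of the size considered here, this yields a simple exact fit by brute force. The sketch below (function name hypothetical; production code would use linear programming or iteratively reweighted least squares) assumes numpy:

```python
import itertools
import numpy as np

def lad_line(x, y):
    """Exact LAD fit for small samples.

    Since some minimiser of the sum of absolute residuals passes through
    two sample points, an O(n^3) search over all point pairs suffices.
    Returns (slope, intercept).
    """
    best, best_loss = None, np.inf
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical candidate line, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        loss = np.abs(y - (a * x + b)).sum()
        if loss < best_loss:
            best, best_loss = (a, b), loss
    return best
```

On a sample where four of five points are collinear and the fifth is a gross outlier, the search returns the line through the collinear majority, illustrating why LAD is far less sensitive to heavy-tailed errors than LS.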
Least Median of Squares, also introduced in Rousseeuw (1984), yields the central line of a closed strip containing fifty of the hundred sample points. It selects the strip with minimal vertical width. If the error has a symmetric unimodal density, one may add the extra condition that there are twenty-five sample points on either side of the strip. This estimator was investigated in a recent paper, Postnikov and Sokolov (2015). Maximum Likelihood may be used if the error distribution is known.
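Finding the narrowest such strip exactly is computationally expensive. A common approximation (in the spirit of Rousseeuw's PROGRESS algorithm) scans candidate lines through pairs of sample points and keeps the one with the smallest median squared residual; half the points then lie in a vertical strip around that line. A sketch under that approximation, assuming numpy and with a hypothetical function name:

```python
import itertools
import numpy as np

def lms_line(x, y):
    """Approximate Least Median of Squares fit.

    Scans lines through all pairs of sample points and returns the
    (slope, intercept) pair whose median squared residual is smallest.
    """
    best, best_med = None, np.inf
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical candidate line, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        med = np.median((y - (a * x + b)) ** 2)
        if med < best_med:
            best, best_med = (a, b), med
    return best
```

Because only the median residual matters, the fit is unaffected even when almost half the sample consists of gross outliers, which is the high-breakdown property that motivated the estimator.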