Applied Mathematical Sciences, Vol. 13, 2019, no. 17, 815 - 822
HIKARI Ltd, www.m-hikari.com
https://doi.org/10.12988/ams.2019.97106

Comparing the Performance of Robust Linear Regression Methods and an Application

Unchalee Tonggumnead (a) and Nikorn Saengngam (b)

(a) Department of Mathematics and Computer Science, Faculty of Science and Technology, Rajamangala University of Technology Thanyaburi, Pathum Thani 12110, Thailand
(b) Department of Technical Education, Faculty of Technical Education, Rajamangala University of Technology Thanyaburi, Pathum Thani 12110, Thailand

This article is distributed under the Creative Commons by-nc-nd Attribution License. Copyright © 2019 Hikari Ltd.

Abstract

This research applies robust estimation methods to real data. M-estimation, quantile regression, and Random Sample Consensus (RANSAC) are considered. The results show that when no outliers are present, the parameter estimates produced by M-estimation, quantile regression, and RANSAC are very similar. As the proportion of outliers increases, the RANSAC estimates diverge from those of M-estimation and quantile regression. RANSAC is more robust to outliers than M-estimation and quantile regression because its parameter estimation does not include the outliers in the model: only the inliers are used to fit it.

Keywords: robust regression, M-estimation, quantile regression, Random Sample Consensus

1 Introduction

Regression analysis is a common and useful statistical tool for quantifying the relationship between a response variable (y) and explanatory variables (x). Ordinary Least Squares (OLS) is the most commonly used method for parameter estimation. However, OLS is effective only under the assumptions of regression analysis.
Namely, the regression model is linear in the coefficients and the error term. The error terms $\varepsilon_i$ are independent normal random variables with mean zero and $\operatorname{cov}(\varepsilon_i, \varepsilon_j) = 0$, and the error term has a constant variance $\sigma^2$ (no heteroscedasticity), that is, $V(\varepsilon_i) = \sigma^2$. Estimation and hypothesis testing about the parameter values usually rely on OLS because, under these assumptions, OLS yields a good estimator, the Best Linear Unbiased Estimator (BLUE), and provides test statistics with high power. However, in reality these assumptions may not be fulfilled by the actual data collected. For instance, the data may involve anomalous observations or heavy-tailed distributions. Thus, if a parameter estimation method is sensitive to anomalies in the data or to the distribution pattern, its reliability will eventually deteriorate. To address this sensitivity, several techniques have been suggested, including attaching less weight to large errors or disregarding them altogether, but a better solution is to construct a regression model that is robust to such errors. This approach has been developed continuously during the past ten years to tackle estimation problems arising from outliers and leverage points. A case in point is M-estimation, introduced by [1] as a general framework for robust estimation; [2] extended M-estimation to M-quantile regression. Another type of robust regression is quantile regression, a form of regression analysis that is very useful when the rate of change in the conditional quantile can be described by regression coefficients that depend on the quantile; this method is more robust to outliers [3]. However, these robust regression methods still fit the model with all the outliers included. If a data set contains a large number of outliers, the reliability of the model is reduced.
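The sensitivity of OLS described above is easy to demonstrate numerically. The following minimal NumPy sketch (not part of the paper; the simulated line and noise level are illustrative assumptions) fits a line by the normal equations and shows how a single gross outlier shifts the estimated slope:

```python
import numpy as np

def ols_fit(x, y):
    """Ordinary least squares via the normal equations: beta = (X'X)^{-1} X'y."""
    X = np.column_stack([np.ones(len(x)), x])  # prepend intercept column
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=50)  # true line: y = 1 + 2x

clean = ols_fit(x, y)          # slope estimate close to 2

# A single gross error is enough to pull the OLS line away from the bulk of the data.
y_out = y.copy()
y_out[-1] += 100.0
contaminated = ols_fit(x, y_out)

print(clean, contaminated)
```

With this configuration the contaminated slope estimate moves by roughly one full unit, illustrating why robust alternatives are needed.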
For this reason, their reliability will likely still worsen with data that contain a high proportion of outliers. Another option that helps solve this problem is Random Sample Consensus (RANSAC), whose preliminary assumption is that the training data consist of inliers that can be explained by the model and outliers that are gross-erroneous samples which do not fit the model at all [4]. RANSAC therefore trains the parametric model only with the inliers, ignoring the outliers. The objective of the present study is to find an effective robust method; M-estimation, quantile regression, and Random Sample Consensus (RANSAC) are considered.

2. Method

2.1 M-estimation

The principle of M-estimation is to minimize the function

$$\hat{\beta}_M = \min_{\beta} \sum_{i=1}^{n} \rho\!\left( y_i - \sum_{j=0}^{k} x_{ij}\beta_j \right) = \min_{\beta} \sum_{i=1}^{n} \rho\!\left( \frac{y_i - \sum_{j=0}^{k} x_{ij}\beta_j}{\hat{\sigma}} \right). \quad (1)$$

Taking the first derivative of $M$ with respect to $\hat{\beta}$ gives

$$\sum_{i=1}^{n} x_{ij}\, \psi\!\left( \frac{y_i - \sum_{j=0}^{k} x_{ij}\beta_j}{\hat{\sigma}} \right) = 0, \quad j = 0,1,\dots,k. \quad (2)$$

Following [5] and [6], determine the weight function so that equation (2) takes the form

$$\sum_{i=1}^{n} x_{ij}\, w_i \left( y_i - \sum_{j=0}^{k} x_{ij}\beta_j \right) = 0, \quad j = 0,1,\dots,k. \quad (3)$$

Using iteratively reweighted least squares (IRLS), calculate an initial value $\hat{\beta}^{0}$, where $\hat{\sigma}_0$ is a scale estimate, so that

$$\sum_{i=1}^{n} x_{ij}\, w_i^{0} \left( y_i - \sum_{j=0}^{k} x_{ij}\beta_j^{0} \right) = 0. \quad (4)$$

This yields the following matrix form of (4):

$$X^{T} W X \beta = X^{T} W Y. \quad (5)$$

Equation (5) can be written as

$$\hat{\beta} = (X^{T} W X)^{-1} X^{T} W Y,$$

where $W$ is the $n \times n$ weight matrix $W = \operatorname{diag}(\{ w_i : i = 1,2,\dots,n \})$:

$$W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}.$$

2.2 Quantile regression

Quantile regression is a type of regression analysis intended to estimate, and conduct inference about, conditional quantile functions; it provides a mechanism for estimating models of the conditional median and other conditional quantile functions. For a random variable $Y$ with probability distribution function

$$F(y) = P(Y \le y), \quad (6)$$

the $\tau$th quantile of $Y$ can be determined using the inverse function

$$Q(\tau) = \inf\{\, y : F(y) \ge \tau \,\}. \quad (7)$$

For $\tau \in (0,1)$, the conditional quantile estimator can be written as

$$\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^{k}} \sum_{i=1}^{n} \rho_{\tau}\!\left( Y_i - X_i \beta \right) \quad (8)$$

(see [3] and [4]).

2.3 Random Sample Consensus (RANSAC)

Random Sample Consensus (RANSAC) is a technique proposed by [4]. RANSAC is a resampling technique that generates candidate solutions using the minimum number of observations (data points) required to estimate the underlying model parameters, under the assumption that the training data consist of inliers that can be explained by the model and outliers that are gross-erroneous samples which do not fit the model at all. Using outliers when training the model would therefore inflate the final prediction error. RANSAC involves the following steps:

1. Randomly select the minimum number of points required to define the model. Call this subset the hypothetical inliers.
2. Fit the parameters of the model to the set of hypothetical inliers.
3. Compute the error function or distance threshold under zero-mean Gaussian noise with standard deviation $\sigma$, where $t^2 = 3.84\sigma^2$. Data points whose error with respect to the model fitted in step 2 is smaller than $t$ are added to the inlier (consensus) set.
4. Repeat steps 1-3 until the number of data points in the consensus set is equal to or greater than $d$. The variable $d$ represents the minimum number of samples required to accept a consensus set as valid and generate a final model for the iteration:

$$d = w \times |\text{data set}|,$$

where $|\text{data set}|$ is the total number of samples in the data set and $w$ is the probability of choosing an inlier.

3. Application of the data

Figure 1.
(a) The first data set: the independent variable X is average monthly household income and the dependent variable Y is average monthly household expenditure, for all 77 Thai provinces in 2015. (b) The second data set: X is the number of rainy days and Y is the rainfall (mm), for all 47 Thai provinces in 2013. (c) The third data set: X is the relative humidity and Y is the rainfall (mm), for all 47 Thai provinces in 2012. (d) The fourth data set: X is the number of people using water and Y is the water supply production capacity (m3), for 74 Thai provinces in 2019 [8], [9] and [10].

Table 1. Outlier diagnostics.

Data set   Number of outliers   Outlier case     Std. residual
1          0                    -                < 3.00 (all cases)
2          1                    No. 44           4.707
3          1                    No. 44           4.762
4          2                    No. 8 / No. 27   5.441 / -4.825

Outlier diagnostics are shown in Table 1. An observation is flagged as an outlier when its standardized residual exceeds 3.00 in absolute value. No outlier is found in the first data set, while the second, third and fourth data sets contain 1, 1 and 2 outliers, respectively.

Figure 2. (a) Scatter plot and estimated regression curves from M-estimation, quantile regression and RANSAC for average monthly household income versus average monthly household expenditure, all 77 Thai provinces, 2015. In all panels, M-estimation is represented by a blue double-dash line, quantile regression by a green long-dotted line, and RANSAC by a red dashed line.
(b) Scatter plot and estimated regression curves from M-estimation, quantile regression and RANSAC for the number of rainy days versus rainfall (mm), all 47 Thai provinces, 2013. (c) Scatter plot and estimated regression curves from M-estimation, quantile regression and RANSAC for relative humidity versus rainfall (mm), all 47 Thai provinces, 2012.