Applied Mathematical Sciences, Vol. 13, 2019, no. 17, 815 - 822 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2019.97106

Comparative the Performance of Robust Linear Regression Methods and its Application

Unchalee Tonggumnead^a and Nikorn Saengngam^b

a Department of Mathematics and Computer Science, Faculty of Science and Technology, Rajamangala University of Technology Thanyaburi, Phatum Thani 12110, Thailand

b Department of Technical Education, Faculty of Technical Education, Rajamangala University of Technology Thanyaburi, Phatum Thani 12110, Thailand

This article is distributed under the Creative Commons by-nc-nd Attribution License. Copyright © 2019 Hikari Ltd.

Abstract

This research applies robust estimation methods to actual data. M-estimation, quantile regression and Random Sample Consensus (RANSAC) are considered. The results show that, when the data contain no outliers, the parameter estimates from M-estimation, quantile regression and RANSAC are very similar. As the number of outliers increases, the RANSAC estimates differ more from those of M-estimation and quantile regression. RANSAC is more robust to outliers than M-estimation and quantile regression because it does not include the outliers in the parameter estimation; only inliers are used to fit the model.

Keywords: M-estimation, quantile regression, Random Sample Consensus.

1. Introduction

Regression analysis is a common and useful statistical tool for quantifying the relationship between a response variable (y) and explanatory variables (x). Ordinary Least Squares (OLS) is the method most commonly used for parameter estimation. However, OLS is effective only when its underlying assumptions hold. Namely, the regression model is linear in the coefficients and the error term; the error terms $\varepsilon_i$ are independent normal random variables with mean zero, so that $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$; and the error term has a constant variance $\sigma^2$ (no heteroscedasticity), that is, $V(\varepsilon_i) = \sigma^2$. Under these assumptions OLS gives the Best Linear Unbiased Estimator (BLUE) and provides test statistics with high power, which is why it is usually chosen for estimation and hypothesis testing about parameter values.

In reality, however, these assumptions may not be fulfilled by the data actually collected. For instance, the data may involve anomalous observations or heavy-tailed distributions. If a parameter estimation method is sensitive to such anomalies or distribution patterns, its reliability will deteriorate. To address this sensitivity, several techniques have been suggested, including attaching less weight to errors or disregarding them altogether, but a better solution is to construct a regression model that is robust to errors. This approach has been developed steadily during the past ten years to tackle estimation problems arising from outliers and leverage points. A case in point is M-estimation, which was introduced by [1] as a general framework for robust estimation; [2] extended M-estimation to M-quantile regression. Another robust regression method is quantile regression, which is very useful when the rate of change in the conditional quantile, expressed by the regression coefficients, depends on the quantile; this method is more robust to outliers [3]. However, these robust regression methods fit the model to all of the data, outliers included. If the data set contains a large number of outliers, the reliability of the fitted model is therefore still likely to worsen.

Another option that helps to solve this problem is Random Sample Consensus (RANSAC). Its preliminary assumption is that the training data consist of inliers, which can be explained by the model, and outliers, which are grossly erroneous samples that do not fit the model at all [4]. RANSAC therefore trains the parametric model only on inliers and ignores outliers. The objective of the present study is to find an effective robust method. M-estimation, quantile regression and RANSAC are considered.

2. Method

2.1 M-estimation

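The principle of M-estimation is to minimize a function $\rho$ of the residuals, standardized by a scale estimate $\hat{\sigma}$:

$$\hat{\beta}_M = \arg\min_{\beta} \sum_{i=1}^{n} \rho\left( \frac{y_i - \sum_{l=0}^{k} x_{il}\beta_l}{\hat{\sigma}} \right). \qquad (1)$$

Taking the first derivative of the objective in (1) with respect to $\beta_j$ and setting it to zero gives, with $\psi = \rho'$:

$$\sum_{i=1}^{n} x_{ij}\, \psi\left( \frac{y_i - \sum_{l=0}^{k} x_{il}\beta_l}{\hat{\sigma}} \right) = 0, \quad j = 0, 1, \ldots, k. \qquad (2)$$

Following [5] and [6], define the weight function $w_i = \psi(u_i)/u_i$, where $u_i$ is the standardized residual of observation $i$, so that equation (2) takes the form:

$$\sum_{i=1}^{n} x_{ij}\, w_i \left( y_i - \sum_{l=0}^{k} x_{il}\beta_l \right) = 0, \quad j = 0, 1, \ldots, k. \qquad (3)$$

Equation (3) is solved by iteratively reweighted least squares (IRLS). Starting from an initial value $\hat{\beta}^{0}$ and a scale estimate $\hat{\sigma}^{0}$, the initial weights $w_i^{0}$ are computed, so that

$$\sum_{i=1}^{n} x_{ij}\, w_i^{0} \left( y_i - \sum_{l=0}^{k} x_{il}\beta_l \right) = 0, \quad j = 0, 1, \ldots, k. \qquad (4)$$

This yields the following matrix form of (4):

$$X^{T} W X \beta = X^{T} W Y. \qquad (5)$$

Equation (5) can be written as:

$$\hat{\beta} = (X^{T} W X)^{-1} (X^{T} W Y),$$

where $W$ is the $n \times n$ diagonal weight matrix, $W = \mathrm{diag}(\{ w_i : i = 1, 2, \ldots, n \})$, as follows:

$$W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}$$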
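The IRLS iteration described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this study: the Huber weight function with tuning constant c = 1.345, the normalized median absolute deviation as scale estimate, the OLS starting value, the synthetic data and all function names are assumptions made for the example.

```python
import numpy as np

def huber_weights(u, c=1.345):
    """Huber weight w(u) = psi(u)/u: 1 inside [-c, c], c/|u| outside (assumed choice)."""
    abs_u = np.abs(u)
    return np.where(abs_u <= c, 1.0, c / np.maximum(abs_u, 1e-12))

def irls_m_estimate(X, y, c=1.345, max_iter=50, tol=1e-8):
    """M-estimation by IRLS; X must already contain the intercept column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # initial value: OLS fit
    for _ in range(max_iter):
        resid = y - X @ beta
        # Scale estimate: normalized median absolute deviation (assumed choice).
        sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if sigma < 1e-12:                            # perfect fit, nothing to reweight
            break
        w = huber_weights(resid / sigma, c)
        XtW = X.T * w                                # X^T W without building diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y) # solves X^T W X beta = X^T W Y
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic illustration (not the paper's data): y = 2 + 3x plus noise, one gross outlier.
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=20)
y[-1] += 100.0
X = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = irls_m_estimate(X, y)
print("OLS:", beta_ols, "M-estimate:", beta_m)
```

In this sketch the outlier keeps a small positive weight, so it still enters the fit, only with reduced influence.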

2.2 Quantile regression

Quantile regression is a type of regression analysis intended to estimate, and conduct inference about, conditional quantile functions. It provides a mechanism for estimating models for the conditional median function and for other conditional quantile functions. For a random variable $Y$ with the probability distribution function

$$F(y) = P(Y \le y), \qquad (6)$$

the $\tau$th quantile of $Y$ can be determined using the inverse function:

$$Q(\tau) = \inf\{ y : F(y) \ge \tau \}. \qquad (7)$$

For $\tau \in (0, 1)$, the coefficients of the conditional quantile function are estimated by

$$\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^{k+1}} \sum_{i=1}^{n} \rho_{\tau}(Y_i - X_i'\beta), \qquad (8)$$

where $\rho_{\tau}(u) = u\,(\tau - I(u < 0))$ is the check loss function; see [3] and [4].
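To make the link between the quantile definition and the minimization problem concrete, the following sketch treats the intercept-only special case, in which minimizing the check loss over a single constant recovers a sample quantile. The function names and the toy data are assumptions for illustration; a full quantile regression would minimize the same loss over all coefficients, typically by linear programming.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

def sample_quantile_by_check_loss(y, tau):
    """Minimize sum_i rho_tau(y_i - q) over candidate values q taken from y itself."""
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - q, tau).sum() for q in y]
    return y[int(np.argmin(losses))]

# Toy data with one extreme value (not from the paper).
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median_fit = sample_quantile_by_check_loss(y, 0.5)
print(median_fit, y.mean())   # 3.0 versus a mean of 22.0
```

The extreme value moves the mean far from the bulk of the data but leaves the tau = 0.5 minimizer at the median, which is the sense in which the method resists outliers in the response.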

2.3 Random Sample Consensus (RANSAC)

Random Sample Consensus (RANSAC) is a technique proposed by [4]. RANSAC generates candidate solutions by using the minimum number of observations (data points) required to estimate the underlying model parameters, under the assumption that the training data consist of inliers, which can be explained by the model, and outliers, which are grossly erroneous samples that do not fit the model at all. Using outliers when training the model would therefore increase the final prediction error. RANSAC involves the following steps.

1: Select randomly the minimum number of points required to define the model. Call this subset the hypothetical inliers.
2: Solve for the parameters of the model fitted to the set of hypothetical inliers.
3: The error function, or distance threshold $t$, is calculated under zero-mean Gaussian noise with standard deviation $\sigma$, where $t^2 = 3.84\sigma^2$. Data points whose error with respect to the model from step 2 is less than $t$ are added to the inliers.
4: Steps 1-3 are iterated until the number of data points in the inliers is equal to or greater than $d$.

The variable $d$ represents the minimum number of samples to be accepted as a valid consensus to generate a final model for the iteration, given by

$$d = w \times |\text{data set}|,$$

where $|\text{data set}|$ represents the total number of samples in the data set and $w$ represents the probability of choosing an inlier.
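The steps above can be sketched for a simple linear regression as follows. This is a minimal sketch under stated assumptions rather than the procedure used in this study: the noise standard deviation sigma and the inlier probability w are taken as known, the cap on the number of trials, the final least-squares refit on the consensus set, the synthetic data and the function name ransac_line are all illustrative choices.

```python
import numpy as np

def ransac_line(x, y, t, d, max_trials=200, seed=0):
    """Fit y = b0 + b1*x by RANSAC; t is the residual threshold, d the consensus size."""
    rng = np.random.default_rng(seed)
    n = len(x)
    best_inliers = None
    for _ in range(max_trials):
        # Step 1: a line is defined by two points (the hypothetical inliers).
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue                      # vertical line, cannot be written as y = b0 + b1*x
        # Step 2: solve for the line through the two sampled points.
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        # Step 3: points with absolute error below t form the consensus set.
        inliers = np.abs(y - (b0 + b1 * x)) < t
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        # Step 4: stop once the consensus set has at least d points.
        if inliers.sum() >= d:
            break
    if best_inliers is None:
        raise ValueError("no valid sample found")
    # Assumed final step: ordinary least squares on the inliers only.
    b1, b0 = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return b0, b1, best_inliers

# Synthetic illustration (not the paper's data): y = 1 + 2x plus noise, three gross outliers.
rng = np.random.default_rng(1)
x = np.arange(30.0)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2, size=30)
y[[3, 17, 25]] += 40.0
sigma, w = 0.2, 0.7                       # assumed noise level and inlier probability
t = np.sqrt(3.84) * sigma                 # t^2 = 3.84 * sigma^2
d = int(w * len(x))
b0, b1, inliers = ransac_line(x, y, t, d)
print(b0, b1, np.flatnonzero(~inliers))
```

Unlike a weighting scheme, points outside the consensus set receive no weight at all in the final fit, which is the property the comparison in this paper relies on.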

3. Application of the data


Figure 1. (a) The first data set: the independent variable X is the average monthly household income and the dependent variable Y is the average monthly household expenditure, for all 77 Thai provinces in 2015. (b) The second data set: X is the number of rainy days and Y is the rainfall (mm), for 47 Thai provinces in 2013. (c) The third data set: X is the relative humidity and Y is the rainfall (mm), for 47 Thai provinces in 2012. (d) The fourth data set: X is the number of people using water and Y is the production capacity of the water supply (volume, m3), for 74 Thai provinces in 2019. Data sources: [8], [9] and [10].

Table 1. Outlier diagnostics.

Data   Number of outliers   Outlier case   Std. residual
1      0                    -              < 3.00 (all cases)
2      1                    No. 44         4.707
3      1                    No. 44         4.762
4      2                    No. 8          5.441
                            No. 27         -4.825

The outlier diagnostics are shown in Table 1. An observation is treated as an outlier when its standardized residual satisfies |Std. residual| > 3.00. The results show that no outlier was found in the first data set, while the second and third data sets each contain one outlier and the fourth data set contains two.
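A diagnostic of this kind can be sketched as follows. The paper does not state how the standardized residuals were computed, so this example assumes an OLS fit with each residual divided by the residual standard error; the synthetic data and the function name are likewise illustrative only.

```python
import numpy as np

def standardized_residuals(x, y):
    """OLS residuals divided by the residual standard error (assumed definition)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s = np.sqrt(resid @ resid / (len(y) - X.shape[1]))   # sqrt(SSE / (n - p))
    return resid / s

# Synthetic illustration (not the paper's data): one shifted observation at index 10.
rng = np.random.default_rng(1)
x = np.arange(40.0)
y = 5.0 + 0.5 * x + rng.normal(0.0, 1.0, size=40)
y[10] += 25.0
std_res = standardized_residuals(x, y)
flagged = np.flatnonzero(np.abs(std_res) > 3.0)          # criterion |Std. residual| > 3.00
print(flagged)
```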


Figure 2. Scatter plots and estimated regression lines from M-estimation, quantile regression and RANSAC for (a) the average monthly household income and the average monthly household expenditure, all 77 Thai provinces, 2015; (b) the number of rainy days and the rainfall (mm), 47 Thai provinces, 2013; (c) the relative humidity and the rainfall (mm), 47 Thai provinces, 2012; (d) the number of people using water and the production capacity of the water supply (volume, m3), 74 Thai provinces, 2019. In every panel, M-estimation is represented by a blue double-dash line, quantile regression by a green long-dotted line, and RANSAC by a red dashed line.

Data   Method                $\hat{\beta}_0$   $\hat{\beta}_1$
1      M-estimation          2753.6714         0.6906
       Quantile regression   2875.0967         0.6816
       RANSAC                3098.0000         0.6740
2      M-estimation          -1148.0707        20.6268
       Quantile regression   -1049.6250        19.2688
       RANSAC                -1066.6435        19.4694
3      M-estimation          -7755.8732        125.8379
       Quantile regression   -7366.1571        120.2102
       RANSAC                -6180.3111        103.0861
4      M-estimation          2998.4501         1.3772
       Quantile regression   3587.7236         1.1842
       RANSAC                3889.4809         1.2556

Consider the parameter estimates in the table above together with Figure 2. When the data contain no outlier, M-estimation, quantile regression and RANSAC produce very similar parameter estimates. When the data contain one or two outliers, the RANSAC estimates differ more from those of M-estimation and quantile regression. RANSAC is more robust to outliers than M-estimation and quantile regression because it does not include the outliers in the parameter estimation; only inliers are used to fit the model.

4. Conclusion and Discussion

The present study applies parameter estimation methods that are robust to outliers, which matter especially as the number of outliers increases and the reliability of the model declines. M-estimation, quantile regression and Random Sample Consensus (RANSAC) are considered. The results show that, when the data contain no outlier, the three methods produce very similar parameter estimates. As the number of outliers increases, the RANSAC estimates differ more from those of M-estimation and quantile regression. RANSAC is more robust to outliers than M-estimation and quantile regression because it does not include the outliers in the parameter estimation; only inliers are used to fit the model. That is, RANSAC exhibits more robustness when the data under investigation contain outliers, particularly a high proportion of them. Further research should simulate outliers of many types in order to increase the credibility of the results.

Acknowledgements. The authors wish to gratefully acknowledge the referee of this paper, who helped to clarify and improve its presentation.

References

[1] P. J. Huber, , Springer Berlin Heidelberg, 2011, 1248 - 1251. https://doi.org/10.1007/978-3-642-04898-2_594

[2] J. Breckling and R. Chambers, M-quantiles, Biometrika, 75 (4) (1988), 761-771. https://doi.org/10.1093/biomet/75.4.761

[3] R. Koenker and G. Bassett Jr., Regression quantiles, Econometrica: Journal of the Econometric Society, 46 (1) (1978), 33-50. https://doi.org/10.2307/1913643

[4] M.A. Fischler and R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, 24 (6) (1981), 381-395. https://doi.org/10.1145/358669.358692

[5] N.R. Draper and H. Smith, Applied Regression Analysis, John Wiley & Sons, 1998. https://doi.org/10.1002/9781118625590

[6] Y. Susanti, H. Pratiwi, S. Sulistijowati H., T. Liana, M estimation, S estimation, and MM estimation in robust regression, International Journal of Pure and Applied Mathematics, 91 (3) (2014), 349-360. https://doi.org/10.12732/ijpam.v91i3.7

[7] C. Chen, An introduction to quantile regression and the QUANTREG procedure, in: Proceedings of the Thirtieth Annual SAS Users Group International Conference, SAS Institute Inc. Cary, NC, 2005.

[8] National Statistical Office, Household income and the number of households. Available online at: http://service.nso.go.th/nso/web/statseries/statseries11.html, 2019.

[9] Thai Meteorological Department, Annual Rainfall, Rain-day and Relative Humidity: Selected Location by Region 2012 – 2013, 2019. Available online at: http://service.nso.go.th/nso/web/statseries/statseries27.html

[10] Provincial Waterworks Authority, Summary of Provincial Waterworks Authority data by province, 2019. Available online at: https://www.pwa.co.th/province/report

Received: August 5, 2019; Published: September 3, 2019