Mathematical Methods and Systems in Science and Engineering

The Identification of Good and Bad High Leverage Points in Multiple Linear Regression Model

HABSHAH MIDI¹ and MOHAMMED A. MOHAMMED²
Faculty of Science and Institute for Mathematical Research
University Putra Malaysia
43400 UPM Serdang, Selangor
Malaysia
[email protected]; [email protected]

Abstract: - Much work has been carried out on the detection of high leverage points, but little attention has been paid to classifying them into good and bad leverage points. It is crucial to detect bad leverage points, as they have an undue effect on the parameter estimates. In this paper, we propose a new technique to identify good and bad leverage points. We investigate the performance of the proposed method on some well-known data sets.

Key-Words: - DRGP, MGDFFITS, Outlier, High Leverage Points, Studentized Residual

1 Introduction

There are several types of outliers in a regression problem. Observations are judged as residual outliers on the basis of how unsuccessful the fitted regression equation is in accommodating them. As such, any observation that has a large residual is referred to as a residual outlier. Observations that are extreme or outlying in the Y-coordinate are called Y-outliers. Additionally, high leverage points are those observations that are outlying in the X variables. They can be classified into good and bad leverage points. Good leverage points and bad leverage points are those outlying observations in the explanatory variables that follow and do not follow the pattern of the majority of the data, respectively. Bad leverage points have a larger impact on the computed values of various estimates. On the other hand, good leverage points contribute to the efficiency of an estimate. In this respect, only bad leverage points need to be down-weighted, and good leverage points should not be given low weight in the computation of the weighting function of any robust method. However, it is now evident that most robust methods attempt to reduce the effect of outliers by down-weighting them irrespective of whether they are good or bad leverage points.

There are a number of good papers in the literature on the detection of high leverage points (see [2], [8], [10]). Nonetheless, those identification methods mostly focus on the detection of high leverage points without pinpointing the good and bad leverage points. It is very important to detect and classify the good and bad leverage points, as only the bad leverage points are responsible for misleading conclusions about the fit of the regression model.

It is easy to capture the existence of several types of outliers in regression analysis by using graphical methods. If only one independent variable is being considered, the four types of outliers can easily be observed from a scatter plot of y against x. For more than one independent variable, Rousseeuw and Van Zomeren (1990) [16] proposed a robust diagnostic plot, or outlier map, which is more efficient than the non-robust plot for classifying observations into four types of data points: regular or good observations, good leverage points, bad leverage points and vertical outliers. The identification of multiple vertical outliers and multiple high leverage points is presented in Section 2. In Section 3, we propose an improved procedure for classifying

ISBN: 978-1-61804-281-1 147 Mathematical Methods and Systems in Science and Engineering

observations into the four categories. Two numerical examples are presented in Section 4 to show the merit of our proposed method.

2 Identification of multiple y-outliers and multiple high leverage points

Consider a multiple linear regression model:

$$Y = X\beta + \varepsilon \qquad (1)$$

where $Y$ is an $n \times 1$ vector of observations of the dependent variable, $X$ is an $n \times p$ matrix of independent variables, $\beta$ is a $p \times 1$ vector of unknown regression parameters, $\varepsilon$ is an $n \times 1$ vector of random errors that are normally and independently distributed, $\varepsilon \sim NID(0, \sigma^2)$, and $p$ is the number of independent variables. The linear regression model in (1) can be rewritten as follows:

$$y_i = x_i^T \beta + \varepsilon_i, \quad i = 1, 2, \ldots, n \qquad (2)$$

where $y_i$ is the observed value of the dependent variable and $x_i$ is a $p \times 1$ vector of predictors. The OLS estimate for (1) is given by

$$\hat{\beta} = (X^T X)^{-1} X^T Y \qquad (3)$$

and the residuals can be expressed in terms of the true disturbances as

$$\hat{\varepsilon} = Y - \hat{Y} = (I - W)\varepsilon \qquad (4)$$

where $W = X (X^T X)^{-1} X^T$ is known as the hat matrix (or leverage matrix). The diagonal elements of the leverage matrix are called the hat values (see [1], [3], [8], [11]), denoted $w_{ii}$ and given by

$$w_{ii} = x_i^T (X^T X)^{-1} x_i, \quad i = 1, 2, \ldots, n \qquad (5)$$

The hat matrix is often used as a diagnostic to identify leverage points [10]. Unfortunately, the hat matrix suffers from the masking effect, so $w_{ii}$ often fails to detect high leverage points. Hadi (1992) [8] suggested a single-case-deleted measure called potentials. The diagonal elements of the potentials, denoted $p_{ii}$, are given by (see [3], [8])

$$p_{ii} = x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i, \quad i = 1, 2, \ldots, n \qquad (6)$$

where $X_{(i)}$ is the matrix $X$ with the ith row excluded. We can rewrite $p_{ii}$ as a function of $w_{ii}$ as

$$p_{ii} = \frac{w_{ii}}{1 - w_{ii}}, \quad i = 1, 2, \ldots, n \qquad (7)$$

Unfortunately, none of these diagnostic measures is successful in identifying multiple high leverage points. To rectify this problem, Imon (2005) [10] proposed the 'generalized potentials' (GP), denoted $p^*_{ii}$. The GP diagnostic method is able to detect multiple high leverage points, but it is not adequately effective in identifying the exact number of outliers. This is due to the choice of the initial basic subset. Habshah et al. (2009) [14] developed the Diagnostic Robust Generalized Potential (DRGP) to improve the rate of detection of high leverage points. The DRGP consists of two steps: in the first step, a robust method is used to identify the suspected high leverage points; in the second step, the generalized potentials diagnostic approach is used to confirm the suspicion. Habshah et al. noted that the low leverage points (if any) are put back into the estimation subset R sequentially (the observation with the least $p_{ii}$ is put back first) and the $p_{ii}$ values are re-computed. This process is continued until every member of the deletion set has been checked as to whether or not it can be declared a high leverage point.

The suspected high leverage points are determined by the robust Mahalanobis distance (RMD), based on the minimum volume ellipsoid (MVE) developed by Rousseeuw [11]:

$$RMD_i = \sqrt{(x_i - T_R(X))^T \, C_R(X)^{-1} \, (x_i - T_R(X))}, \quad \text{for } i = 1, 2, \ldots, n \qquad (8)$$

where $T_R(X)$ and $C_R(X)$ are the robust location and shape estimates of the MVE, respectively. Imon [10] suggested a cut-off value for the robust Mahalanobis distances of

$$\mathrm{Median}(RMD_i) + 3\,\mathrm{MAD}(RMD_i)$$

Let us denote the set of 'good' cases 'remaining' in the analysis by R and the set of 'bad' cases 'deleted' by D. Also suppose that R contains (n − d) cases


after d < (n − p) cases in D are deleted. Once the remaining set is determined, the second step of DRGP is carried out to confirm the suspected high leverage points by using the 'generalized potentials' (GP), denoted $p^*_{ii}$ and defined as

$$p^*_{ii} = \begin{cases} w_{ii}^{(-D)} & \text{for } i \in D \\[4pt] \dfrac{w_{ii}^{(-D)}}{1 - w_{ii}^{(-D)}} & \text{for } i \in R \end{cases}$$

where $w_{ii}^{(-D)} = x_i^T (X_R^T X_R)^{-1} x_i$. Observations whose $p^*_{ii}$ values exceed the threshold

$$p^*_{ii} > \mathrm{Median}(p^*_{ii}) + c\,\mathrm{MAD}(p^*_{ii})$$

where c can be taken as a constant value of 2 or 3, are declared high leverage points.

The Studentized residuals (internally Studentized residuals) and the R-Student residuals (externally Studentized residuals) are widely used measures for the identification of outliers (see [5]). The Studentized residual is defined as

$$r_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - w_{ii}}}, \quad i = 1, 2, \ldots, n \qquad (9)$$

where $\hat{\sigma}^2 = \hat{\varepsilon}^T \hat{\varepsilon} / (n - p - 1)$ is the residual mean square. The special case

$$t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - w_{ii}}}, \quad i = 1, 2, \ldots, n \qquad (10)$$

is called the R-Student, where $\hat{\sigma}_{(i)}^2$ is the residual mean square excluding the ith case. These two measures are also unable to detect multiple outliers.

Imon [10] suggested a generalized Studentized residual (Gti) based on group deletion to identify multiple outliers or vertical outliers. The generalized version of regression diagnostics first requires selecting a deletion group D that contains all suspected influential cases. The suspected influential cases comprise outliers and high leverage points, which are considered separately: outliers and high leverage points are identified using the robust RLS residuals (Rousseeuw & Leroy 1987 [11]) and the generalized potentials (Imon, 2002 [9]), respectively. The union of the set of suspected outliers and the set of suspected high leverage points forms the deletion set, which has d observations. Nevertheless, the initial basic subset of Gti is not very stable and suffers from masking effects. In this regard, the DRGP is employed to rectify this problem. Subsequently, the vector of estimated parameters for the remaining group, denoted $\hat{\beta}_{(R)}$, is defined as

$$\hat{\beta}_{(R)} = (X_R^T X_R)^{-1} X_R^T Y_R \qquad (11)$$

The ith deletion residual is given by

$$\hat{\varepsilon}_{i(R)} = y_i - x_i^T \hat{\beta}_{(R)}, \quad i = 1, 2, \ldots, n \qquad (12)$$

The ith externally Studentized residual $t_i^*$ for the remaining group R is given by

$$t_i^* = \frac{y_i - x_i^T \hat{\beta}_{(R)}}{\hat{\sigma}_{R-i}\sqrt{1 - w_{ii(R)}}} \qquad (13)$$

The diagonal element of the hat matrix is given by

$$w_{ii(R)} = x_i^T (X_R^T X_R)^{-1} x_i, \quad i = 1, 2, \ldots, n \qquad (14)$$

By utilizing the results of Rao (1965), when an additional point i is added to the R set we obtain

$$w_{ii(R+i)} = x_i^T (X_R^T X_R + x_i x_i^T)^{-1} x_i = \frac{w_{ii(R)}}{1 + w_{ii(R)}} \qquad (15)$$

$$\hat{\beta}_{(R+i)} = (X_R^T X_R + x_i x_i^T)^{-1} (X_R^T Y_R + x_i y_i) = \hat{\beta}_{(R)} + \frac{(X_R^T X_R)^{-1} x_i}{1 + w_{ii(R)}}\,\hat{\varepsilon}_{i(R)} \qquad (16)$$

This leads to the formulation of the externally Studentized residual for $i \notin R$, defined as

$$t_i^* = \frac{y_i - x_i^T \hat{\beta}_{(R+i)}}{\hat{\sigma}_R \sqrt{1 - w_{ii(R+i)}}} = \frac{\hat{\varepsilon}_{i(R)}}{\hat{\sigma}_R \sqrt{1 + w_{ii(R)}}} \qquad (17)$$

Subsequently, the Modified Generalized Studentized residuals (MGti) for the whole data set can be developed by combining Equation (13) and Equation (17) as follows:


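$$MGt_i = \begin{cases} \dfrac{\hat{\varepsilon}_{i(R)}}{\hat{\sigma}_{R-i}\sqrt{1 - w_{ii(R)}}} & \text{for } i \in R \\[8pt] \dfrac{\hat{\varepsilon}_{i(R)}}{\hat{\sigma}_R\sqrt{1 + w_{ii(R)}}} & \text{for } i \notin R \end{cases} \qquad (18)$$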
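The quantities defined in this section can be sketched numerically. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the deletion set D is taken as given (the robust MVE-based step that produces it is not reproduced here), X is assumed to include an intercept column so that k = X.shape[1], and the scale estimates in Equation (18) are assumed to be residual standard errors with n_R − k and n_R − k − 1 degrees of freedom, a choice the text does not spell out. All function names are illustrative.

```python
import numpy as np


def _leverages(X, X_fit):
    """x_i^T (X_fit^T X_fit)^{-1} x_i for every row x_i of X."""
    inv = np.linalg.inv(X_fit.T @ X_fit)
    return np.einsum("ij,jk,ik->i", X, inv, X)


def _mask(n, deleted):
    """Boolean mask that is True for cases in the deletion set D."""
    in_D = np.zeros(n, dtype=bool)
    in_D[np.asarray(list(deleted), dtype=int)] = True
    return in_D


def hat_values(X):
    """Hat values w_ii, Eq. (5)."""
    return _leverages(X, X)


def potentials(X):
    """Potentials p_ii = w_ii / (1 - w_ii), Eq. (7)."""
    w = hat_values(X)
    return w / (1.0 - w)


def generalized_potentials(X, deleted):
    """Generalized potentials p*_ii for a given deletion set D."""
    in_D = _mask(len(X), deleted)
    w = _leverages(X, X[~in_D])          # w_ii^(-D), based on the R rows only
    p_star = w.copy()                    # cases in D keep w_ii^(-D)
    p_star[~in_D] = w[~in_D] / (1.0 - w[~in_D])
    return p_star


def mgt(X, y, deleted):
    """Modified generalized Studentized residuals, Eq. (18)."""
    n, k = X.shape
    in_D = _mask(n, deleted)
    R = ~in_D
    beta_R = np.linalg.solve(X[R].T @ X[R], X[R].T @ y[R])   # Eq. (11)
    resid = y - X @ beta_R                                   # Eq. (12)
    w = _leverages(X, X[R])                                  # Eq. (14)
    n_R = int(R.sum())
    sse_R = np.sum(resid[R] ** 2)
    # Assumed scale estimates: residual standard error of the fit on R,
    # and of the fit on R with case i removed (standard deletion identity).
    sigma_R = np.sqrt(sse_R / (n_R - k))
    sse_drop = sse_R - resid[R] ** 2 / (1.0 - w[R])
    sigma_R_minus_i = np.sqrt(sse_drop / (n_R - k - 1))
    out = np.empty(n)
    out[R] = resid[R] / (sigma_R_minus_i * np.sqrt(1.0 - w[R]))
    out[in_D] = resid[in_D] / (sigma_R * np.sqrt(1.0 + w[in_D]))
    return out
```

With an empty deletion set, `generalized_potentials` reduces to the potentials of Equation (7) and `mgt` reduces to the R-Student of Equation (10), which gives a quick sanity check on the sketch.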

3 New Diagnostic Plot for Classifying Observations into Four Categories

Rousseeuw and Van Zomeren (1990) [16] proposed a robust diagnostic plot which is more effective than the non-robust plot for classifying observations into regular observations, vertical outliers, and good and bad high leverage points. Their plot displays the standardized LMS residuals against the robust distances based on the MVE. The non-robust plot displays the standardized OLS residuals against the Mahalanobis distances (MD). We suspect that the robust diagnostic plot is not very effective in classifying the observations into their respective categories, since it is based on the robust Mahalanobis distance, which suffers from masking and swamping effects. As such, we propose to improve the classification method by plotting MGti against DRGP according to the following rule (19):

i- Regular Observation (RO): Any observation is declared RO if $|MGt_i| \le 2.5$ and $p^*_{ii} < \mathrm{Median}(p^*_{ii}) + c\,\mathrm{MAD}(p^*_{ii})$.
ii- Vertical Outlier (VO): Any observation is declared VO if $|MGt_i| > 2.5$ and $p^*_{ii} < \mathrm{Median}(p^*_{ii}) + c\,\mathrm{MAD}(p^*_{ii})$.
iii- Good Leverage Point (GLP): Any observation is declared GLP if $|MGt_i| \le 2.5$ and $p^*_{ii} \ge \mathrm{Median}(p^*_{ii}) + c\,\mathrm{MAD}(p^*_{ii})$.
iv- Bad Leverage Point (BLP): Any observation is declared BLP if $|MGt_i| > 2.5$ and $p^*_{ii} \ge \mathrm{Median}(p^*_{ii}) + c\,\mathrm{MAD}(p^*_{ii})$.

4 Example and discussion

4.1 Stackloss Data

Our first example is the well-known Stackloss dataset, taken from Brownlee (1965) [13], which contains 21 observations with three independent variables (air flow, cooling water and acid concentration). Many researchers have pointed out that this dataset has one vertical outlier (case 4) and four high leverage points (cases 1, 2, 3 and 21). It can be seen from Table 1 that all existing methods fail to identify those observations correctly. The DRGP detects the four high leverage points correctly but is not able to identify case 4 as a vertical outlier. The MGti successfully detects the five observations as outliers, but it does not specify to which category those outliers belong.

Next, we would like to see the classification made by the OLS-MD, LMS-RMD and MGti-DRGP plots. Due to space constraints, the plots are not displayed. The OLS-MD plot fails to locate any outlier. Both the LMS-RMD and MGti-DRGP plots classify the observations into their respective categories: cases 1, 2, 3 and 21 are classified as high leverage points and case 4 as a vertical outlier.

Next, we would like to observe the effect of removing those outliers (vertical outliers and bad leverage points) from the data set. The results of Table 2 show that those five outliers have a


very significant effect on the parameter estimates. Removing them from the data set has reduced the standard errors of all estimates.

Table 1: MD, RMD, DRGP, ti, ti* and MGti for Stackloss data (cut-off values in parentheses)

Case        MD     RMD    DRGP      ti     ti*    MGti
No.     (3.05)  (3.05)  (0.73)   (2.5)   (2.5)   (2.5)
1         2.25    5.53    1.97    1.00    5.50    4.10
2         2.32    5.64    2.05   -0.59    2.59    2.77
3         1.59    4.20    1.16    1.40    5.09    4.63
4         1.27    1.59    0.28    1.76    5.46    6.47
5         0.30    1.19    0.17   -0.53    0.06   -0.05
6         0.77    1.31    0.20   -0.93   -0.15   -0.59
7         1.85    1.72    0.32   -0.74    0.23    0.00
8         1.85    1.72    0.32   -0.43    0.81    0.99
9         1.36    1.23    0.18   -0.97   -0.23   -0.83
10        1.75    1.94    0.41    0.39    0.23    0.78
11        1.47    1.49    0.25    0.81    0.23    1.00
12        1.84    1.91    0.40    0.86   -0.15    0.61
13        1.48    1.66    0.30   -0.44   -1.52   -2.10
14        1.78    1.69    0.31   -0.02   -1.14   -1.36
15        1.69    2.23    0.59    0.73    0.06    0.57
16        1.29    1.77    0.34    0.28   -0.52   -0.55
17        2.70    2.43    0.75   -0.47   -0.15   -1.06
18        1.50    1.52    0.26   -0.14   -0.15   -0.39
19        1.59    1.71    0.32   -0.18    0.23    0.07
20        0.81    0.68    0.10    0.44    1.22    1.83
21        2.18    3.66    0.89   -2.23   -4.59   -4.31

Table 2: The Regression estimates for Stackloss Data

4.2 Aircraft Data

Our second example is the Aircraft dataset, taken from Gray (1985) [7]. This dataset has four predictor variables (aspect ratio, lift-to-drag ratio, weight of the plane and maximal thrust), the response variable is cost, and it contains 23 cases. Table 3 shows that ti fails to identify any outlier. DRGP and MGti each detect three outliers, but one of the detected cases differs between them. It can be observed from Figure 1 that the non-robust plot cannot identify any outlier and only detects cases 22 and 14 as good leverage points. The LMS-RMD plot detects one observation (case 22) as a bad high leverage point and three good leverage points (cases 15, 20 and 14), while the MGti-DRGP plot identifies bad leverage points (cases 19 and 22), one vertical outlier (case 16) and one good leverage point (case 21). Next, we would like to judge which plot has classified the observations correctly. The correct plot is the one for which deleting the identified bad leverage points and vertical outliers causes significant changes to the parameter estimates and reduces the standard errors of the estimates more than the other plots do. It can be observed from Table 4 that the standard errors of the estimates when removing observations 16, 19 and 22 (identified by MGti-DRGP) are smaller than when removing only one observation (identified by LMS-RMD).

Table 3: MD, RMD, DRGP, ti, ti* and MGti for Aircraft data (cut-off values in parentheses)

Case        MD     RMD    DRGP      ti     ti*    MGti
No.     (3.34)  (3.34)  (0.36)   (2.5)   (2.5)   (2.5)
1         1.76    1.78    0.05    0.81    0.94    0.37
2         1.53    2.53    0.05    1.14    1.14    0.79
3         1.55    1.78    0.05    1.19    1.42    1.46
4         1.57    1.50    0.05   -0.67   -0.66   -1.71
5         1.11    1.37    0.11   -0.16    0.09   -0.69
6         2.17    3.48    0.24   -0.81    0.18   -0.24
7         1.42    2.86    0.09   -0.53   -0.04   -0.83
8         1.91    2.08    0.07    0.84    0.18   -0.35
9         2.09    2.02    0.04    0.04   -0.18   -0.16
10        1.96    2.10    0.29   -0.85   -0.02    0.87
11        1.64    1.65    0.08   -0.27    0.08    1.04
12        0.64    1.26    0.13   -1.77   -1.30   -1.62
13        0.88    1.09    0.11   -0.15    0.18    0.01
14        4.29   27.73    0.12    0.01   -2.16   -0.17
15        0.78    4.21    0.04   -0.06   -0.18   -0.01
16        1.66    2.61    0.24    0.09    1.14    4.74
17        2.09    2.12    0.17   -1.59   -1.49   -0.19
18        1.19    1.56    0.11   -0.24   -0.06    1.77
19        2.29    2.03    0.41    0.56    0.18    4.35
20        1.55    8.68    0.01    1.00    0.18    0.37
21        2.42    2.13    0.45   -0.41   -0.81    2.12
22        3.42    8.35    0.50    2.09    4.87   11.73
23        1.11    1.51    0.08   -0.27   -0.71   -0.68

Table 4: The Regression estimates for Aircraft Data
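As an illustration of how the four-category rule of Section 3 acts on tabulated values, the sketch below applies it to the DRGP and MGti columns of Table 3. It is a minimal example, not the authors' code: the function name `classify` is hypothetical, and the DRGP cut-off of 0.36 is simply the value printed in the Table 3 header rather than being recomputed from the Median and MAD of the generalized potentials.

```python
# DRGP (generalized potentials) and MGt_i columns of Table 3, cases 1..23.
DRGP = [0.05, 0.05, 0.05, 0.05, 0.11, 0.24, 0.09, 0.07, 0.04, 0.29, 0.08, 0.13,
        0.11, 0.12, 0.04, 0.24, 0.17, 0.11, 0.41, 0.01, 0.45, 0.50, 0.08]
MGT = [0.37, 0.79, 1.46, -1.71, -0.69, -0.24, -0.83, -0.35, -0.16, 0.87, 1.04,
       -1.62, 0.01, -0.17, -0.01, 4.74, -0.19, 1.77, 4.35, 0.37, 2.12, 11.73,
       -0.68]


def classify(mgt_i, p_star, p_cutoff, t_cutoff=2.5):
    """Return RO, VO, GLP or BLP for one observation, following rule (19)."""
    high_leverage = p_star >= p_cutoff      # outlying in the X-space
    large_residual = abs(mgt_i) > t_cutoff  # outlying in the Y-direction
    if high_leverage:
        return "BLP" if large_residual else "GLP"
    return "VO" if large_residual else "RO"


# Cut-off 0.36 is taken from the Table 3 header (an assumption of this sketch).
labels = {
    case: classify(m, p, p_cutoff=0.36)
    for case, (m, p) in enumerate(zip(MGT, DRGP), start=1)
}
flagged = {case: lab for case, lab in labels.items() if lab != "RO"}
print(flagged)  # {16: 'VO', 19: 'BLP', 21: 'GLP', 22: 'BLP'}
```

The flagged cases agree with the classification reported in the text for the MGti-DRGP plot: cases 19 and 22 as bad leverage points, case 16 as a vertical outlier and case 21 as a good leverage point.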


Figure 1: The OLS std. residual against MD for Aircraft Data

Figure 2: The std. LMS residual against RMD for Aircraft Data

Figure 3: The Modified Generalized std. residual against DRGP for Aircraft Data

5 Conclusion

In this paper, we proposed a new method for the identification of bad leverage points by means of a diagnostic plot. The classical diagnostic plot fails to identify the bad leverage points. The robust LMS-RMD plot is also not very successful in classifying observations into the four categories. In this regard, we propose the MGti-DRGP plot, which, in the examples considered here, was successful in classifying observations into regular observations, vertical outliers, and good and bad leverage points.

References:
[1] Atkinson, A. C. Fast very robust methods for the detection of multiple outliers, Journal of the American Statistical Association, Vol. 89, 1994, pp. 1329–1339.
[2] Belsley, D. A., Kuh, E. & Welsch, R. E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, J Wiley, New York, 1980.
[3] Chatterjee, S. & Hadi, A. S. Sensitivity Analysis in Linear Regression, J Wiley, New York, 1988.
[4] Cook, R. D. Detection of influential observations in linear regression, Technometrics, Vol. 19, 1977, pp. 15–18.


[5] Cook, R. D. & Weisberg, S. Residuals and Influence in Regression, Chapman & Hall, New York, 1982.
[6] Ellenberg, J. H. Testing for a single outlier from a general regression, Biometrics, Vol. 32, 1976, pp. 637–645.
[7] Gray, G. B. Graphics for Regression Diagnostics, In American Statistical Association Proceedings of the Statistical Computing Section, Washington, D. C., ASA, 1985, pp. 102–107.
[8] Hadi, A. S. A new measure of overall potential influence in linear regression, Computational Statistics and Data Analysis, Vol. 14, 1992, pp. 1–27.
[9] Imon, A. H. M. Identifying multiple high leverage points in linear regression, Journal of Statistical Studies, Special Volume in Honour of Professor Mir Masoom Ali, 2002, pp. 207–218.
[10] Imon, A. H. M. Identifying multiple influential observations in linear regression, Journal of Applied Statistics, Vol. 32, 2005, pp. 929–946.
[11] Rousseeuw, P. J. & Leroy, A. Robust Regression and Outlier Detection, J Wiley, New York, 1987.
[12] Vinoth, B. & Rajarathinam. Outliers Detection in Simple Linear Regression Models and Robust – A case study on Wheat Production Data, Journal of Statistics, Vol. 3, 2014, pp. 531–536.
[13] Brownlee, K. A. Statistical Theory and Methodology in Science and Engineering, J Wiley, New York, 1965.
[14] Habshah, M., Norazan, M. R. & Imon, A. H. M. The performance of Diagnostic-Robust Generalized Potentials for the identification of multiple high leverage points in linear regression, Journal of Applied Statistics, Vol. 5, 2009, pp. 507–520.
[15] Narula, S. C., Saldiva, P. N., Andre, C. D. S., Elian, S. N. A. F. & Capelozzi, V. The minimum sum of absolute errors regression: a robust alternative to the least squares regression, Statistics in Medicine, Vol. 18, 1999, pp. 1401–1417.
[16] Rousseeuw, P. & van Zomeren, B. Unmasking Multivariate Outliers and Leverage Points, Journal of the American Statistical Association, Vol. 85, 1990, pp. 633–639.
