The Identification of Good and Bad High Leverage Points in Multiple Linear Regression Model
Mathematical Methods and Systems in Science and Engineering

HABSHAH MIDI(1) and MOHAMMED A. MOHAMMED(2)
Faculty of Science and Institute for Mathematical Research
Universiti Putra Malaysia
43400 UPM Serdang, Selangor, Malaysia
[email protected]; [email protected]

Abstract: Much work has been carried out on the detection of high leverage points without paying much attention to classifying them into good and bad leverage points. It is crucial to detect bad leverage points, as they have an undue effect on the parameter estimates. In this paper we propose a new technique to identify good and bad leverage points, and we investigate its performance on some well-known data sets.

Key-Words: DRGP, MGDFFITS, Outlier, High Leverage Points, Studentized Residual

1 Introduction

There are several versions of outliers in the regression problem. Observations are judged to be residual outliers on the basis of how unsuccessful the fitted regression equation is in accommodating them: any observation with a large residual is referred to as a residual outlier. Observations that are extreme or outlying in the Y-coordinate are called Y-outliers. High leverage points are observations that are outlying in the X variables, and they can be classified into good and bad leverage points. Good and bad leverage points are outlying observations in the explanatory variables that, respectively, do and do not follow the pattern of the majority of the data. A bad leverage point has a large impact on the computed values of various estimates, whereas a good leverage point contributes to the efficiency of an estimate. In this respect, only bad leverage points need to be down-weighted; good leverage points should not be given low weight in the weighting function of a robust method. However, most robust methods attempt to reduce the effect of outliers by down-weighting them irrespective of whether they are good or bad leverage points.

There are a number of good papers in the literature on the detection of high leverage points (see [2], [8], [10]). Nonetheless, those identification methods mostly focus on detection without pinpointing which leverage points are good and which are bad. It is very important to detect and classify good and bad leverage points, as only the bad leverage points are responsible for misleading conclusions about the fitted regression model.

It is easy to reveal the several versions of outliers in regression analysis by graphical methods. If only one independent variable is considered, the four types of outliers can easily be observed from a scatter plot of y against x. For more than one independent variable, Rousseeuw and Van Zomeren (1990) [16] proposed a robust diagnostic plot, or outlier map, which is more efficient than the non-robust plot for classifying observations into four types of data points: regular (good) observations, good leverage points, bad leverage points, and vertical outliers. The identification of multiple vertical outliers and multiple high leverage points is presented in Section 2. In Section 3 we propose an improved procedure for classifying observations into the four categories. Two numerical examples are presented in Section 4 to show the merit of our proposed method.

ISBN: 978-1-61804-281-1
2 Identification of multiple Y-outliers and multiple high leverage points

Consider the multiple linear regression model

Y = Xβ + ε    (1)

where Y is an n×1 vector of observations of the dependent variable, X is an n×p matrix of independent variables, β is a p×1 vector of unknown regression parameters, ε is an n×1 vector of random errors, normally and independently distributed as ε ~ NID(0, σ²), and p is the number of independent variables. The model in (1) can be rewritten as

y_i = x_i^T β + ε_i,  i = 1, 2, ..., n    (2)

where y_i is the observed value of the dependent variable and x_i is a p×1 vector of predictors. The OLS estimator of β in (1) is given by

β̂ = (X^T X)^{-1} X^T Y    (3)

and the residuals can be expressed in terms of the true disturbances as

ε̂ = Y − Ŷ = (I − W)ε    (4)

where W = X(X^T X)^{-1} X^T is known as the hat matrix. The diagonal elements of this leverage matrix are called the hat values (see [1], [3], [8], [11]), denoted w_ii and given by

w_ii = x_i^T (X^T X)^{-1} x_i,  i = 1, 2, ..., n    (5)

The hat values are often used as diagnostics to identify leverage points [10]. Unfortunately, they suffer from the masking effect, so the w_ii often fail to detect high leverage points. Hadi (1992) [8] suggested a single-case-deleted measure called the potentials. The diagonal elements of the potentials, denoted p_ii, are given by (see [3], [8])

p_ii = x_i^T (X_(i)^T X_(i))^{-1} x_i,  i = 1, 2, ..., n    (6)

where X_(i) is the matrix X with the ith row excluded. We can rewrite p_ii as a function of w_ii:

p_ii = w_ii / (1 − w_ii),  i = 1, 2, ..., n    (7)

Unfortunately, all of these diagnostic measures are unsuccessful in identifying multiple high leverage points. To rectify this problem, Imon (2005) [10] proposed the 'generalized potentials' (GP), denoted p*_ii. The GP diagnostic is able to detect multiple high leverage points, but it is not adequately effective in identifying their exact number, owing to the choice of the initial basic subset. Habshah et al. (2009) [14] developed the Diagnostic Robust Generalized Potential (DRGP) to improve the rate of detection of high leverage points. The DRGP consists of two steps: in the first step a robust method is used to identify the suspected high leverage points, and in the second step the generalized potential diagnostic is used to confirm the suspicion. Habshah et al. noted that the low leverage points (if any) are put back into the estimation subset R sequentially (the observation with the smallest p_ii is substituted first) and the p_ii values are re-computed. This process continues until every member of the deletion set has been checked as to whether or not it can be declared a high leverage point.

The suspected high leverage points are determined by the robust Mahalanobis distance (RMD), based on the minimum volume ellipsoid (MVE) estimator developed by Rousseeuw [11]:

RMD_i = sqrt[ (X_i − T_R(X))^T C_R(X)^{-1} (X_i − T_R(X)) ],  i = 1, 2, ..., n    (8)

where T_R(X) and C_R(X) are the robust location and shape estimates of the MVE, respectively. Imon [10] suggested as cut-off value for the robust Mahalanobis distances

Median(RMD_i) + 3 MAD(RMD_i)

Let us denote the set of 'good' cases 'remaining' in the analysis by R and the set of 'bad' cases 'deleted' by D, and suppose that R contains (n − d) cases after d < (n − p) cases in D are deleted. The suspected outliers and suspected high leverage points are identified using the robust RLS residuals (Rousseeuw & Leroy, 1987 [11]) and the generalized potentials (Imon, 2002 [9]), respectively. The union of the set of suspected outliers and the set of suspected high leverage points forms the deletion set, which has d observations. Nevertheless, the initial basic subset of G_ti is not very stable and suffers from masking effects; in this regard, the DRGP is employed to rectify the problem. Once the remaining set is determined, the second step of the DRGP is carried out to confirm the suspected high leverage points by means of the generalized potentials, defined as

p*_ii = w_ii^(−D),                       for i ∈ D
p*_ii = w_ii^(−D) / (1 − w_ii^(−D)),     for i ∈ R

where w_ii^(−D) = x_i^T (X_R^T X_R)^{-1} x_i. Observations whose p*_ii values exceed the threshold

p*_ii > Median(p*_ii) + c MAD(p*_ii)

where the constant c can be taken as 2 or 3, are declared high leverage points.

The Studentized residuals (internally Studentized residuals) and the R-student residuals (externally Studentized residuals) are widely used measures for the identification of outliers (see [5]). The Studentized residual is defined as

r_i = ε̂_i / (σ̂ sqrt(1 − w_ii)),  i = 1, 2, ..., n    (9)

where σ̂ = [ε̂^T ε̂ / (n − p − 1)]^{1/2} is the square root of the residual mean square. The R-student residual replaces σ̂ with its deletion counterpart:

t_i = ε̂_i / (σ̂_(i) sqrt(1 − w_ii)),  i = 1, 2, ..., n    (10)

where σ̂_(i) is computed from the residual mean square with the ith case excluded.

Subsequently, the vector of estimated parameters from the remaining group, denoted β̂_(R), is defined as

β̂_(R) = (X_R^T X_R)^{-1} X_R^T Y_R    (11)

and the ith deletion residual is given by

ε̂_i(R) = y_i − x_i^T β̂_(R),  i = 1, 2, ..., n    (12)

The ith externally Studentized residual t_i* for the remaining group R is given by

t_i* = (y_i − x_i^T β̂_(R)) / (σ̂_R sqrt(1 − w_ii(R)))    (13)

where σ̂_R is computed from the residuals of the remaining set R, with the corresponding diagonal element of the hat matrix

w_ii(R) = x_i^T (X_R^T X_R)^{-1} x_i,  i = 1, 2, ..., n    (14)

By utilizing the results of Rao (1965), the leverage of an additional point i added back to the set R is

w_ii(R+i) = x_i^T (X_R^T X_R + x_i x_i^T)^{-1} x_i = w_ii(R) / (1 + w_ii(R))    (15)

and the corresponding updated parameter vector is

β̂_(R+i) = (X_R^T X_R + x_i x_i^T)^{-1} (X_R^T Y_R + x_i y_i)