Chapter 6 Diagnostic for Leverage and Influence

Chapter 6 Diagnostic for Leverage and Influence The location of observations in x -space can play an important role in determining the regression coefficients. Consider a situation like in the following yi A Xi The point A in this figure is remote in x space from the rest of the sample but it lies almost on the regression line passing through the rest of the sample points. This is a leverage point. It is an unusual x -value and may control certain model properties. - This point does not affect the estimates of the regression coefficients. - It affects the model summary statistics e.g., R2 , standard errors of regression coefficients etc. Now consider the point B following figure: yi B Xi This point has a moderately unusual x -coordinate and the y -value is also unusual. This is an influence point - It has a noticeable impact on the model coefficients. - It pulls the regression model in its direction. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 1 Sometimes a small subset of data exerts a disproportionate influence on the model coefficients and properties. In an extreme case, the parameter estimates may depend more on the influential subset of points than on the majority of the data. This is an undesirable situation. A regression model has to be a representative of all the sample observations and not only of a few. So we would like to find these influential points and asses their impact on the model. - If these influential points are “bad” values, they should be eliminated from the sample. - If nothing is wrong with these points, but if they control the model properties, then it is to be found that how do they affect the regression model in use. Leverage The location of points in x -space affects the model properties like parameter estimates, standard errors, predicted values, summary statistics etc. The hat matrix H XXX(')1 X ' plays an important role in identifying influential observations. Since Vy()ˆ 2 H Ve() 2 ( I H ), (yˆ is fitted value and e is residual) the elements hii of H may be interpreted as the amount of leverage th th excreted by the i observation yi on the i fitted value yî . The ith diagonal element of H is '1 hxXXxii i(') i ' th where xi is the i row of X -matrix. The hat matrix diagonal is a standardized measure of the distance of ith an observation from the centre (or centroid) of the x space. Thus large hat diagonals reveal observations that are potentially influential because they are remote in x -space from the rest of the sample. Average size of hat diagonal ()h is h rank(H ) h ii nn rank(X ) n trH n k n Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 2 2k If hh2 the point is remote enough from rest of the data to be considered as a leverage ii n point. 2k Care is needed in using cutoff value and magnitudes of k and n are to be assessed. There can be n 2k situations where 1 and then this cut off does not apply. n All leverage points are not influential on the regression coefficients. In the following figure yi A Xi the point A - will have a large hat diagonal and is surely a leverage point. - have no effect of the regression coefficients as it lies on the same line passing through the remaining observations. Hat diagonal examine only the location of observations in x -space, so we can look at the studentized residual or R -student in conjunction with the hii . Observation with - large hat diagonal and - large residuals are likely to be influential. Measures of influence (1) Cook’s D-statistics: If data set is small, then the deletion of values greatly affects the fit and statistical conclusions. In measuring influence, it is desirable to consider both - the location of point is x -space and - the response variable. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 3 The Cook’s distance statistics denoted as, Cook’s D-statistic is a measure of the distance between the least- squares estimate based on all n observations in b and the estimate obtained by deleting the ith point, say b()i . It is given by ()'()bbMbb DMC( , )()ii () ; i 1,2,..., n . i C The usual choice of M and C are M XX' CkMS rse So ()''()bbXXbb DXXkMS(', )()ii () ; i 1,2,..., n irse kMS rse ()'()yyˆˆyy ˆˆ ()ii () kMSrse where yXbˆ yXbˆ()ii () bXXX (')1 '.y Points with large Di the points have considerable influence of OLSE b . Since ()''()/bbXXbbk()ii () SSrse /( n k ) looks like a statistic having a Fkn(, k ) distribution. Note that this statistics is not having a Fkn(, k ) distribution. So the magnitude of Di is assessed by comparing it with Fknk (, ). If DFknki 0.5 (, ), then deleting ˆ point i would move ()i to the boundary of an approximate 50% confidence region for based on the complete data set. This displacement is large and indicates that the OLSE is sensitive to the ith data point. Since Fknk0.5 (, ) 1, we usually consider that points for which Di 1 to be influential. Ideally, each b()i is expected to stay within the boundary of a 10-20% confidence region. Di is not an F -statistic but cut off of 1 work very well in practice. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 4 Since ()'()yyyyˆˆˆˆ()ii () Di , kMSrse so Di can be interpreted as a squared Euclidian distance (apart from kMSrse ) that the vector of fitted values moves when the ith observation is deleted. Since 1 (')XX xeii bbii() , 1 hii the Di can be written as ()''()bbii() XXbb ii () Di kMSrse '1 12 xiii(')(')(')XX XX XX xe 2 (1 hkMSii ) re s 2 eh iii 1 hkMSii re s rh2 iii kh1 ii where ri is studentized residual. th Di : product of squared i studentized residual and hhii/(1 ii ). th : Reflects how well the model fits the i observation yi and a component that measures how far that point is from the rest of the data. : Either component or both may contribute to a large value of Di . th : Thus Di combines residual magnitude for i observation and location of points in x - space to assess influence. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 5 (2) DFFITS and DFBETAS: Cook’s distance measure is a deletion diagnostic, i.e., it measures the influence of ith observation if it is removed from the sample. There are two more statistics: (i) DFBETAS which indicates how much the regression coefficient changes if the ith observation were deleted. Such change is measured in terms of standard deviation units. This statistic is bb DFBETAS j ji() ji, 2 SC()ijj th 1 where C jj is the j diagonal element of (')X X and bj()i regression coefficient computed without the use of ith observation. th Large (in magnitude) value of DFBETAS j,i , indicates that i observation has considerable influence on the jth regression coefficient. (i) The values of DFBETAS j,i can be expressed in a nk matrix that conveys similar information to the composite influence information in Cook’s distance measure. (ii) The n elements in the jth row of R produce the leverage that the n observations in the sample ˆ th have on j . DFBETAS j,i is the j element of ()bb ()i divided by a standardization factor 1' (')X Xxeii bbii() . 1 hii th The j element of ()bbii () can be expressed as re bbjjj,ii. ii() 1 hii Let rRji, (( )) denotes the elements of R , so 11 (RR ')'(') XX XXXX '(') ' 1 (')XX C RR'. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 6 Since ' Crrjjjj So 22' SC()ijj Srr () ijj bbjj DFBETAS ii() ji, 2 SC()ijj reji, i 1 1 h 2' ii Srr()ijj r t ji, i ' 1 h rrjj ii Measures Measures leverage effect (impact of ith of observation large on bi residuals 2 where t is the ith R -student residual. Now if DFBETAS , then ith observation warrants i ji, n examination. 2. DFFITS: The deletion influence of ith observation on the predicted or fitted value can be investigated by using diagnostic by Belsley, Kuh and Welsch as yyˆˆ DFFITSii(),1,2,..., i n i 2 Sh()iii th where yˆ()i is the fitted value of yi obtained without the use of the i observation. The denominator is just a 2 standardization, since Var()yîii h . th This DFFITSi is the number of standard deviations that the fitted value yî changes of i observation is removed. Regression Analysis | Chapter 6 | Diagnostic for Leverage and Influence | Shalabh, IIT Kanpur 7 Computationally, heii i DFFITSi 1 hii Sh()iii1 hii ti 1 hii Ristudent leverage ofth observation where ti is R -student. If the data point is an outlier, then R -student will be large is magnitude. If the data point has high leverage, then hii will be close to unity. In either of these cases, DFFITSi can be large. If hii 0, then the effect of R -student will be moderated. If R -student is near to zero, then combined with high leverage point, then DFFITSi can be a small value. Thus DFFITSi is affected by both leverage and prediction error. Belsley, Kuh and Welsch suggest that any observation for which k DFFITS 2 i n warrants attention.

Chapter 6 Diagnostic for Leverage and Influence

LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis

Remedies for Assumption Violations and Multicollinearity Outliers • If the Outlier Is Due to a Data Entry Error, Just Correct the Value

Leveraging for Big Data Regression

Elemental Estimates, Influence, and Algorithmic Leveraging

Leverage (Statistics)

A Statistical Perspective on Algorithmic Leveraging

Chapter 10 NOTES Model Building – II: Diagnostics

Calculating Standard Errors for Linear Regression in Randomized Evaluations

Tests for Leverage and Influential Points

Outliers, Leverage, and Influence

Unusual and Influential Data

Outliers, Durbin-Watson and Interactions for Regression in SPSS Dependent Variable: Continuous (Scale/Interval/Ratio) Independent Variables: Continuous/ Binary