DIAGNOSTICS IN MULTIPLE REGRESSION
PubH 7405: REGRESSION ANALYSIS

The data are in the form $\{(y_i;\ x_{1i}, x_{2i}, \dots, x_{ki})\}_{i=1,\dots,n}$.

Multiple regression model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

The error terms are independently and identically distributed as normal with a constant variance.

A simple strategy for the building of a regression model consists of some, most, or all of the following five steps or phases: (1) data collection and preparation, (2) preliminary model investigation, (3) reduction of the predictor variables, (4) model refinement and selection, and (5) model validation. Suppose we have finished step (3) and tentatively have a good model; it is time for some fine tuning!

Diagnostics to identify possible violations of the model's assumptions are often focused on these "major issues": (1) non-linearity, (2) outlying & influential cases, (3) multicollinearity, (4) non-constant variance, and (5) non-independent errors.

DETECTION OF NON-LINEARITY

A limitation of the usual residual plots (say, residuals against values of a predictor variable) is that they may not show the nature of the "additional contribution" of a predictor variable beyond the contributions of the other variables already in the model. For example, consider a multiple regression model with two independent variables X1 and X2: is the relationship between Y and X1 linear?

"Added-variable plots" (also called "partial regression plots" or "adjusted variable plots") are refined residual plots that provide graphic information about the marginal importance of a predictor variable given the other variables already in the model.

ADDED-VARIABLE PLOTS

In an added-variable plot, the response variable Y and the predictor variable under investigation (say, X1) are both regressed against the other predictor variables already in the regression model, and the residuals are obtained for each. These two sets of residuals reflect the part of each (Y and X1) that is not linearly associated with the other predictor variables. The plot of one set of residuals against the other shows the marginal contribution of the candidate predictor in reducing the residual variability, as well as information about the nature of that marginal contribution.

A SIMPLE & SPECIFIC EXAMPLE

Consider the case in which we already have a regression model of Y on the predictor variable X2 and are now considering whether we should add X1 to the model (if we do, we would have a multiple regression model of Y on (X1, X2)). In order to decide, we investigate two simple linear regression models, (a) the regression of Y on X2 and (b) the regression of X1 on X2, and obtain two sets of residuals as follows.

Regression #1: Y on X2 (the first, or "old", model):

$$\hat{Y}_i = b_0 + b_2 X_{i2}, \qquad e_i(Y \mid X_2) = Y_i - \hat{Y}_i$$

These residuals represent the part of Y not explained by X2.

Regression #2: X1 on X2 (the second, or "new", model):

$$\hat{X}_{1i} = b_0^* + b_2^* X_{i2}, \qquad e_i(X_1 \mid X_2) = X_{1i} - \hat{X}_{1i}$$

These residuals represent the part of X1 not contained in X2.

We now do a regression of $e_i(Y \mid X_2)$, as the new dependent variable, on $e_i(X_1 \mid X_2)$, as the independent variable. That is, we see whether "the part of X1 not contained in X2" can further explain "the part of Y not explained by X2"; if it can, X1 should be added to the model for Y which already has X2 in it.
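To make this last step concrete, the no-intercept ("through the origin") regression of one set of residuals on the other can be written out as below; the slope expression is simply the standard least-squares slope for regression through the origin, supplied here for completeness rather than taken from the notes:

$$\hat{e}(Y \mid X_2) = b_1\, e(X_1 \mid X_2), \qquad b_1 = \frac{\sum_{i=1}^{n} e_i(X_1 \mid X_2)\, e_i(Y \mid X_2)}{\sum_{i=1}^{n} e_i(X_1 \mid X_2)^2}$$

The appearance of the resulting plot of $e_i(Y \mid X_2)$ against $e_i(X_1 \mid X_2)$ is what we examine next.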
There are three possibilities:

(a) A "horizontal band" shows that X1 contains no additional information useful for the prediction of Y beyond that contained in, and provided for by, X2.

(b) A "linear band with a non-zero slope" indicates that "the part of X1 not contained in X2" is linearly related to "the part of Y not explained by X2 alone"; that is, X1 should be added to form a two-variable model. Note that if we do a regression through the origin, the slope b1 should be the same as the coefficient of X1 if it is added to the regression model already containing X2.

AN EXAMPLE

Suppose:

$$(1)\ \hat{Y}(X_2) = 50.70 + 15.54 X_2, \qquad (2)\ \hat{X}_1(X_2) = 40.78 + 1.72 X_2, \qquad (3)\ e(Y \mid X_2) = 6.29\, e(X_1 \mid X_2)$$

Then

$$[Y - (50.70 + 15.54 X_2)] = 6.29\,[X_1 - (40.78 + 1.72 X_2)]$$

and we have the same result as in the multiple regression fit:

$$Y = [50.70 - (6.29)(40.78)] + 6.29 X_1 + [15.54 - (6.29)(1.72)] X_2 = -205.81 + 6.29 X_1 + 4.72 X_2$$

(c) A "curvilinear band" indicates that, like case (b), X1 should be added to the model already containing X2. However, it further suggests that, although the inclusion of X1 is justified, some power terms, or some type of data transformation, are needed.

The fact that an added-variable plot may suggest the nature of the functional form in which a predictor variable should be added to the regression model is more important than the question of the variable's possible inclusion (which can be resolved without graphical help). Added-variable plots play the role that scatter diagrams play in simple linear regression; they tell us whether a data transformation or a certain polynomial model is desirable.

ADDED-VARIABLE PLOT

PROC REG data = Origset;            * 2 simple linear regressions here;
  Model Y X1 = X2 / noprint;
  Output Out = SaveMe R = YResid X1Resid;
Run;
PROC PLOT data = SaveMe;
  Plot YResid*X1Resid;
Run;

Another example: suppose we have a dependent variable Y and five predictor variables X1-X5, and assume that we focus on X5, our primary predictor variable: (1) we regress Y on X1-X4 and obtain residuals (say, YR), (2) we regress X5 on X1-X4 and obtain residuals (say, X5R), and then (3) we regress YR on X5R (a simple linear regression with no intercept). The linearity assumption about X5 can be studied from this last SLR and its added-variable plot.

"MARGINAL" SL REGRESSION

PROC REG data = Origset;            * 2 multiple linear regressions here;
  Model Y X5 = X1 X2 X3 X4 / noprint;
  Output Out = SaveMe R = YR X5R;
Run;
PROC PLOT data = SaveMe;
  Plot YR*X5R;
Run;
PROC REG data = SaveMe;             * new: marginal simple linear regression;
  Model YR = X5R / noint;
Run;

REMEDIAL MEASURES

If a "linear model" is found not appropriate for the added value of a certain predictor, there are two choices: (1) add a power term (quadratic), or (2) use some transformation of the data to create a fit for the transformed data. Each has advantages and disadvantages: the first approach (quadratic model) may yield better insights but may lead to more technical difficulties; transformations (log, reciprocal, etc.) are simpler but may obscure the fundamental real relationship between Y and that predictor. One is hard "to do" and one is hard "to explain".

OUTLYING & INFLUENTIAL CASES

SEMI-STUDENTIZED RESIDUALS

$$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$$

If $\sqrt{MSE}$ were an estimate of the standard deviation of the residual $e_i$, we would call $e_i^*$ a studentized (or standardized) residual. However, the standard deviation of the residual is complicated and varies for different residuals, and $\sqrt{MSE}$ is only an approximation. Therefore, $e_i^*$ is called a "semi-studentized residual". But "exact" standard deviations can be found!
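Before turning to those exact standard deviations, here is a minimal SAS sketch for computing the semi-studentized residuals, assuming (for illustration only) a model of Y on X1 and X2 in a data set named Origset; the data set and variable names Fit, Resids, SemiStud, e, and estar are illustrative, and the OUTEST= data set from PROC REG carries the root mean squared error in the variable _RMSE_:

PROC REG data = Origset outest = Fit noprint;   * OUTEST keeps _RMSE_, the square root of MSE;
  Model Y = X1 X2;
  Output Out = Resids R = e;                    * e = raw residual;
Run;
Data SemiStud;                                  * divide each residual by root MSE;
  if _n_ = 1 then set Fit(keep = _RMSE_);       * single value, retained across rows;
  set Resids;
  estar = e / _RMSE_;                           * estar = semi-studentized residual;
Run;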
THE HAT MATRIX

$$\hat{Y} = Xb = X[(X'X)^{-1}X'Y] = [X(X'X)^{-1}X']Y = HY$$

$H = X(X'X)^{-1}X'$ is called the "hat matrix". The hat matrix can be obtained from the data.

RESIDUALS

Model: $Y = X\beta + \varepsilon$. Fitted values: $\hat{Y} = Xb$. Residuals:

$$e = Y - \hat{Y} = Y - Xb = Y - HY = (I - H)Y$$

Like the hat matrix H, (I − H) is symmetric and idempotent.

VARIANCE OF RESIDUALS

Since $e = (I - H)Y$,

$$\sigma^2(e) = (I-H)\,\sigma^2(Y)\,(I-H)' = (I-H)(\sigma^2 I)(I-H)' = \sigma^2(I-H)(I-H)' = \sigma^2(I-H)(I-H) = \sigma^2(I-H)$$

which is estimated by $s^2(e) = MSE\,(I-H)$. For the ith observation:

$$\sigma^2(e_i) = (1 - h_{ii})\sigma^2, \qquad s^2(e_i) = (1 - h_{ii})MSE$$

$$\sigma(e_i, e_j) = -h_{ij}\sigma^2, \qquad s(e_i, e_j) = -h_{ij}MSE, \qquad i \neq j$$

where $h_{ii}$ is the ith element on the main diagonal and $h_{ij}$ is the element in the ith row and jth column of the hat matrix.

STUDENTIZED RESIDUALS

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{(1 - h_{ii})MSE}}$$

The studentized residual is a refined version of the semi-studentized residual; in $r_i$ we use the exact standard deviation of the residual $e_i$, not an approximation. Semi-studentized residuals and studentized residuals are tools for detecting outliers.

DELETED RESIDUALS

The second refinement to make residuals more effective for detecting outlying or extreme observations is to measure the ith residual when the fitted regression is based on all cases except the ith case (similar to the concept of jackknifing). The reason is to avoid the influence of the ith case, especially if it is an outlying observation, on the fitted value. If the ith case is an outlying observation, its exclusion will make the "deleted residual" larger and, therefore, more likely to "confirm" its outlying status:

$$d_i = Y_i - \hat{Y}_{i(i)}$$

STUDENTIZED DELETED RESIDUALS

Combining the two refinements, the studentized residual and the deleted residual, we get the "studentized deleted residual". A studentized residual is also called an internal studentized residual and a studentized deleted residual is an external studentized residual; MSE(i) is the mean square error when the ith case is omitted in fitting the regression model:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{(1 - h_{ii})\,MSE_{(i)}}}$$

The studentized deleted residuals can be calculated without having to fit a new regression function each time a different case is omitted; that is, we need to fit the model only once, with all n cases, to obtain MSE, not the same model n times. Here p is the number of parameters, p = k + 1. Using the identity

$$(n - p)MSE = (n - p - 1)MSE_{(i)} + \frac{e_i^2}{1 - h_{ii}}$$

we obtain

$$t_i = e_i \sqrt{\frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2}}$$

which is distributed as t with (n − p − 1) degrees of freedom under the null hypothesis of no outliers.
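In practice these quantities do not need to be computed by hand. A minimal SAS sketch, again assuming (for illustration) a model of Y on X1 and X2 in a data set named Origset, uses PROC REG's OUTPUT keywords for the leverages h_ii (H=), the studentized residuals (STUDENT=), and the studentized deleted residuals (RSTUDENT=); the data set names Diag and the cutoff used for flagging are illustrative, not from the notes:

PROC REG data = Origset;
  Model Y = X1 X2 / r influence;        * prints residual and influence diagnostics;
  Output Out = Diag H = hii             /* hii = leverage, diagonal of the hat matrix */
              Student = r               /* r = internally studentized residual */
              Rstudent = t;             /* t = studentized deleted residual */
Run;
PROC PRINT data = Diag;
  Where abs(t) > 2;                     * illustrative cutoff for flagging potential outliers;
Run;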