Chapter 4: Model Adequacy Checking


In this chapter, we discuss some introductory aspects of model adequacy checking, including:

• Residual analysis,
• Residual plots,
• Detection and treatment of outliers,
• The PRESS statistic,
• Testing for lack of fit.

The major assumptions that we have made in regression analysis are:

1. The relationship between the response $Y$ and the regressors is linear, at least approximately.
2. The error term $\varepsilon$ has zero mean.
3. The error term $\varepsilon$ has constant variance $\sigma^2$.
4. The errors are uncorrelated.
5. The errors are normally distributed.

Assumptions 4 and 5 together imply that the errors are independent. Recall that assumption 5 is required for hypothesis testing and interval estimation.

Residual Analysis:

The residuals $e_1, e_2, \ldots, e_n$ have the following important properties:

(a) The mean of the $e_i$ is 0.

(b) The estimate of the population variance computed from the $n$ residuals is
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n-p} = \frac{\sum_{i=1}^{n}e_i^2}{n-p} = \frac{SS_{Res}}{n-p} = MS_{Res}.$$

(c) Since the sum of the $e_i$ is zero, they are not independent. However, if the number of residuals ($n$) is large relative to the number of parameters ($p$), the dependency effect can be ignored in an analysis of residuals.

Standardized Residual: The quantity
$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, 2, \ldots, n,$$
is called the standardized residual. The standardized residuals have mean zero and approximately unit variance. A large standardized residual ($|d_i| > 3$) potentially indicates an outlier.

Recall that
$$e = (I-H)Y = (I-H)(X\beta + \varepsilon) = (I-H)\varepsilon.$$
Therefore, since $I-H$ is symmetric and idempotent,
$$\mathrm{Var}(e) = \mathrm{Var}[(I-H)\varepsilon] = (I-H)\,\mathrm{Var}(\varepsilon)\,(I-H)' = \sigma^2(I-H).$$

Studentized Residual: The quantity
$$t_i = \frac{e_i}{\sqrt{MS_{Res}\,(1-h_{ii})}} = \frac{e_i}{\sqrt{S^2(1-h_{ii})}}, \quad i = 1, 2, \ldots, n,$$
is called the studentized residual. The studentized residuals have approximately a Student's $t$ distribution with $n-p$ degrees of freedom.

PRESS Residual: If we delete the $i$th observation, fit the regression model to the remaining $n-1$ observations, and calculate the predicted value of $y_i$ corresponding to the deleted observation, the corresponding prediction error is
$$e_{(-i)} = y_i - \hat{y}_{(-i)}.$$
These prediction errors are usually called PRESS residuals or deleted residuals. Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly. It can be shown that
$$e_{(-i)} = \frac{e_i}{1-h_{ii}}.$$
Therefore,
$$\mathrm{Var}(e_{(-i)}) = \mathrm{Var}\!\left[\frac{e_i}{1-h_{ii}}\right] = \frac{1}{(1-h_{ii})^2}\,\mathrm{Var}(e_i) = \frac{\sigma^2(1-h_{ii})}{(1-h_{ii})^2} = \frac{\sigma^2}{1-h_{ii}}.$$
Note that the standardized PRESS residual is
$$\frac{e_{(-i)}}{\sqrt{\mathrm{Var}(e_{(-i)})}} = \frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}},$$
which, if we use $MS_{Res}$ to estimate $\sigma^2$, is just the studentized residual.

R-student Residual: The quantity
$$r_i = \frac{e_i}{\sqrt{S^2_{(-i)}(1-h_{ii})}}, \quad i = 1, 2, \ldots, n,$$
is called the R-student residual or jackknife residual, where $S^2_{(-i)}$ is the residual variance computed with the $i$th observation removed. It can be shown that
$$S^2_{(-i)} = \frac{(n-p)MS_{Res} - \dfrac{e_i^2}{1-h_{ii}}}{n-p-1}.$$
If the usual assumptions in regression analysis are met, the jackknife residual follows exactly a $t$ distribution with $n-p-1$ degrees of freedom.
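To make the connections between these formulas concrete, here is a minimal Python sketch (not part of the original notes; the function name `residual_diagnostics` and variable names are my own, and numpy is assumed to be available) that computes all four residual types from a design matrix and response vector:

```python
import numpy as np

def residual_diagnostics(X, y):
    """Compute ordinary, standardized, studentized, PRESS, and R-student
    residuals for the linear model y = X*beta + eps.
    X should already contain a column of ones for the intercept."""
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X' and its diagonal h_ii
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                       # ordinary residuals e = (I - H) y
    ms_res = e @ e / (n - p)            # MS_Res = SS_Res / (n - p)
    d = e / np.sqrt(ms_res)             # standardized residuals d_i
    t = e / np.sqrt(ms_res * (1 - h))   # studentized residuals t_i
    press = e / (1 - h)                 # PRESS (deleted) residuals e_(-i)
    # Leave-one-out residual variance S^2_(-i)
    s2_i = ((n - p) * ms_res - e**2 / (1 - h)) / (n - p - 1)
    r = e / np.sqrt(s2_i * (1 - h))     # R-student (jackknife) residuals
    return e, d, t, press, r
```

Note that only `h`, `e`, and `ms_res` need to be computed once; every scaled residual is then an elementwise transformation of the ordinary residuals.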
Example 1: Consider the following data:

 y   x1  x2
16    7   5
11    3   4
12    3   6
14    4   1
10    5   2

$$y = \begin{bmatrix}16\\11\\12\\14\\10\end{bmatrix}, \quad X = \begin{bmatrix}1&7&5\\1&3&4\\1&3&6\\1&4&1\\1&5&2\end{bmatrix} \;\Rightarrow\; X'X = \begin{bmatrix}5&22&18\\22&108&79\\18&79&82\end{bmatrix},$$

$$(X'X)^{-1} = \begin{bmatrix}2.7155&-0.3967&-0.2139\\-0.3967&0.0893&0.0010\\-0.2139&0.0010&0.0582\end{bmatrix}.$$

The hat matrix is

$$H = X(X'X)^{-1}X' = \begin{bmatrix}0.9252&-0.0935&0.0748&-0.1121&0.2056\\-0.0935&0.3832&0.4268&0.1931&0.0903\\0.0748&0.4268&0.7030&-0.1101&-0.0945\\-0.1121&0.1931&-0.1101&0.6096&0.4195\\0.2056&0.0903&-0.0945&0.4195&0.3790\end{bmatrix},$$

so $h_{11}=0.9252$, $h_{22}=0.3832$, $h_{33}=0.7030$, $h_{44}=0.6096$, $h_{55}=0.3790$.

The residuals are

$$e = (I-H)y = \begin{bmatrix}0.0748&0.0935&-0.0748&0.1121&-0.2056\\0.0935&0.6168&-0.4268&-0.1931&-0.0903\\-0.0748&-0.4268&0.2970&0.1101&0.0945\\0.1121&-0.1931&0.1101&0.3904&-0.4195\\-0.2056&-0.0903&0.0945&-0.4195&0.6210\end{bmatrix}\begin{bmatrix}16\\11\\12\\14\\10\end{bmatrix} = \begin{bmatrix}0.84\\-0.45\\0.16\\2.26\\-2.81\end{bmatrix},$$

and

$$MS_{Res} = \frac{e'e}{n-p} = \frac{13.9374}{2} = 6.97.$$

Standardized residuals:

$$\begin{bmatrix}d_1\\d_2\\d_3\\d_4\\d_5\end{bmatrix} = \frac{e}{\sqrt{MS_{Res}}} = \frac{1}{\sqrt{6.97}}\begin{bmatrix}0.84\\-0.45\\0.16\\2.26\\-2.81\end{bmatrix} = \begin{bmatrix}0.32\\-0.17\\0.06\\0.86\\-1.06\end{bmatrix}.$$

Studentized residuals, $t_i = e_i/\sqrt{MS_{Res}(1-h_{ii})}$:

$$\begin{bmatrix}t_1\\t_2\\t_3\\t_4\\t_5\end{bmatrix} = \begin{bmatrix}0.84/\sqrt{6.97(1-0.9252)}\\-0.45/\sqrt{6.97(1-0.3832)}\\0.16/\sqrt{6.97(1-0.7030)}\\2.26/\sqrt{6.97(1-0.6096)}\\-2.81/\sqrt{6.97(1-0.3790)}\end{bmatrix} = \begin{bmatrix}1.16\\-0.22\\0.11\\1.37\\-1.35\end{bmatrix}.$$

Leave-one-out residual variances, $S^2_{(-i)} = \dfrac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}$:

$$S^2_{(-1)} = \frac{(5-3)6.97 - \dfrac{0.84^2}{1-0.9252}}{5-3-1} = 4.5, \qquad S^2_{(-2)} = \frac{(5-3)6.97 - \dfrac{(-0.45)^2}{1-0.3832}}{5-3-1} = 13.6,$$

$$S^2_{(-3)} = \frac{(5-3)6.97 - \dfrac{0.16^2}{1-0.7030}}{5-3-1} = 13.9, \qquad S^2_{(-4)} = \frac{(5-3)6.97 - \dfrac{2.26^2}{1-0.6096}}{5-3-1} = 0.86,$$

$$S^2_{(-5)} = \frac{(5-3)6.97 - \dfrac{(-2.81)^2}{1-0.3790}}{5-3-1} = 1.22.$$

R-student residuals, $r_{(-i)} = e_i/\sqrt{S^2_{(-i)}(1-h_{ii})}$:

$$\begin{bmatrix}r_{(-1)}\\r_{(-2)}\\r_{(-3)}\\r_{(-4)}\\r_{(-5)}\end{bmatrix} = \begin{bmatrix}0.84/\sqrt{4.5(1-0.9252)}\\-0.45/\sqrt{13.6(1-0.3832)}\\0.16/\sqrt{13.9(1-0.7030)}\\2.26/\sqrt{0.86(1-0.6096)}\\-2.81/\sqrt{1.22(1-0.3790)}\end{bmatrix} = \begin{bmatrix}1.45\\-0.15\\0.08\\3.90\\-3.23\end{bmatrix}.$$

SAS Output: Residuals, Studentized Residuals and R-student Residuals

Obs   Residuals    student     Rstudent
 1     0.84112     1.16423     1.45010
 2    -0.44860    -0.21618    -0.15468
 3     0.15888     0.11034     0.07826
 4     2.26168     1.36988     3.89917
 5    -2.81308    -1.35107    -3.23320

[Figure: scatterplot of x2 versus x1]
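For readers who want to check the hand computations, here is a minimal sketch (mine, not from the original notes) that reproduces the table above numerically, reusing the `residual_diagnostics` helper sketched earlier; the printed values match the SAS output up to rounding:

```python
import numpy as np

# Design matrix with an intercept column, and the response vector
X = np.array([[1, 7, 5],
              [1, 3, 4],
              [1, 3, 6],
              [1, 4, 1],
              [1, 5, 2]], dtype=float)
y = np.array([16, 11, 12, 14, 10], dtype=float)

e, d, t, press, r = residual_diagnostics(X, y)
print(np.round(e, 2))   # approx [ 0.84 -0.45  0.16  2.26 -2.81]
print(np.round(t, 2))   # approx [ 1.16 -0.22  0.11  1.37 -1.35]
print(np.round(r, 2))   # approx [ 1.45 -0.15  0.08  3.90 -3.23]
```

Observations 4 and 5 illustrate the value of R-student: their ordinary and studentized residuals look unremarkable, but deleting either point changes the fit enough that $|r_{(-i)}| > 3$.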
Graphical Analysis of Residuals:

(a) Normal probability plot: If the normality assumption is not badly violated, the conclusions reached by a regression analysis in which normality is assumed will generally be reliable and accurate. A very simple method of checking the normality assumption is to construct a normal probability plot of the residuals. Let $e_{(1)}, e_{(2)}, \ldots, e_{(n)}$ be the residuals ranked in increasing order. Note that

$$E(e_{(i)}) \approx \Phi^{-1}\!\left(\frac{i - \frac{1}{2}}{n}\right),$$

where $\Phi$ denotes the standard normal cumulative distribution function. Normal probability plots are constructed by plotting the ranked residuals $e_{(i)}$ against the expected normal values $\Phi^{-1}\!\left[(i - \frac{1}{2})/n\right]$. The resulting points should lie approximately on a straight line. Substantial departures from a straight line indicate that the distribution is not normal. If normality is deemed unsatisfactory, the $Y$ values may be transformed (using a log, square root, etc.) to see whether the new set of observations is approximately normal.

(b) Plot of residuals versus the fitted values: A plot of the residuals $e_i$ (or the scaled residuals $d_i$, $t_i$, or $r_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies. If the plot of residuals versus the fitted values can be contained in a horizontal band, then there are no obvious model defects. An outward-opening funnel pattern implies that the variance of $\varepsilon$ is an increasing function of $Y$. An inward-opening funnel indicates that the variance of $\varepsilon$ decreases as $Y$ increases. The double-bow pattern often occurs when $Y$ is a proportion between zero and one. The usual approach for dealing with inequality of variance is to apply a suitable transformation to either the regressor or the response variable. A curved plot indicates nonlinearity; this could mean that other regressor variables are needed in the model (for example, a squared term may be necessary). Transformations on the regressor and/or the response variable may be helpful in these cases. A plot of residuals versus the predicted values may also reveal one or more unusually large residuals.
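As a sketch of how the normal probability plot coordinates might be computed (the function name is mine, not from the notes; numpy and scipy are assumed to be available):

```python
import numpy as np
from scipy.stats import norm

def normal_prob_plot_points(e):
    """Return (expected normal value, ranked residual) pairs for a
    normal probability plot of the residuals e."""
    n = len(e)
    e_ranked = np.sort(e)  # e_(1) <= e_(2) <= ... <= e_(n)
    # Expected normal values Phi^{-1}((i - 1/2) / n), i = 1, ..., n
    expected = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return expected, e_ranked
```

Plotting `e_ranked` against `expected` (e.g., with matplotlib) should yield points that fall roughly on a straight line when the normality assumption holds.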