Chapter 1: The Purpose of This Dissertation
ABSTRACT

Title of Dissertation: REGRESSION DIAGNOSTICS FOR COMPLEX SURVEY DATA: IDENTIFICATION OF INFLUENTIAL OBSERVATIONS

Jianzhu Li, Doctor of Philosophy, 2007

Dissertation Directed By: Professor Richard Valliant, Joint Program in Survey Methodology

Discussions of diagnostics for linear regression models have become indispensable chapters or sections in most statistical textbooks. However, the survey literature has not given much attention to this problem. Examples from real surveys show that the inclusion or exclusion of a small number of sampled units can sometimes greatly change the regression parameter estimates, which indicates that techniques for identifying influential units are necessary. The goal of this research is to extend and adapt the conventional ordinary least squares influence diagnostics to complex survey data, and to determine how they should be justified. We assume that an analyst is looking for a linear regression model that fits reasonably well for the bulk of the finite population and chooses to use the survey-weighted regression estimator. Diagnostic statistics such as DFBETAS, DFFITS, and modified Cook's Distance are constructed to evaluate the effect on the regression coefficients of deleting a single observation. As components of the diagnostic statistics, the estimated variances of the coefficients are obtained from design-consistent estimators that account for complex design features, e.g., clustering and stratification. For survey data, sample weights, which are computed with the primary goal of estimating finite population statistics, are sources of influence in addition to the response variable and the predictor variables, and therefore need to be incorporated into influence measurement. The forward search method is also adapted to identify influential observations as a group when there is a possible masking effect among the outlying observations.
Two case studies and simulations are carried out in this dissertation to test the performance of the adapted diagnostic statistics. We conclude that removing the identified influential observations from the model fitting can yield less biased coefficient estimates. The standard errors of the coefficients may be underestimated, since the variation in the number of observations used in the regressions was not accounted for.

REGRESSION DIAGNOSTICS FOR COMPLEX SURVEY DATA: IDENTIFICATION OF INFLUENTIAL OBSERVATIONS

By Jianzhu Li

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2007

Advisory Committee:
Dr. Richard Valliant, Chair
Dr. Barry Graubard
Dr. Partha Lahiri
Dr. Stephen Miller
Dr. Paul Smith

© Copyright by Jianzhu Li 2007

Dedication

This dissertation is dedicated to my parents, Ruoxin Li and Guiqin Liang, and to my daughter, Yifan Li.

Acknowledgements

I would like to express my sincerest gratitude to my advisor, Professor Richard Valliant, for his guidance and support throughout my education and my research at the University of Maryland. This dissertation would not have been possible without his help.

I would like to thank Dr. Barry Graubard, Dr. Partha Lahiri, Dr. Stephen Miller, and Dr. Paul Smith for agreeing to be members of my dissertation committee and for helping me to clarify my research problem.

I wish to thank Professor Roger Tourangeau for providing me with support and encouragement to strive for this degree. I would also like to thank Rupa Jethwa and Adam Kelley for their administrative and technical support, and Jill Dever and other fellow members of the JPSM PhD cohort for their encouragement and help.

This dissertation is based upon work supported by the National Science Foundation under Grant No. 0617081.
Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 Literature Review
    1.2 Uses of Survey Data
    1.3 The Subject of This Dissertation
Chapter 2: Linear Regression Analysis
    2.1 Traditional Linear Regression Model
    2.2 Linear Regression for Complex Survey Data
Chapter 3: Identification of Single Influential Observations
    3.1 Introduction
    3.2 Basic Idea in Influence Assessment
    3.3 Sources of Influence in Survey Data
    3.4 Review of Traditional Techniques
        3.4.1 Leverages and Residuals
        3.4.2 Influence on Regression Coefficients: DFBETA and DFBETAS
        3.4.3 Influence on Fitted Values: DFFIT and DFFITS
        3.4.4 Cook's Distance
    3.5 Variance Estimation Methods for Complex Survey Data
        3.5.1 Asymptotic Framework
        3.5.2 Variance Estimation for Single-Stage Sampling With Replacement
        3.5.3 Variance Estimation for Multistage Sampling Design
    3.6 Adaptations of Traditional Techniques to Regression on Complex Survey Data
        3.6.1 Residuals and Leverages
        3.6.2 DFBETAS
        3.6.3 DFFITS
        3.6.4 Distance Measure (Extended and Modified Cook's Distance)
        3.6.5 Discussion
Chapter 4: Identification of Influential Groups of Observations
    4.1 Multiple-Case Deletion
    4.2 Deletion of Specific Characteristic Groups
    4.3 Forward Search
        4.3.1 Introduction
Chapter 5: Application of Diagnostic Techniques for Influence Analysis
    5.1 Introduction
    5.2 Identifying Single Influential Observations: Case Study 1
        5.2.1 Summary of SMHO Data Set
        5.2.2 Parameter Estimation
        5.2.3 Diagnostics by Leverages and Residuals
        5.2.4 Diagnostics by DFBETAS
        5.2.5 Diagnostics by DFFITS and Modified Cook's Distance
        5.2.6 Discussion
    5.3 Identifying Single Influential Observations: Case Study 2
        5.3.1 Summary of NHANES Data Set
        5.3.2 Diagnostic Results
    5.4 Simulation