Statistical Modeling Through Regression Part 1. Multiple Regression
Total Page:16
File Type:pdf, Size:1020Kb
ACADEMIC YEAR 2018-19 DATA ANALYSIS Block 4: Statistical Modeling through Regression OF TRANSPORT AND LOGISTICS – Part 1. Multiple Regression MSCTL - UPC Lecturer: Lídia Montero September 2018 – Version 1.1 MASTER’S DEGREE IN SUPPLY CHAIN, TRANSPORT AND LOGISTICS DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC CONTENTS 4.1-1. READINGS____________________________________________________________________________________________________________ 3 4.1-2. INTRODUCTION TO LINEAR MODELS _________________________________________________________________________________ 4 4.1-3. LEAST SQUARES ESTIMATION IN MULTIPLE REGRESSION ____________________________________________________________ 12 4.1-3.1 GEOMETRIC PROPERTIES _______________________________________________________________________________________________ 14 4.1-4. LEAST SQUARES ESTIMATION: INFERENCE __________________________________________________________________________ 16 4.1-4.1 BASIC INFERENCE PROPERTIES __________________________________________________________________________________________ 16 4.1-5. HYPOTHESIS TESTS IN MULTIPLE REGRESSION ______________________________________________________________________ 19 4.1-5.1 TESTING IN R ________________________________________________________________________________________________________ 21 4.1-5.2 CONFIDENCE INTERVAL FOR MODEL PARAMETERS __________________________________________________________________________ 22 4.1-6. MULTIPLE CORRELATION COEFFICIENT_____________________________________________________________________________ 23 4.1-6.1 PROPERTIES OF THE MULTIPLE CORRELATION COEFFICIENT ________________________________________________________________ 25 4.1-6.2 R2-ADJUSTED ________________________________________________________________________________________________________ 25 4.1-7. GLOBAL TEST FOR REGRESSION. ANOVA TABLE _____________________________________________________________________ 26 4.1-8. PREDICTIONS AND INFERENCE ______________________________________________________________________________________ 27 4.1-9. MODEL VALIDATION ________________________________________________________________________________________________ 28 4.1-10. MODEL TRANSFORMATIONS ON Y OR X ______________________________________________________________________________ 37 4.1-10.1 BOX-COX TRANSFORMATION FOR Y_____________________________________________________________________________________ 38 4.1-11. MODEL VALIDATION: UNUSUAL AND INFLUENTIAL DATA ____________________________________________________________ 40 4.1-11.1 A PRIORI INFLUENTIAL OBSERVATIONS __________________________________________________________________________________ 44 4.1-11.2 A POSTERIORI INFLUENTIAL DATA ______________________________________________________________________________________ 46 4.1-12. BEST MODEL SELECTION ____________________________________________________________________________________________ 50 4.1-12.1 STEPWISE REGRESSION _______________________________________________________________________________________________ 52 Prof. Lídia Montero © Page. 4.1-2 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-1. READINGS Basic References: S.P. Washington, M.G. Karlaftis and F.L. Mannering (2011) Statistical and Econometric methods for transportation data analysis, 2nd Edition, Chapman and Hall/CRC.ISBN: 978-1-4200-8285-2. P.Dalgaard (2008) Introductory Statistics with R. Springer. Fox, J.: Applied Regression Analysis and Generalized Linear Models. Sage Publications, 2015. John Fox and Sanford Weisberg (2011) An R Companion to Applied Regression Second Edition SAGE Publications, Inc ISBN: 9781412975148. Prof. Lídia Montero © Page. 4.1-3 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2. INTRODUCTION TO LINEAR MODELS T Let y = (y1 ,, yn ) be a vector of n observations considered a draw of the random vector T Y = (Y1 ,,Yn ) , whose variables are statistically independent and distributed with expectation T µ = (µ1,,µn ) : T In linear models, the Random Component distribution for Y = (Y1 ,,Yn ) is assumed normal with 2 constant variance σ and expectation Ε[Y] = µ . So, the response variable is modeled as normal distributed, thus negative or positive values, arbitrary small or large can be encountered as data for the response and prediction. The Systematic component of the model consists on specifying a vector called the linear predictor, notated η , of the same length as the response, dimension n, obtained from the linear combination T of regressors (explicative variables) . In vector notation, parameters are β = (β1 ,, β p ) and regressors are X = (X 1,, X p ) and thus η = X β where η is nx1, X is nxp and β is px1. η Vector µ is directly the linear predictor , thus the link function is η = µ . Prof. Lídia Montero © Page. 4.1-4 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 INTRODUCTION TO LINEAR MODELS Empirical problem: What do data say about class sizes and test scores according in The California Test Score Data Set All K-6 and K-8 California school districts (n = 420) Variables: • 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average • Student-teacher ratio (STR) = no. of students in the district divided by no. full-time equivalent teachers Policy question: What is the effect of reducing class size by one student per class? by 8 students/class? Do districts with smaller classes (lower STR) have higher test scores? An initial look at the California test score data: Prof. Lídia Montero © Page. 4.1-5 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 INTRODUCTION TO LINEAR MODELS Stock and Watson (2007) Prof. Lídia Montero © Page. 4.1-6 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 SOME NOTATION AND TERMINOLOGY • The population regression line is Test Score = β1 + β2STR β1 is the intercept β2 is the slope of population regression line ∆Test score = = change in test score for a unit change in STR ∆STR • Why are β1 and β2 “population” parameters? • We would like to know the population value of β2. • We don’t know β2, so must estimate it using data. • How can we estimate β1 and β2 from data? We will focus on the least squares (“ordinary least squares” or “OLS”) estimator of the unknown parameters β1 and β2, which solves, 2 Min S(β ) = (Y− Xβ)T ⋅(Y− Xβ) = (Y − β − β x ) β=(β1 ,β2 ) ∑ k 1 2 k k Prof. Lídia Montero © Page. 4.1-7 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 SOME NOTATION AND TERMINOLOGY: EXAMPLE The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (predicted value) based on the estimated line. Application to the California Test Score – Class Size data Estimated slope = = – 2.28 Estimated intercept = = 698.9 Estimated regression line: = 698.9 – 2.28 STR Interpretation of the estimated slope and intercept: • Districts with one more student per teacher on average have test scores that are 2.28 points lower. • The intercept (taken literally) means that, according to this estimated line, districts with zero students per teacher would have a (predicted) test score of 698.9. It makes no sense, since it extrapolates outside the range of the data. Prof. Lídia Montero © Page. 4.1-8 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 SOME NOTATION AND TERMINOLOGY: EXAMPLE Predicted values & residuals: One of the districts in the data set for which STR = 25 and Test Score = 621 predicted value: = 698.9 – 2.28*25 = 641.9 residual: = 621 – 641.9 = -20.9 The OLS regression line is an estimate, computed using our sample of data; a different sample would have given a ˆ different value of β 2 . yˆk yκ Prof. Lídia Montero © Page. 4.1-9 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 SOME NOTATION AND TERMINOLOGY How can we: ˆ quantify the sampling uncertainty associated with β 2 ? ˆ use β 2 to test hypotheses such as β 2 = 0? ˆ construct a confidence interval for β 2 ? We are going to proceed in four steps: • The probability framework for linear regression • Estimation • Hypothesis Testing • Confidence intervals A vector-matrix notation for regression elements will be considered since it simplifies the mathematical framework when dealing with several explicative variables (regressors) . Prof. Lídia Montero © Page. 4.1-10 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-2 INTRODUCTION TO LINEAR MODELS Classification of statistical tools for analysis and modeling Explicative Response Variable Variables Dicothomic or Polythomic Counts Continuous Binary (discrete) Normal Time between events Dicothomic Contingency tables Contingency tables Log-linear Tests for 2 Survival Logistic regression Log-linear models models subpopulation Analysis Log-linear models means: t.test Polythomic Contingency tables Contingency tables Log-linear ONEWAY, Survival Logistic regression Log-linear models models ANOVA Analysis Log-linear models Continuous Logistic regression * Log-linear Multiple Survival (covariates) models regression Analysis Factors and Logistic regression * Log-linear Covariance Survival covariates models Analysis Analysis Random Mixed models Mixed models Mixed Mixed models Mixed models Effects models Prof. Lídia Montero © Page. 4.1-11 Course 2.018-2.019 DATA ANALYSIS OF TRANSPORT AND LOGISTICS – MSCTL - UPC 4.1-3. LEAST SQUARES ESTIMATION IN MULTIPLE REGRESSION Assume a linear