Linear Regression
Blythe Durbin-Johnson, Ph.D.
April 2017

We are video recording this seminar, so please hold questions until the end. Thanks!

When to Use Linear Regression
• Continuous outcome variable
• Continuous or categorical predictors
*Need at least one continuous predictor for the name "regression" to apply

When NOT to Use Linear Regression
• Binary outcomes
• Count outcomes
• Unordered categorical outcomes
• Ordered categorical outcomes with few (<7) levels
Generalized linear models and other special methods exist for these settings.

Some Interchangeable Terms
• Outcome = Response = Dependent Variable
• Predictor = Covariate = Independent Variable

Simple Linear Regression
• Model the outcome Y by one continuous predictor X:
  Y = β₀ + β₁X + ε
• ε is a normally distributed (Gaussian) error term

Model Assumptions
• Normally distributed residuals ε
• Error variance is the same for all observations
• Y is linearly related to X
• Y observations are not correlated with each other
• X is treated as fixed; no distributional assumptions are made about it
• Covariates do not need to be normally distributed!

A Simple Linear Regression Example
Data from Lewis and Taylor (1967) via http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_reg_examples03.htm
Goal: Find the straight line that minimizes the sum of squared distances from the actual weights to the fitted line ("least squares fit").

A Simple Linear Regression Example—SAS Code
proc reg data = Children;
  model Weight = Height;
run;
Children is a SAS dataset including the variables Weight and Height.

Simple Linear Regression Example—SAS Output

Parameter Estimates
                     Parameter    Standard
Variable       DF     Estimate       Error    t Value    Pr > |t|
Intercept       1   -143.02692    32.27459      -4.43      0.0004
Height          1      3.89903     0.51609       7.55      <.0001

• Standard Error: the S.E. of the slope and intercept estimates
• t Value: the parameter estimate divided by its S.E.
• Pr > |t|: the p-value
• Intercept: the estimated weight for a child of height 0 (not always interpretable...)
• Slope: how much weight increases for a 1-inch increase in height
Fitted model: Weight = -143.0 + 3.9*Height. Weight increases significantly with height.
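For reference, the "least squares fit" has a closed-form solution in the one-predictor case; this is standard theory, not anything specific to PROC REG:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

These are exactly the estimates reported in the Parameter Estimates table above.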
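As a quick sanity check on the table, each t value is the estimate over its standard error, e.g. 3.89903 / 0.51609 ≈ 7.55 for Height. And the fitted line can be used for prediction: a height of 62 inches (an arbitrary value chosen for illustration, not from the slides) gives a predicted weight of −143.02692 + 3.89903 × 62 ≈ 98.7.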
Simple Linear Regression Example—SAS Output

Analysis of Variance
                            Sum of         Mean
Source             DF      Squares       Square    F Value    Pr > F
Model               1   7193.24912   7193.24912      57.08    <.0001
Error              17   2142.48772    126.02869
Corrected Total    18   9335.73684

• Model sum of squares: sum of squared differences between the model fit and the mean of Y
• Error sum of squares: sum of squared differences between the model fit and the observed values of Y
• Corrected total sum of squares: sum of squared differences between the mean of Y and the observed values of Y
• Mean Square: sum of squares/df
• F Value: Mean Square(Model)/MSE = 7193.25/126.03 ≈ 57.08
Regression on X provides a significantly better fit to Y than the null (intercept-only) model.

Simple Linear Regression Example—SAS Output
Root MSE          11.22625    R-Square    0.7705
Dependent Mean   100.02632    Adj R-Sq    0.7570
Coeff Var         11.22330

• R-Square: percent of the variance of Y explained by the regression; here SS(Model)/SS(Corrected Total) = 7193.25/9335.74 ≈ 0.7705
• Dependent Mean: mean of Y
• Coeff Var: 100 × Root MSE/mean of Y
• Adj R-Sq: version of R-square adjusted for the number of predictors in the model

Thoughts on R-Squared
• For our model, R-square is 0.7705
  • 77% of the variability in weight is explained by height
• Not a measure of goodness of fit of the model:
  • If the error variance is high, R-square will be low even with the "right" model
  • R-square can be high with the "wrong" model (e.g., when Y isn't linear in X)
  • See http://data.library.virginia.edu/is-r-squared-useless/
• R-square always gets higher when you add more predictors
  • Adjusted R-square is intended to correct for this: Adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − 0.2295 × 18/17 ≈ 0.757 here
• Take with a grain of salt

Simple Linear Regression Example—SAS Output
[Figure: Fit Diagnostics for Weight, a panel of diagnostic plots: residuals vs. predicted value, studentized residuals vs. predicted value, studentized residuals vs. leverage, residual normal Q-Q plot, Weight vs. predicted value, Cook's D vs. observation number, residual-fit spread plot, residual histogram, and a summary box (Observations 19, Parameters 2, Error DF 17, MSE 126.03, R-Square 0.7705, Adj R-Square 0.757)]

Residuals vs. predicted values:
• Residuals should form an even band around 0
• The size of the residuals shouldn't change with the predicted value
• The sign of the residuals shouldn't change with the predicted value
[Example residual plot: suggests Y and X have a nonlinear relationship]
[Example residual plot: suggests a data transformation]

Normal Q-Q plot:
• Plot of model residuals versus quantiles of a normal distribution
• Deviations from the diagonal line suggest departures from normality
[Example normal Q-Q plot: suggests a data transformation may be needed]

Other diagnostic panels:
• Studentized (scaled) residuals by predicted values: the cutoff for an outlier depends on n; use 3.5 for n = 19 with 1 predictor
• Studentized residuals by leverage: leverage > 2(p + 1)/n (= 0.21 here) suggests an influential observation
• Cook's distance by observation: Cook's distance > 4/n (= 0.21 here) may suggest influence (a cutoff of 1 is also used)
• Y by predicted values: should form an even band around the line
• Residual-fit spread plot: see Cleveland, Visualizing Data (1993)
• Histogram of residuals: look for skewness and other departures from normality
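The diagnostics panel is produced automatically when ODS Graphics is enabled. To check the numeric cutoffs directly, PROC REG's OUTPUT statement can save the quantities behind the plots to a dataset; a minimal sketch (the dataset and variable names RegDiag, Flagged, Pred, etc. are ours, not from the slides):

ods graphics on;                 /* enables the Fit Diagnostics panel */
proc reg data = Children;
  model Weight = Height;
  output out = RegDiag           /* dataset holding the diagnostics */
         p = Pred                /* predicted values */
         r = Resid               /* raw residuals */
         rstudent = RStud        /* studentized residuals */
         h = Leverage            /* leverage */
         cookd = CooksD;         /* Cook's distance */
run;

/* Keep only observations exceeding the rough cutoffs discussed above */
data Flagged;
  set RegDiag;
  if Leverage > 0.21 or CooksD > 0.21 or abs(RStud) > 3.5;
run;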
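If the residual and Q-Q plots do point to a transformation, a common first attempt is to model log(Y) and re-examine the diagnostics; a minimal sketch (the names ChildrenLog and logWeight are ours):

data ChildrenLog;
  set Children;
  logWeight = log(Weight);   /* natural log of the outcome */
run;

proc reg data = ChildrenLog;
  model logWeight = Height;
run;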
Thoughts on Outliers
• An outlier is NOT a point that fails to support the study hypothesis
• Removing data can introduce biases
• Check for outlying values in X and Y before fitting the model, not after
• Is there another model that fits better? Do you need a nonlinear model or a data transformation?
• Was there an error in data collection?
• Robust regression is an alternative (a sketch appears at the end of this section)

Multiple Linear Regression

A Multiple Linear Regression Example—SAS Code
proc reg data = Children;
  model Weight = Height Age;
run;

A Multiple Linear Regression Example—SAS Output

Parameter Estimates
                     Parameter    Standard
Variable       DF     Estimate       Error    t Value    Pr > |t|
Intercept       1   -141.22376    33.38309      -4.23      0.0006
Height          1      3.59703     0.90546       3.97      0.0011
Age             1      1.27839     3.11010       0.41      0.6865

Adjusting for age, weight still increases significantly with height (P = 0.0011).
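As in the simple model, each t value is the estimate divided by its standard error, now with error df = n − (number of parameters) = 19 − 3 = 16. For Height, for example (the 95% interval is our added illustration, using t₀.₉₇₅,₁₆ ≈ 2.12):

\[
t = \frac{3.59703}{0.90546} \approx 3.97, \qquad 3.59703 \pm 2.12 \times 0.90546 \approx (1.68,\ 5.52)
\]

In SAS, the CLB option on the MODEL statement (model Weight = Height Age / clb;) requests these confidence limits directly.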
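Finally, the robust regression alternative mentioned under Thoughts on Outliers is available in SAS through PROC ROBUSTREG; a minimal sketch using M-estimation (the default method), applied to the same example:

proc robustreg data = Children method = m;
  model Weight = Height;   /* M-estimation downweights outlying observations */
run;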