
Blythe Durbin-Johnson, Ph.D.
April 2017

We are video recording this seminar, so please hold questions until the end. Thanks!

When to Use Linear Regression

• Continuous outcome variable
• Continuous or categorical predictors
• Need at least one continuous predictor for the name "regression" to apply

When NOT to Use Linear Regression

• Binary outcomes
• Count outcomes
• Unordered categorical outcomes
• Ordered categorical outcomes with few (<7) levels

Generalized linear models and other special methods exist for these settings

Some Interchangeable Terms

• Outcome = Response = Dependent Variable
• Predictor = Covariate = Independent Variable

Simple Linear Regression

• Model outcome Y by one continuous predictor X:

Y = β₀ + β₁X + ε

• ε is a normally distributed (Gaussian) error term

Model Assumptions

• Normally distributed residuals ε
• Error variance is the same for all observations (homoscedasticity)
• Y is linearly related to X
• Y observations are not correlated with each other
• X is treated as fixed; no distributional assumptions
• Covariates do not need to be normally distributed!
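A minimal SAS sketch (not from the slides) of data that satisfies these assumptions; the parameter values β₀ = 2, β₁ = 0.5, and error SD 1 are arbitrary:

data sim;
   call streaminit(2017);                   /* seed for reproducibility */
   do i = 1 to 100;
      x = 10 * rand('uniform');             /* X: fixed once drawn, no normality needed */
      /* Y is linear in X, with independent, constant-variance normal errors */
      y = 2 + 0.5*x + rand('normal', 0, 1);
      output;
   end;
run;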

A Simple Linear Regression Example

Data from Lewis and Taylor (1967), via http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_reg_examples03.htm

Goal: Find the straight line that minimizes the sum of squared distances from the actual weights to the fitted line (the "least squares fit")
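In symbols (the standard least squares formulas, added for reference), the fitted intercept b₀ and slope b₁ minimize

SSE = Σᵢ (Yᵢ − b₀ − b₁Xᵢ)²,

which gives b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and b₀ = Ȳ − b₁X̄.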

A Simple Linear Regression Example—SAS Code

proc reg data = Children;
   model Weight = Height;
run;

Children is a SAS dataset including variables Weight and Height

Simple Linear Regression Example—SAS Output

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1          -143.02692          32.27459     -4.43     0.0004
Height        1             3.89903           0.51609      7.55     <.0001

• Standard Error: S.E. of the slope and intercept
• t Value: parameter estimate divided by its S.E.
• Intercept: estimated weight for a child of height 0 (not always interpretable...)
• Slope: how much weight increases for a 1-inch increase in height
• Fitted line: Weight = -143.0 + 3.9 × Height
• Weight increases significantly with height
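To make the slope concrete (the 62-inch height here is an arbitrary example value, not from the slides): predicted Weight = -143.03 + 3.899 × 62 ≈ 98.7 pounds for a 62-inch-tall child.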

Simple Linear Regression Example—SAS Output

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       7193.24912    7193.24912     57.08   <.0001
Error             17       2142.48772     126.02869
Corrected Total   18       9335.73684

• Model SS: sum of squared differences between the model fit and the mean of Y
• Error SS: sum of squared differences between the model fit and the observed values of Y
• Corrected Total SS: sum of squared differences between the mean of Y and the observed values of Y
• Mean Square: Sum of Squares / DF
• F Value: Mean Square(Model) / MSE
• Regression on X provides a significantly better fit to Y than the null (intercept-only) model
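Worked from the table above: MSE = 2142.48772 / 17 ≈ 126.03, and F = 7193.24912 / 126.02869 ≈ 57.08.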

Simple Linear Regression Example—SAS Output

Root MSE            11.22625    R-Square   0.7705
Dependent Mean     100.02632    Adj R-Sq   0.7570
Coeff Var           11.22330

• R-Square: percent of the variance of Y explained by the regression
• Dependent Mean: mean of Y
• Coeff Var: Root MSE / mean of Y (× 100)
• Adj R-Sq: version of R-square adjusted for the number of predictors in the model

Thoughts on R-Squared

• For our model, R-square is 0.7705
  • 77% of the variability in weight is explained by height
• Not a measure of the goodness of fit of the model:
  • If the error variance is high, R-square will be low even with the "right" model
  • Can be high with the "wrong" model (e.g., Y isn't linear in X)
  • See http://data.library.virginia.edu/is-r-squared-useless/
• Always gets higher when you add more predictors
  • Adjusted R-square is intended to correct for this
• Take with a grain of salt
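Worked from the ANOVA table on the previous slide:

R-Square = Model SS / Corrected Total SS = 7193.24912 / 9335.73684 ≈ 0.7705
Adj R-Sq = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − 0.2295 × 18/17 ≈ 0.757
Root MSE = √MSE = √126.02869 ≈ 11.226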

Simple Linear Regression Example—SAS Output

[Fit Diagnostics for Weight: 3×3 panel of diagnostic plots. Summary box: Observations 19, Parameters 2, Error DF 17, MSE 126.03, R-Square 0.7705, Adj R-Square 0.757]

Simple Linear Regression Example—SAS Output

[Fit Diagnostics for Weight panel, repeated; residuals vs. predicted value highlighted]

• Residuals should form an even band around 0
• Size of residuals shouldn't change with the predicted value
• Sign of residuals shouldn't change with the predicted value

[Plot: residuals vs. fitted values] Suggests Y and X have a nonlinear relationship

[Plot: residuals vs. fitted values] Suggests a data transformation is needed

Simple Linear Regression Example—SAS Output

[Fit Diagnostics for Weight panel, repeated; normal Q-Q plot highlighted]

• Plot of model residuals versus quantiles of a normal distribution
• Deviations from the diagonal line suggest departures from normality

Normal Q-Q Plot

[Normal Q-Q plot of residuals, curving away from the diagonal] Suggests a data transformation may be needed

Simple Linear Regression Example—SAS Output

[Fit Diagnostics for Weight panel, repeated; outlier and influence diagnostics highlighted]

• Studentized (scaled) residuals by predicted values: cutoff for outliers depends on n; use 3.5 for n = 19 with 1 predictor
• Studentized residuals by leverage: leverage > 2(p + 1)/n (= 0.21) suggests an influential observation
• Cook's distance by observation: values > 4/n (= 0.21) may suggest influence (a cutoff of 1 is also used)
• Y by predicted values: should form an even band around the line
• Histogram of residuals: look for skewness or other departures from normality
• Residual-fit plot: see Cleveland, Visualizing Data (1993)
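Worked for this example: with n = 19 observations and p = 1 predictor, the leverage cutoff is 2(p + 1)/n = 4/19 ≈ 0.21, and the Cook's distance cutoff is 4/n = 4/19 ≈ 0.21.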

Thoughts on Outliers

• An outlier is NOT a point that fails to support the study hypothesis
• Removing data can introduce biases
• Check for outlying values in X and Y before fitting the model, not after
• Is there another model that fits better? Do you need a nonlinear model or a data transformation?
• Was there an error in data collection or entry?
• Robust regression is an alternative

Multiple Linear Regression

A Multiple Linear Regression Example—SAS Code

proc reg data = Children;
   model Weight = Height Age;
run;

A Multiple Linear Regression Example—SAS Output

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1          -141.22376          33.38309     -4.23     0.0006
Height        1             3.59703           0.90546      3.97     0.0011
Age           1             1.27839           3.11010      0.41     0.6865
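Written out, the fitted model from this table is Weight = -141.22 + 3.60 × Height + 1.28 × Age.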

Adjusting for age, weight still increases significantly with height (P = 0.0011). Adjusting for height, weight is not significantly associated with age (P = 0.6865).

Categorical Variables

• Let's try adding in gender, coded as "M" and "F":

proc reg data = Children;
   model Weight = Height Gender;
run;

ERROR: Variable Gender in list does not match type prescribed for this list.

Categorical Variables

• For proc reg, categorical variables have to be recoded as 0/1 variables:

data children;
   set children;
   if Gender = 'F' then numgen = 1;
   else if Gender = 'M' then numgen = 0;
   else call missing(numgen);
run;

Categorical Variables

• Let's try fitting our model with height and gender again, with gender coded as 0/1:

proc reg data = Children;
   model Weight = Height numgen;
run;

Categorical Variables

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1          -126.16869          34.63520     -3.64     0.0022
Height        1             3.67890           0.53917      6.82     <.0001
numgen        1            -6.62084           5.38870     -1.23     0.2370

• Adjusting for gender, weight still increases significantly with height
• Adjusting for height, mean weight does not differ significantly between genders

Categorical Variables

• Can use proc glm to avoid recoding categorical variables
• Recommend this approach if a categorical variable has more than 2 levels

proc glm data = children;
   class Gender;
   model Weight = Height Gender;
run;

Proc glm output

Source   DF   Type I SS     Mean Square   F Value   Pr > F
Height    1   7193.249119   7193.249119     58.79   <.0001
Gender    1    184.714500    184.714500      1.51   0.2370

Source   DF   Type III SS   Mean Square   F Value   Pr > F
Height    1   5696.840666   5696.840666     46.56   <.0001
Gender    1    184.714500    184.714500      1.51   0.2370

• Type I SS are sequential
• Type III SS are nonsequential
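One way to see the difference (note added for clarity): the Type I SS for Height (7193.25) equals the model SS from the simple regression of Weight on Height shown earlier, because Type I SS adjust each term only for the terms listed before it in the model statement; the Type III SS for Height (5696.84) is smaller because it is adjusted for Gender as well.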

Proc glm

• By default, proc glm only gives ANOVA tables
• Need to add estimate statements to get parameter estimates:

proc glm data = children;
   class Gender;
   model Weight = Height Gender;
   estimate 'Height' height 1;
   estimate 'Gender' Gender 1 -1;
run;

Proc glm

Parameter   Estimate       Standard Error   t Value   Pr > |t|
Height       3.67890306        0.53916601      6.82     <.0001
Gender      -6.62084305        5.38869991     -1.23     0.2370

Same estimates as with proc reg
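Why 'Gender' 1 -1 works (note added here, not on the original slide): with class Gender, the estimate coefficients apply to the class levels in sorted order (F, M), so 1 -1 estimates mean(F) − mean(M). This is why the Gender estimate (−6.62) matches the numgen coefficient from proc reg, where females were coded 1.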

• 'Rule of thumb' suggests the model should include no more than 1 covariate for every 10–15 observations
• What if you have more?
  • Pre-specify a smaller model based on literature, subject-matter knowledge, etc.
  • Select a smaller model in a data-driven fashion

Stepwise Methods

• Forward selection: Start with the best single-variable model; add variables until no variable meets the criteria to enter the model
• Backward elimination: Start with the full model; remove variables until no variable meets the criteria to be removed
• Forward and backward selection: Variables can be added and removed at each step

Stepwise Methods

• Different criteria to enter and leave the model can be used:
  • P-value from ANOVA F-test
  • Mallows C(p)
    • An estimate of mean square prediction error
  • Adjusted R-square
  • AIC (not implemented in proc reg; see the proc glmselect sketch below)
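proc reg does not offer AIC-based selection, but proc glmselect does; a minimal sketch, assuming the same nsqip_basechars dataset used in the example below:

proc glmselect data = nsqip_basechars;
   /* add and remove variables using AIC rather than F-test p-values */
   model logcreat = bmi logalbum steroid smoke age2 sex
         / selection = stepwise(select = AIC);
run;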

Model Selection Example

Data Set Name         WORK.NSQIP_BASECHARS         Observations           1413
Member Type           DATA                         Variables              7
Engine                V9                           Indexes                0
Created               04/08/2017 14:24:17          Observation Length     56
Last Modified         04/08/2017 14:24:17          Deleted Observations   0
Protection                                         Compressed             NO
Data Set Type                                      Sorted                 NO
Label
Data Representation   WINDOWS_64
Encoding              wlatin1 Western (Windows)

Alphabetic List of Variables and Attributes

#   Variable   Type   Len
1   age2       Num    8
6   album      Num    8
7   bmi        Num    8
5   creat      Num    8
4   sex        Num    8
3   smoke      Num    8
2   steroid    Num    8

Model Selection Example

proc reg data = nsqip_basechars;
   model logcreat = bmi logalbum steroid smoke age2 sex / selection = forward;
run;

Model Selection Example

Summary of Forward Selection

Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square   C(p)      F Value   Pr > F
1      sex                1                0.0842             0.0842           57.7965    129.68   <.0001
2      age2               2                0.0305             0.1147           11.0253     48.50   <.0001
3      logalbum           3                0.0059             0.1206            3.6103      9.42   0.0022
4      smoke              4                0.0013             0.1219            3.4948      2.12   0.1458

Model Selection Example

proc reg data = nsqip_basechars;
   model logcreat = bmi logalbum steroid smoke age2 sex / selection = backward;
run;

Model Selection Example

Summary of Backward Elimination

Step   Variable Removed   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      steroid            5                0.0001             0.1221           5.2208      0.22   0.6385
2      bmi                4                0.0002             0.1219           3.4948      0.27   0.6006
3      smoke              3                0.0013             0.1206           3.6103      2.12   0.1458

Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept   -0.41900             0.07742           2.17950       29.29   <.0001
logalbum     0.14193             0.04625           0.70071        9.42   0.0022
age2         0.00345             0.00047777        3.88588       52.23   <.0001
sex         -0.17584             0.01473          10.60565      142.54   <.0001

Caveats about stepwise model selection

• Tests of the final model don't have the correct distributions
  • P-values will be incorrect
  • Regression coefficients will be biased
  • R-squared values will be too high
• Doesn't handle multicollinearity well

See http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/ among many others

Multicollinearity

• Highly correlated covariates cause problems
  • Inflated standard errors
  • Sometimes can't fit the model at all
• Two highly correlated variables might be significant on their own and very non-significant when included in a model together

Diagnosing Multicollinearity

proc reg data = nsqip_basechars;
   model logcreat = logalbum smoke age2 sex / vif;
run;

Diagnosing Multicollinearity

Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept     1   -0.39537             0.07907            -5.00     <.0001   0
logalbum      1    0.13626             0.04639             2.94     0.0034   1.01490
smoke         1   -0.02821             0.01939            -1.46     0.1458   1.06614
age2          1    0.00328             0.00049220          6.66     <.0001   1.07280
sex           1   -0.17541             0.01473           -11.91     <.0001   1.00267

• Variance Inflation Factor (VIF): rule of thumb suggests VIF > 10 means severe multicollinearity
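For reference (standard definition, not shown on the slide): the VIF for the j-th covariate is VIF_j = 1/(1 − R²_j), where R²_j is the R-square from regressing covariate j on the other covariates. For example, logalbum's VIF of 1.0149 corresponds to R²_j ≈ 0.015, i.e., it shares almost no variance with the other covariates.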

Other Issues

Data Transformations

• Skewed residuals
• Residuals with non-constant variance
• Large 'outliers' at one end of the data scale

• Data transformation may be the answer to these issues

Data Transformations

• Log transformation: Useful for lab data, other biological assay data
• Reciprocal transformation: Reduces skewness
• Logit transformation: Use with percentages bounded away from 0 and 1
• Box-Cox family of power transformations
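A minimal data step sketch (assumed, not from the slides) of how the transformed variables used in the examples below (logcreat, logalbum, recipcreat) might be created; the raw variable names creat and album come from the proc contents listing above, and the logit line is shown for a hypothetical proportion variable p:

data nsqip_basechars;
   set nsqip_basechars;
   logcreat   = log(creat);    /* natural log of creatinine */
   logalbum   = log(album);    /* natural log of albumin */
   recipcreat = 1 / creat;     /* reciprocal of creatinine */
   /* logit for a hypothetical proportion p in (0, 1): log(p / (1 - p)) */
run;

Box-Cox power transformations can be explored with proc transreg, e.g., model boxcox(creat) = identity(logalbum).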

Data Transformation Example

[Plot: residuals from the model creat = logalbum, plotted against logalbum]

proc reg data = nsqip_basechars;
   model creat = logalbum;
run;

Data Transformation Example

[Plot: residuals from the model logcreat = logalbum, plotted against logalbum]

proc reg data = nsqip_basechars;
   model logcreat = logalbum;
run;

Data Transformation Example

[Plot: residuals from the model recipcreat = logalbum, plotted against logalbum]

proc reg data = nsqip_basechars;
   model recipcreat = logalbum;
run;

Regression vs. Correlation

• Regression and correlation analysis are closely related
• Regression designates one variable as the outcome; correlation does not
• Regression slope = Pearson correlation × SD(Y)/SD(X)
• P-values from simple linear regression and from a correlation test will be identical
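As a check against the earlier output (note added here): in simple linear regression, R-square = r², so for the weight-height example r = √0.7705 ≈ 0.88, with the same sign as the slope.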

Thank you!

Help is Available

• CTSC Office Hours
  – Every Tuesday from 12 – 1:30 in Sacramento
  – Sign up through the CTSC Biostatistics Website
• EHS Biostatistics Office Hours
  – Every Monday from 2 – 4 in Davis
• Request Biostatistics Consultations
  – CTSC: www.ucdmc.ucdavis.edu/ctsc/
  – MIND IDDRC: www.ucdmc.ucdavis.edu/mindinstitute/centers/iddrc/cores/bbrd.html
  – Cancer Center and EHS Center