Copyright © 2012 by Statistical Innovations Inc. All rights reserved.

Session 1 Basics

Goals:

1. Review regression basics
2. Introduce issues associated with overfitting
3. Show how M-fold cross-validation can be used to reduce overfitting

Note: While Parts A and B provide a basic review and may be skipped, since such knowledge is assumed as part of the prerequisites, they also serve to introduce notation that will be used throughout this course.

Session Outline:

A. Primary types of regression – linear, logistic, linear discriminant analysis (LDA)

B. Prediction vs. classification
   1. Assessing model performance
   2. R2 for a continuous dependent variable
   3. Classification Table and ROC Curve for a dichotomous dependent variable
   4. Accuracy and AUC

C. Examples with simulated data
   1. Linear regression
   2. Logistic regression / LDA (see Technical Appendix)
   3. Raw vs. standardized coefficients as measures of predictor importance
   4. Problems with stepwise regression (these problems become worse as the number of predictors P approaches, and exceeds, the sample size N)

D. Cross-validation
   1. Assessing/validating model performance with and without cut-points
   2. Holdout samples and M-fold cross-validation
   3. Generalizability and R2
   4. Repeated rounds of M-folds

Technical Appendix PowerPoint: “Session 1 Technical Appendix.ppt” – Logistic Regression and Linear Discriminant Analysis (LDA)


A. Notation and basics for primary types of regression – linear, logistic, linear discriminant analysis (LDA)

Regression analysis predicts a dependent variable as a function f of one or more predictor variables X = (X1, X2, …, XP), with f(x) expressing the conditional expectation of the dependent variable given the values of the predictors. The scale type of the dependent variable (continuous vs. categorical) determines whether f expresses the linear or the logistic regression model. In Session 1 we will limit our discussion to 1) linear regression models, where the dependent variable, denoted by Y, is continuous, and 2) logistic regression models, where the dependent variable, denoted by Z, is dichotomous (Z = 1 or Z = 0). We will begin with linear regression.

Linear Regression Basics

A linear regression model expresses the conditional expectation of Y given X, E(Y|X=x), as a linear function of X. Each sample observation is assumed to be generated by an underlying process described by the linear model. For example, the linear regression for two predictors, X = (X1, X2), is:

Y = α + β1X1 + β2X2 + ε     (1.1)

where the random error ε is assumed to be normally distributed and E(ε | X = x) = 0 for all x.

By taking the expected value of both sides, Eq. (1.1) can be expressed in the equivalent form:

E( Y | X x )   1 X 1   2 X 2 (1.2)

ˆ ˆ B = (ˆ , 1 ,  2 )′ are estimates obtained for the model parameters, which are used to get predicted values of Y for each sample observation i=1,2,…,N. For example, the prediction for case i having values for the predictors x1i and x2i, is obtained as:

ˆ ˆˆ Yiˆ  1 x 1 i   2 x 2 i (1.3)

Taking X to be the N × (P+1) matrix formed by concatenating an N × 1 vector of 1’s with the values of the P predictor variables observed in a sample of size N, we have X = (1 x1 x2), so that

Eq. (1.3) can be expressed in matrix notation as Ŷ = XB̂, and a vector of residuals e = (ei) is defined as e = Y − Ŷ. In linear regression, parameter estimates are obtained using the method of Ordinary Least Squares (OLS), which minimizes the sum of squares of the residuals, ∑ eᵢ² over i = 1, …, N.
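As an informal illustration (not part of the original course materials), here is a minimal numpy sketch of the matrix computation, using hypothetical simulated data and variable names of our own choosing:

```python
import numpy as np

# Hypothetical simulated data; the coefficients below are assumptions for illustration.
rng = np.random.default_rng(0)
N = 100
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=N)  # assumed "true" model plus error

X = np.column_stack([np.ones(N), x1, x2])     # N x (P+1): column of 1's, then predictors
B_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS estimates (alpha-hat, beta1-hat, beta2-hat)
y_hat = X @ B_hat                             # predictions: Y-hat = X B-hat
e = y - y_hat                                 # residuals; OLS minimizes (e ** 2).sum()
```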


Prediction vs. Description/Explanation

Prediction focuses on Ŷ and its properties, while description or explanation focuses on B̂ and its properties. Our focus here will primarily be on prediction. In linear regression, the primary measure of goodness of prediction is the R2 statistic, which can be computed as the squared correlation between the observed and predicted Y values. R2 can be interpreted as the percentage of variance of Y that is explained by X. Minimizing the sum of squared errors is equivalent to maximizing R2.
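Continuing the numpy sketch above, R2 follows directly from that squared-correlation definition:

```python
# R2 as the squared correlation between observed and predicted Y
r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
```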

Under the linear regression assumptions where Eq. (1.1) reflects the true population model, OLS estimates have the following optimal properties:

1) Unbiasedness

2) BLUE (Best Linear Unbiased Estimator) – Among all linear unbiased estimators, OLS has the smallest variance for B̂ and Ŷ. We will see that introducing a small amount of bias may reduce the variance substantially, thus reducing prediction error.

Predictors with non-zero coefficients in the true population model will be referred to as valid predictors. It is useful to distinguish between two other kinds of predictors: 1) irrelevant predictors and 2) extraneous predictors. Irrelevant predictors are independent of (and therefore uncorrelated with) each of the valid predictors and also with the dependent variable. Extraneous predictors are correlated with the dependent variable and with at least one valid predictor, but when included as additional predictors along with the valid predictors, their population regression coefficients are 0.

Venn diagrams provide a useful visual representation of different types of predictors. For example, Figure 1A (upper left hand corner) shows a typical regression with two valid predictors, X1 and X2.

Suppose that the dependent variable Y = the amount of alimony for child support paid by a respondent, the valid predictor X1 = the respondent’s income, and further suppose that X3 (= the respondent’s current spouse’s income) is independent of Y, of X1, and of a second valid predictor, X2. This would then be an example where predictor X3 is irrelevant, as depicted in Figure 1B (upper right-hand corner). Now, define household income as X4 = X1 + X3. Since X3 is irrelevant, X4 is extraneous, assuming that X1 is also included in the regression. This is depicted in Figure 1C (lower left-hand corner).
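A small simulation (our own illustration, with assumed coefficients) makes the point concrete: once the valid predictor X1 is in the model, the estimated coefficient of the extraneous X4 = X1 + X3 is near its population value of 0, even though X4 by itself is correlated with Y.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000                          # large N, so estimates approach population values
x1 = rng.normal(size=N)              # valid predictor: respondent's income
x3 = rng.normal(size=N)              # irrelevant: spouse's income, independent of x1 and Y
y = 2.0 * x1 + rng.normal(size=N)    # assumed true model: only x1 matters (coefficient is ours)
x4 = x1 + x3                         # extraneous: household income

X = np.column_stack([np.ones(N), x1, x4])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[2])                          # coefficient on x4: ~0 once x1 is in the model
print(np.corrcoef(x4, y)[0, 1])      # yet x4 by itself is clearly correlated with Y
```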

Valid predictors may also include suppressor variables. In Figure 1D (lower right-hand corner of the 4-square), X5 serves as a suppressor variable. Session 3 describes the important role of suppressors in greater detail. For an introductory example of suppressor variables, see: http://www.statisticalinnovations.com/technicalsupport/Suppressor.AMSTAT.pdf

[Figure 1 (Venn diagrams; image not reproduced):
Fig. 1A: X2 is a valid predictor.
Fig. 1B: X3 is an irrelevant predictor.
Fig. 1C: X4 is an extraneous predictor.
Fig. 1D: X5 is a (classical) suppressor variable.]


In practice, the valid predictors will generally not be known. When one or more valid predictors are omitted from the regression, an extraneous variable may serve a useful purpose as an imperfect proxy for an omitted predictor, and hence its inclusion can improve prediction.

For example, in Figure 1C, if X1 were not available for inclusion as a predictor, the extraneous variable X4 would no longer be extraneous. In this case, X4 together with X2 in a 2-predictor model would improve prediction over the model with X2 as the sole predictor. In contrast, inclusion of irrelevant variables in the regression will improve prediction only for cases in the analysis sample; prediction of new cases outside the analysis sample deteriorates due to an increase in the generalization error. This latter result will be illustrated in Part C using simulated data.

B. Prediction vs. Classification

Now suppose that the dependent variable Z is dichotomous. For simplicity, we will assign scores of 1 and 0 to its categories; hence, the conditional expectation of Z given X corresponds to the conditional probability that Z = 1 given X:

E(Z|X) = prob(Z=1|X)*1 + prob(Z=0|X)*0 = prob(Z=1|X)

Thus, a regression model used to predict Z yields predicted probabilities for Z=1. Note that this also yields predicted probabilities for Z=0, since prob(Z=0|X) can always be computed from prob(Z=1|X) as 1 – prob(Z=1|X). Since the dependent variable is categorical (in fact, here it is dichotomous), the regression model can be used not only as a predictive model but also for classification. For a given case with X = xi, if the predicted probability that Z=1 exceeds .5, the outcome for that case is predicted to be in (classified into) category Z=1. Otherwise, the case is classified into outcome category Z=0. That is, the case is assigned to category Z=1 if the predicted probability for Z=1 exceeds the predicted probability for Z=0.
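In code, the classification rule is a one-liner; a sketch with hypothetical predicted probabilities:

```python
import numpy as np

p_hat = np.array([0.12, 0.58, 0.50, 0.91])   # hypothetical predicted prob(Z=1 | X = x_i)
z_pred = (p_hat > 0.5).astype(int)           # classify as Z=1 iff prob(Z=1) exceeds prob(Z=0)
# z_pred -> array([0, 1, 0, 1]): the case with p_hat = 0.50 does not exceed .5, so Z=0
```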

While R2 is the primary statistic used to assess prediction in the case of linear regression, accuracy (ACC) and the Area Under the ROC Curve (AUC) are the corresponding statistics used for the logistic regression model, which is the primary model used to predict a dichotomous dependent variable. For details on the ACC and AUC statistics, as well as the logistic regression model and its relationship to linear discriminant analysis, see the Session 1 Technical Appendix PowerPoint: “Session 1 Technical Appendix.ppt”.
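As a preview (the Technical Appendix gives the details), both statistics can be computed directly from the predicted probabilities. The sketch below uses the rank-sum (Mann–Whitney) identity for the AUC rather than explicit ROC integration; it assumes z is a numpy array of 0/1 outcomes:

```python
import numpy as np

def acc(z, p_hat, cut=0.5):
    """Accuracy: proportion of cases correctly classified at the given cut-point."""
    return np.mean((p_hat > cut).astype(int) == z)

def auc(z, p_hat):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity (ties ignored)."""
    order = np.argsort(p_hat)
    ranks = np.empty(len(p_hat))
    ranks[order] = np.arange(1, len(p_hat) + 1)   # rank the predicted probabilities 1..N
    n1 = int(z.sum())                             # number of Z=1 cases
    n0 = len(z) - n1                              # number of Z=0 cases
    return (ranks[z == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
```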


C. Examples with Simulated Data

We will begin with some examples based on a continuous dependent variable Y using the linear regression model. Suppose that the true population model has only a single predictor variable, RP5, and that the population R2 = 0.2. In particular, the population parameters are α = 0 and β = 3.86, so the true model is Y = 0 + 3.86·RP5 + ε.
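To simulate data consistent with this setup, the error variance must be chosen to match the population R2. Since R2 = β²·Var(RP5) / (β²·Var(RP5) + σε²), assuming a scale for RP5 (the standard deviation below is our assumption; the course files do not state it) determines σε. A sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
N, beta, R2 = 5000, 3.86, 0.2
sigma_x = 1.0   # assumed scale for RP5; not stated in the course files
# Solve R2 = beta^2 s_x^2 / (beta^2 s_x^2 + s_eps^2) for the error standard deviation:
sigma_eps = np.sqrt(beta**2 * sigma_x**2 * (1 - R2) / R2)

rp5 = rng.normal(scale=sigma_x, size=N)
y = 0.0 + beta * rp5 + rng.normal(scale=sigma_eps, size=N)   # Y = 0 + 3.86*RP5 + eps
```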

Linear regression

In the real world, we generally do not know the true model. Suppose that in addition to the valid predictor ‘RP5’ we have measurements on 28 additional irrelevant variables ‘extra1’, ‘extra2’, …, ‘extra28’ that serve as candidate predictor variables. Output from stepwise linear regression based on simulated data with N = 5,000 cases is given in Table 1A. Table 1A shows that RP5 enters in Step 1, and in Step 2 an irrelevant variable, ‘extra8’, enters the equation. In both steps, the estimate of R2 based on this large sample approximates the true population value of 0.2.

Table 1A: Results of stepwise regression (forward selection) for N=5,000 – coefficients at each step

Model            B    Std. Error   Beta       t    Sig.   R Square
1  (Constant)  -0.06     0.04             -1.32   .186     .1976
   RP5          3.81     0.11     0.44    35.09   .000
2  (Constant)  -0.06     0.04             -1.34   .179     .1983
   RP5          3.81     0.11     0.44    35.12   .000
   extra8       0.20     0.10     0.03     2.04   .042

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)

Unstandardized regression coefficients depend upon the scale of the predictors. For example, if RP5 were replaced by RP5′ = RP5/2, the corresponding regression coefficient (and standard error) would be twice the original values, and all other statistics and all predictions would be unchanged.

The standardized coefficients reported in Table 1A are measures of importance, representing the coefficients that would be obtained if all predictors and the dependent variable were standardized by dividing by their standard deviations. This standardization removes the effect of scale, so that the effects of predictors measured in different units can be compared.
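The conversion between the two sets of coefficients is simple: for predictor j, betaⱼ = bⱼ · sd(Xⱼ) / sd(Y). A sketch (assuming numpy arrays for the predictor and dependent variable):

```python
def standardized_beta(b_j, x_j, y):
    """Standardized coefficient: beta_j = b_j * sd(X_j) / sd(Y)."""
    return b_j * x_j.std(ddof=1) / y.std(ddof=1)
```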

Table 1B shows comparable results obtained from the final step of the backwards elimination option in traditional OLS regression. Here we have 3 irrelevant predictors entering the model.

Table 1B: Results of stepwise regression (backwards elimination) for N = 5,000

R-sq = .1993

                B    Std. Error   Beta        t    Sig.
(Constant)   -0.06     0.04              -1.336   .182
RP5           3.81     0.11     0.44     35.141   .000
extra19      -0.23     0.12    -0.05     -1.903   .057
extra22       0.44     0.14     0.08      3.055   .002
extra28      -0.15     0.09    -0.02     -1.657   .098

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)

Validation

Because OLS maximizes R2 on the data used for estimation, R2 cannot decrease when additional predictors are included in the model. However, the Validation-R2, computed from independent validation data (not used to estimate the model parameters), will tend to decline when irrelevant variables are added to the model.
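A sketch of that training/validation comparison (the function and variable names are ours, not from the course files):

```python
import numpy as np

def validation_r2(X_train, y_train, X_val, y_val):
    """Fit OLS on the training cases; return R2 computed on the validation cases."""
    Xt = np.column_stack([np.ones(len(y_train)), X_train])
    b = np.linalg.lstsq(Xt, y_train, rcond=None)[0]
    Xv = np.column_stack([np.ones(len(y_val)), X_val])
    y_hat = Xv @ b
    return np.corrcoef(y_val, y_hat)[0, 1] ** 2
```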


Exercise C1. For SPSS Users:

‘LMTr&Val.sav’

The data file ‘LMTr&Val.sav’ contains simulated data for N=5,000 training cases (‘Validation’ = 0) and N = 5,000 validation cases (‘Validation’ = 1).

1) Use stepwise linear regression (forward selection option) to reproduce the results in Table 1A based on the training sample. The variable ‘Validation’ on the data file can be used to select the training cases (‘Validation=0’).

2) For Model 1 containing ‘RP5’ as the sole predictor and Model 2 containing both ‘RP5’ and ‘extra8’, compute R2 on the validation data. Verify that the validation-R2 decreases when the irrelevant variable ‘extra8’ is added.

For XLSTAT Users:

‘LMTr&Val.xls’

The data file ‘LMTr&Val.xls’ contains simulated data for N=5,000 training cases (‘Validation’ = 0) and N = 5,000 validation cases (‘Validation’ = 1). The first 5,000 rows contain the training cases.

1) Use stepwise linear regression (forward selection option) to reproduce the results in Table 1A. The validation tab can be used to select the last 5,000 rows as validation.


2) For Model 1 containing ‘RP5’ as the sole predictor and Model 2 containing both ‘RP5’ and ‘extra8’, compute R2 on the validation data. Verify that the validation-R2 decreases when the irrelevant variable ‘extra8’ is added.

Exercise C2.

‘LMTr&Val.sav’

‘LMTr&Val.xls’

Repeat Exercise C1 using the backwards elimination option instead of forward selection.

1) Verify that RP5 together with 3 irrelevant variables ‘extra19’, ‘extra22’, and ‘extra28’ are included in the final model so that you reproduce Table 1B.

2) At each step, the predictor eliminated may not necessarily be the one with the smallest absolute standardized coefficient. However, verify that at each of the first four steps, the predictor eliminated IS the one for which the standardized coefficient is smallest in magnitude. (Note to XLSTAT Users: p-values and coefficients for predictors are not provided for each step).

3) Let Model 1 be the model with RP5 as the sole predictor and Model 2 be the model containing all 4 predictors. Compare the Validation-R2 for both models. Which is larger?


Example C1. We now reduce the sample size from N = 5,000 to N = 50 and continue to specify the model with P = 29 candidate predictors. Note that this reduction in N causes a substantial increase in the predictors-to-sample-size ratio, from P/N ≈ 0.006 to P/N ≈ 0.6. The results from stepwise regression are given in Tables 2A and 2B for forward selection and backwards elimination, respectively. Under forward selection, only one irrelevant predictor is selected along with RP5, while under backwards elimination, 8 irrelevant predictors are selected along with RP5.

Comparing these results to the corresponding Tables 1A and 1B, which were based on the larger sample size, note that the standard errors for the coefficients based on the smaller sample size are much higher. Note also that all 9 predictors in Table 2B are statistically significant. The R2 for the training data is 0.6737, much higher than the true population value of 0.2. The Validation-R2 is only 0.0578; this large drop-off confirms a substantial amount of overfitting.

Table 2A: Results of stepwise regression (forward selection) for N = 50 – coefficients at each step. At Step 2, Training-R2 = 0.4557 and Validation-R2 = 0.1513.

Model            B    Std. Error   Beta      t    Sig.    Training R Square
1  (Constant)   0.18     0.42             0.44   0.666    0.3459
   RP5          5.83     1.16     0.59    5.04   0.000
2  (Constant)   0.71     0.42             1.67   0.102    0.4557
   RP5          6.28     1.08     0.63    5.84   0.000
   extra9       2.99     0.97     0.33    3.08   0.003

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)


Table 2B: Results from backwards elimination for N = 50. Training-R2 = 0.6737, Validation-R2 = 0.0578

                B    Std. Error   Beta       t    Sig.
(Constant)    1.29      .38               3.397   .002
RP5           5.45      .99      .55      5.51    .000
extra5        2.68      .77      .44      3.47    .001
extra6        4.04     1.42      .45      2.85    .007
extra8       -7.51     2.09     -.75     -3.59    .001
extra19       4.14     1.43      .79      2.90    .006
extra20      -3.85     1.58     -.60     -2.44    .019
extra23      -4.76     1.78     -.69     -2.67    .011
extra24       4.19     1.26      .66      3.32    .002
extra26       3.88     1.88      .54      2.06    .045

(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.)

This example shows some of the problems due to overfitting associated with stepwise regression. Frank Harrell posted the following comments on the web about the problems arising from the use of stepwise regression:

Frank Harrell’s comments:

Here are some of the problems with stepwise variable selection.

1. It yields R-squared values that are badly biased to be high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, 1989, Statistics in Medicine).
4. It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman, 1992).
9. It allows us to not think about the problem.
10. It uses a lot of paper.

“All possible subsets” regression solves none of these problems.

Some journals will not accept articles that utilize stepwise regression (see, e.g., Thompson, 1995).

References

Altman, D. G. and P. K. Andersen. 1989. Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8: 771–783.

Copas, J. B. 1983. Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society, Series B 45: 311–354. Shows why the number of CANDIDATE variables, and not the number in the final model, is the number of degrees of freedom to consider.

Derksen, S. and H. J. Keselman. 1992. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology 45: 265–282.

Hurvich, C. M. and C. L. Tsai. 1990. The impact of model selection on inference in linear regression. American Statistician 44: 214–217.

Mantel, Nathan. 1970. Why stepdown procedures in variable selection. Technometrics 12: 621–625.

Roecker, Ellen B. 1991. Prediction error and its estimation for subset-selected models. Technometrics 33: 459–468. Shows that all-possible-subsets regression can yield models that are too small.

Thompson, Bruce. 1995. Stepwise regression and stepwise discriminant analysis need not apply here: a guidelines editorial. Educational and Psychological Measurement 55 (4): 525–534.

Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.


Exercise C3.

‘LMTr&Val.sav’

‘LMTr&Val.xls’

1a) Confirm that at each step of backwards elimination, eliminating the predictor that has the smallest absolute standardized coefficient yields the same solution as elimination based on the highest p-value (the usual criterion) for these data. For example, in Step 1 the variable extra14 is excluded because its p-value (.955) is the highest among all predictors. Also in Step 1, the standardized coefficient (‘beta’) for extra14 is -0.031, which is closer to 0 than the corresponding beta for any of the other predictors.

(Note to XLSTAT Users: p-values and coefficients for predictors are not provided for each step).

1b) The results shown in Table 2B can also be reproduced using the CORExpress program or XLSTAT-CCR. Verify that the regression coefficients and the R2 statistics are the same as reported in Table 2B.

To see this, open the saved project:

For CORExpress Users: “Session1.LMTr&Val.spp”.

For XLSTAT-CCR Users: “Session1LM.txt”.


Exercise C4. Confirm that for forward selection the final model contains RP5 plus 1 irrelevant variable, ‘extra9’, as shown in Table 2A.

D. Cross-validation

As shown in Example C1, a validation sample can be useful for determining whether a model is overfit, a substantial drop-off in R2 between the training and validation data being indicative of overfitting. In practice, about one-third to one-half of the sample cases are often held out (not used for estimation) for purposes of validation. However, if the sample size is already relatively small, holding out this many cases for validation is not warranted, since the reduced sample size for estimation would be too small to give reliable estimates. Cross-validation (CV) was developed to deal with this problem and has become a popular alternative to validation. In CV, all cases are used for both estimation and validation.

The most popular CV method is called M-fold cross-validation. M-fold CV works as follows (a code sketch follows the steps):

1. Randomly assign each of the N sample cases to one of M subgroups (‘folds’). M is usually taken to be between 5 and 10, most frequently M = 10. Let Im denote the set of indices for cases assigned to fold m, m = 1, 2, …, M, and let Nm denote the number of cases in fold m. Whenever possible, M should be chosen such that Nm is constant.

2. The regression procedure is applied M times, each time based on the sample excluding the cases in fold m, m = 1, 2, …, M.

3. Let Ŷi(m) denote the prediction for holdout case i ∈ Im (a case assigned to the holdout fold m), obtained from the model estimated without fold m. Accumulate the predictions for all N cases and compute the performance (or loss) statistic. For example, for linear regression we will use the cross-validated R2 statistic, CV-R2, as the performance statistic, computed as the square of the correlation between these accumulated predictions and the observed Y.
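Here is a minimal sketch of steps 1–3 for linear regression (numpy only; the fold-assignment trick and all names are ours):

```python
import numpy as np

def cv_r2(X, y, M=10, seed=0):
    """M-fold cross-validated R2 for OLS, following steps 1-3 above."""
    N = len(y)
    folds = np.random.default_rng(seed).permutation(N) % M   # step 1: random, near-equal folds
    X1 = np.column_stack([np.ones(N), X])
    y_hat = np.empty(N)
    for m in range(M):                                       # step 2: estimate M times
        train, hold = folds != m, folds == m
        b = np.linalg.lstsq(X1[train], y[train], rcond=None)[0]
        y_hat[hold] = X1[hold] @ b                           # step 3: accumulate holdout predictions
    return np.corrcoef(y, y_hat)[0, 1] ** 2                  # squared correlation with observed Y
```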


Note that in step 2 above it is “the regression procedure” that is applied M times. The ‘procedure’ includes the way that the predictor variables were selected. It is natural to ask, “Is the model obtained from stepwise regression valid?” However, as shown below, to address this question it is incorrect to simply apply M-fold cross-validation to the resulting model, ignoring the fact that its predictors were selected on the basis of the entire sample. It is better to ask, “How valid is it to use a 9-predictor model obtained by stepwise regression?”

According to Hastie, Tibshirani, and Friedman (2010: Section 7.10.2), cross-validation is often misused, and the overly optimistic (biased) results of such misuse frequently appear in top journals. The correct approach to M-fold CV is to reproduce the entire regression procedure in each of the M subsamples, including the variable selection step. In Example D1 below we illustrate the substantial bias that can be expected to occur when CV does not take into account the variable selection step.
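To make the distinction concrete, here is a schematic sketch in the same style as above (our own illustration, not the CORExpress or XLSTAT-CCR procedure; ‘select_vars’ is a hypothetical stand-in for any variable-selection routine, such as stepwise elimination):

```python
import numpy as np

# `select_vars(X, y)` is an assumed callable returning the indices of the kept
# columns; its implementation (e.g., stepwise elimination) is not shown here.

def cv_r2_wrong(X, y, select_vars, M=10, seed=0):
    """Wrong way: select predictors once on the FULL sample, then cross-validate."""
    keep = select_vars(X, y)              # selection sees all cases -- information leaks
    return cv_r2(X[:, keep], y, M, seed)  # reuses cv_r2 from the sketch above

def cv_r2_right(X, y, select_vars, M=10, seed=0):
    """Right way: repeat the selection step inside every training fold."""
    N = len(y)
    folds = np.random.default_rng(seed).permutation(N) % M
    y_hat = np.empty(N)
    for m in range(M):
        train, hold = folds != m, folds == m
        keep = select_vars(X[train], y[train])    # selection sees training folds only
        Xt = np.column_stack([np.ones(train.sum()), X[train][:, keep]])
        b = np.linalg.lstsq(Xt, y[train], rcond=None)[0]
        Xh = np.column_stack([np.ones(hold.sum()), X[hold][:, keep]])
        y_hat[hold] = Xh @ b
    return np.corrcoef(y, y_hat)[0, 1] ** 2
```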

Example D1: The Wrong Way to do Cross-validation

Suppose we attempt to use 10-fold CV to evaluate the solution summarized in Table 2B and to compare it to models with fewer predictors. To do this comparison correctly we need to take into account how much work was done to select the predictors in the model. Table 3 compares CV results obtained using the wrong and right approach for each of 9 models, containing between 1 and 9 predictors. The 9-predictor model was obtained by standard backwards elimination, and the others by continuing the elimination steps (ignoring the apparent significance of the predictors reported in the stepwise procedure), each time eliminating the predictor with the highest estimated p-value until only 1 predictor remains.


Table 3. Comparison of CV-R2 values for models containing between P = 1 and 9 predictors obtained on simulated sample #1 with N = 50.

           CV-R-sq
P    Wrong way   Right way
9      0.560       0.102
8      0.421       0.096
7      0.411       0.107
6      0.338       0.133
5      0.302       0.154
4      0.342       0.188
3      0.305       0.199
2      0.328       0.297
1      0.293       0.288

Since the CV results are based on a small sample of N = 50, even when applied correctly, CV might not be expected to identify exactly the correct number of valid predictors. The rightmost column of Table 3 shows that when CV is performed correctly, the highest CV-R2 (0.297) occurs for P = 2 predictors (see Figure 2). The other column shows that when CV is performed incorrectly, we obtain an inflated value of 0.560 for the CV-R2 and an excessive number of predictors (see Figure 1).


Figure 1. CV-R2 results obtained using cross-validation the wrong way mistakenly suggest that the model with 9 predictors is best. (Results for simulated sample #1 with N = 50.)

Figure 2. CV-R2 results obtained using cross-validation the right way suggest that a model with 2 predictors is best, and that a model with 1 predictor is close to best. (Results for simulated sample #1 with N = 50.)


In the remainder of this session we illustrate how M-fold CV should not be performed. In Session 2, when we discuss the topic of regularization, we will show a correct way to perform M-fold CV, one which takes into account all the work (including the work done to select the variables) that goes into obtaining each of the models.

Mini Tutorial: How not to do M-fold CV

Note: For XLSTAT-CCR users, see: “Session1.RightandWrongWayCV-R-sq.pdf”

The following tutorial is for CORExpress Users:

Opening a Previously Saved Project: File → Load Project → Select ‘CV-R-sq.spp’ and click Open to load the project.

Figure 3. Loading a previously saved project


Opening the Model Specifications for the Saved Project: Double-click on “CV-R-sq Incorrect Way” in the Datasets window.

The control window will now show the saved model specifications and the corresponding model output, illustrating how not to perform M-fold CV.

Figure 4.

The CV-R2 values shown above were plotted earlier in Figure 1. Note that the CV-R2 declines as it is applied to successive models, as the number of predictors in the model is reduced from 9 down to 1 based on the magnitude of the p-values. If CV were working correctly, we would expect to see higher CV-R2 values as the number of predictors is reduced from 9 to 8 to 7, etc. In other words, we would expect to see results similar to those provided in Table 4A (plotted in Figure 5).


Table 4A. ISIM1.K3

# Predictors   CV-R^2
     9         0.1020
     8         0.0957
     7         0.1068
     6         0.1333
     5         0.1536
     4         0.1879
     3         0.1993
     2         0.2969
     1         0.2884

Figure 5.

In summary, the wrong way of performing M-fold CV is to apply it to a particular model, ignoring the fact that the predictors in the model were selected using the entire sample. The right way applies M-fold CV to evaluate the entire modeling approach including variable selection.

In Session 2, we examine the rationale for several regularization approaches, including Correlated Component Regression (CCR). The results above were based on a regularized model obtained from CCR, where the amount of regularization is determined by reducing the number of components to K = 3. Table 4B below reports results from using 10 rounds of 10-folds for this CCR model. In particular, the bottom row of the table shows that in three of the rounds (rounds #4, #5, and #6), only one predictor was included in the model, that predictor being ‘RP5’. In each of the remaining seven rounds, RP5 is again included, along with only one irrelevant predictor.


Table 4B. ISIM1.K3

                          Round #
Predictor     All   1   2   3   4   5   6   7   8   9  10
RP5           100  10  10  10  10  10  10  10  10  10  10
extra6         39   6   6   6   0   0   0   4   6   6   5
extra9         11   1   2   2   0   0   0   1   1   3   1
extra7          9   0   1   1   0   0   0   3   2   0   2
extra24         8   1   0   1   0   0   0   2   1   1   2
extra26         3   2   1   0   0   0   0   0   0   0   0
Total         170  20  20  20  10  10  10  20  20  20  20
# Predictors        2   2   2   1   1   1   2   2   2   2

Again, in Session 2 we will show how to perform M-fold CV to validate an entire modeling approach, as illustrated in the saved model “CV-R-sq Correct Way”. Performing M-fold CV correctly approximates the results that would be obtained from an independent validation sample, if such a sample were available. Performing M-fold CV incorrectly produces an overly optimistic assessment of a model.

Session 1 Tutorial ‘ISIM1.K3 Tutorial.pdf’ illustrates the use of the program CORExpress to reproduce the results given above. When we introduce the CCR approach in Session 2, we will explain how the value K=3 is determined as part of the cross-validation and discuss these results in more detail.

For XLSTAT-CCR users, follow along with Session 1 Tutorial ‘ISIM1.K3 Tutorial_XLSTAT.pdf’


Key Driver Regression Example

Correlated Component Regression (CCR) is formally introduced in Session 2. As an additional motivating example, see Tutorial 1, which illustrates an additional benefit of regularization. In particular, this key driver regression example shows that traditional regression yields unreasonable negative coefficients for certain key drivers, while the regularized solution provided by CCR yields more reasonable positive coefficients for these predictors.

Follow along with the Auto Price Tutorial:

For CORExpress Users:

“COREtutorial1.pdf”

‘autoprice.sav’

For XLSTAT-CCR Users:

“XLCCRtutorial1.pdf”

‘autoprice.xls’
