Ordinary Least Squares Regression and Regression Diagnostics

April 20, 2001
Jim Patrie
Department of Health Evaluation Sciences
Division of Biostatistics and Epidemiology
University of Virginia, Charlottesville, VA
jpatrie@virginia.edu

Presentation Outline

I) Overview of Regression Analysis.
II) Different Types of Regression Analysis.
III) The General Linear Model.
IV) Ordinary Least Squares Regression Parameter Estimation.
V) Statistical Inference for the OLS Regression Model.
VI) Overview of the Model Building Process.
VII) An Example Case Study.

Introduction

The term "regression analysis" describes a collection of statistical techniques that serve as the basis for drawing inference about whether a relationship exists between two or more quantities within a system or within a population. More specifically, regression analysis is a method for quantitatively characterizing the relationship between a response variable Y, which is assumed to be random, and one or more explanatory variables (X), whose values are generally assumed to be fixed.

I) Overview of Regression Analysis

Regression analysis is typically utilized for one of the following purposes.

• Description: to assess whether a response variable, or perhaps a function of the response variable, is associated with one or more independent variables.
• Control: to control for secondary factors which may influence the response variable but are not considered the primary explanatory variables of interest.
• Prediction: to predict the value of the response variable at specific values of the explanatory variables.

II) Types of Regression Analysis

• General linear regression.*
• Non-linear regression.
• Robust regression.
  - Least median squares regression.
  - Least absolute deviation regression.
  - Weighted least squares regression.
• Non-parametric regression.
• Generalized linear regression.
  - Logistic regression.
  - Log-linear regression.

III) The General Linear Model

In regard to form, the general linear model is expressed as

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i

where y_i is the ith response; \beta_0, \beta_1, \ldots, \beta_{p-1} are the regression parameters; x_{i,1}, x_{i,2}, \ldots, x_{i,p-1} are known constants; and \varepsilon_i is the independent random error associated with the ith response, typically assumed to be distributed N(0, \sigma^2).

In matrix notation the general linear model is expressed as

y = X\beta + \varepsilon

where
y = n x 1 vector of response values.
X = n x p matrix of known constants (covariates).
\beta = p x 1 vector of regression parameters.
\varepsilon = n x 1 vector of independent, identically distributed random errors, typically assumed to be distributed N(0, \sigma^2).

In matrix notation, the general linear model components are

y_{(n \times 1)} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad
X_{(n \times p)} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{bmatrix},

\beta_{(p \times 1)} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \qquad
\varepsilon_{(n \times 1)} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}

The Y and X Model Components

The column vector y consists of random variables with a continuous scale measure, while the columns of X may consist of:

• continuous scale measures, or functions of continuous scale measures (e.g. polynomials, splines).
• binary indicators (e.g. gender).
• nominal or ordinal classification variables (e.g. age class).
• product terms computed between the values of two or more of the columns of X, referred to as interaction terms.
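To make these column types concrete, the following is a minimal sketch in Python with NumPy that assembles an n x p design matrix X from an intercept column, a continuous covariate, a binary indicator, and their interaction. The variable names and values (age, gender) are hypothetical, chosen only to mirror the bullets above.

```python
import numpy as np

# Hypothetical data for six subjects (values are illustrative only).
age    = np.array([34.0, 41.0, 29.0, 55.0, 47.0, 62.0])  # continuous measure
gender = np.array([0, 1, 0, 1, 1, 0])                    # binary indicator

# Assemble the n x p design matrix X column by column.
X = np.column_stack([
    np.ones(age.size),  # column of 1s carrying the intercept beta_0
    age,                # continuous scale measure
    gender,             # binary indicator
    age * gender,       # product term: an age-by-gender interaction
])

print(X.shape)  # (6, 4): n = 6 rows, p = 4 regression parameters
```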
Examples of Linear Models

a) y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i

b) y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,1}^2 + \beta_3 x_{i,2} + \beta_4 x_{i,1} x_{i,2} + \varepsilon_i

c) y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 \log_e(x_{i,2}) + \beta_3 \exp(x_{i,3}) + \varepsilon_i

* Note that in each case the value y_i is linear in the model parameters.

Examples of Non-Linear Models

a) Exponential model: y_i = \gamma_0 + \gamma_1 \exp(\gamma_2 x_i) + \varepsilon_i

b) Logistic model: y_i = \gamma_0 / (1 + \gamma_1 \exp(\gamma_2 x_i)) + \varepsilon_i

c) Weibull model: y_i = \alpha - \beta \exp(-\gamma x_i^\delta) + \varepsilon_i

IV) Parameter Estimation for the Ordinary Least Squares Model

a) For the estimation of the vector \beta, we minimize

Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_{p-1} x_{i,p-1})^2

by simultaneously solving the p normal equations

\frac{\partial Q}{\partial \beta} = 0, \quad \text{that is,} \quad \frac{\partial Q}{\partial \beta_0} = 0, \ \frac{\partial Q}{\partial \beta_1} = 0, \ \ldots, \ \frac{\partial Q}{\partial \beta_{p-1}} = 0.

In matrix notation, we minimize Q = (y - X\beta)'(y - X\beta):

\frac{\partial}{\partial \beta} [(y - X\beta)'(y - X\beta)] = 0
2(X'X)\beta = 2X'y
(X'X)\beta = X'y

with the resulting estimator for \beta expressed as

b = (X'X)^{-1} X'y.

b) Estimation of the variance of y_i, symbolically expressed as \sigma^2\{y_i\}.

Let e denote the vector of residuals from the model fit:

e_{(n \times 1)} = y - Xb = (y - \hat{y}).

The sum of squares error (SSE) equals

SSE_{(1 \times 1)} = e'e = (y - \hat{y})'(y - \hat{y})

and the estimator for \sigma^2\{y_i\} is expressed as

\hat{\sigma}^2\{y_i\} = \frac{SSE}{n - p}, \quad \text{where } p = \text{number of regression parameters.}

Typically, \hat{\sigma}^2\{y_i\} is referred to as the residual MSE.

c) Estimation of the variance of b, symbolically expressed as \sigma^2\{b\}. Since b = (X'X)^{-1}X'y, and Var(y) = \sigma^2\{y_i\} I_n,

Var(b) = [(X'X)^{-1}X'] \, Var(y) \, [(X'X)^{-1}X']' = \sigma^2\{y_i\} (X'X)^{-1}(X'X)(X'X)^{-1} = \sigma^2\{y_i\} (X'X)^{-1},

and the estimator for \sigma^2\{b\} is expressed as

\hat{\sigma}^2\{b\} = \hat{\sigma}^2\{y_i\} (X'X)^{-1} = MSE \, (X'X)^{-1}.

d) Estimation of the variance of \hat{y}_i, symbolically expressed as \sigma^2\{\hat{y}_i\}. Since \hat{y}_i = x_i' b, where x_i is the p x 1 covariate vector for the ith case,

\sigma^2\{\hat{y}_i\} = x_i' \, \sigma^2\{b\} \, x_i

and the estimator for \sigma^2\{\hat{y}_i\} is expressed as

\hat{\sigma}^2\{\hat{y}_i\} = x_i' \, \hat{\sigma}^2\{y_i\} (X'X)^{-1} x_i = MSE \, [x_i'(X'X)^{-1} x_i].

(A numerical sketch of these computations appears at the end of Section V.)

V) Statistical Inference for the Least Squares Regression Model

a) ANOVA sum of squares decomposition.

SS Total = SS Regression + SS Error

\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Table 1. Regression ANOVA table.

Source       SS    DF     MS           F-test
Regression   SSR   p - 1  SSR/(p - 1)  MSR/MSE
Error        SSE   n - p  SSE/(n - p)
Total        SST   n - 1

[Figure: scatterplot of Y versus X illustrating the decomposition; SST corresponds to y_i - \bar{y}, SSR to \hat{y}_i - \bar{y}, and SSE to y_i - \hat{y}_i.]

b) Hypothesis tests related to \beta:

H_0: \beta_i = 0 \quad \text{versus} \quad H_a: \beta_i \neq 0.

Test statistic:

t^* = \frac{b_i - 0}{\hat{\sigma}\{b_i\}} = \frac{b_i - 0}{\sqrt{MSE \, (X'X)^{-1}_{i,i}}}, \quad \text{where } t^* \sim t(n - p).

If |t^*| < t(1 - \alpha/2; n - p), conclude H_0.
If |t^*| \geq t(1 - \alpha/2; n - p), conclude H_a.

c) Confidence limits for \hat{y}_i at a specific x_i:

(1 - \alpha)\% \, CL = \hat{y}_i \pm t(1 - \alpha/2; n - p) \, \hat{\sigma}\{\hat{y}_i\} = \hat{y}_i \pm t(1 - \alpha/2; n - p) \sqrt{MSE \, [x_i'(X'X)^{-1} x_i]}.

d) Prediction limits for a new y_i at a specific x_i:

(1 - \alpha)\% \, PL = \hat{y}_i \pm t(1 - \alpha/2; n - p) \, \hat{\sigma}\{y_{new,i}\} = \hat{y}_i \pm t(1 - \alpha/2; n - p) \sqrt{MSE \, [1 + x_i'(X'X)^{-1} x_i]}.

e) Simultaneous confidence band for the regression line Y = X\beta:

(1 - \alpha)\% \, CB = \hat{y}_i \pm \sqrt{p \, F(1 - \alpha; p, n - p) \, \hat{\sigma}^2\{\hat{y}_i\}} = \hat{y}_i \pm \sqrt{p \, F(1 - \alpha; p, n - p) \, MSE \, [x_i'(X'X)^{-1} x_i]}

where p = the number of regression parameters and F(1 - \alpha; p, n - p) = the critical value of an F-distribution with p and n - p df evaluated at the (1 - \alpha)100 percentile.

[Figure: fitted regression line with the confidence limits (CL), simultaneous confidence band (CB), and prediction limits (PL) plotted against X.]
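The sketch below works Sections IV and V(b) numerically in Python with NumPy and SciPy on simulated data; the model y = 1 + 2x + noise and all numbers are made up purely for illustration. It solves the normal equations for b, forms the residual MSE, the estimated covariance of b, and the t-statistics for H_0: \beta_i = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2001)

# Simulated data: y = 1 + 2x + N(0,1) noise (illustrative only).
n = 50
x = rng.uniform(0.0, 8.0, n)
X = np.column_stack([np.ones(n), x])          # n x p design matrix
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
p = X.shape[1]

# Normal equations (X'X)b = X'y  =>  b = (X'X)^-1 X'y.
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# Residuals, SSE, and the residual MSE estimate of sigma^2{y_i}.
e = y - X @ b
MSE = (e @ e) / (n - p)

# Estimated covariance of b: MSE (X'X)^-1.
cov_b = MSE * XtX_inv

# t-tests of H0: beta_i = 0 versus Ha: beta_i != 0.
t_star = b / np.sqrt(np.diag(cov_b))
p_value = 2.0 * stats.t.sf(np.abs(t_star), df=n - p)

print(b)        # estimates of beta_0 and beta_1
print(t_star)   # t* statistics on n - p df
print(p_value)  # two-sided p-values
```

Inverting X'X here mirrors the closed-form expression on the slides; in production code a QR factorization or numpy.linalg.lstsq is numerically preferable.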
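Likewise, the interval formulas in V(c) through V(e) can be packaged as one small helper. The function name ols_intervals is hypothetical, and the routine is a sketch under the same assumptions as above (X contains an intercept column and the errors are N(0, \sigma^2)).

```python
import numpy as np
from scipy import stats

def ols_intervals(X, y, x_i, alpha=0.05):
    """Return (CL, PL, CB) for the fitted value at covariate vector x_i."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    MSE = (e @ e) / (n - p)

    y_hat = x_i @ b
    se_fit = np.sqrt(MSE * (x_i @ XtX_inv @ x_i))        # sigma-hat{y_hat_i}
    se_new = np.sqrt(MSE * (1.0 + x_i @ XtX_inv @ x_i))  # sigma-hat{y_new,i}

    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - p)
    W = np.sqrt(p * stats.f.ppf(1.0 - alpha, p, n - p))  # sqrt(p F) multiplier

    CL = (y_hat - t_crit * se_fit, y_hat + t_crit * se_fit)
    PL = (y_hat - t_crit * se_new, y_hat + t_crit * se_new)
    CB = (y_hat - W * se_fit, y_hat + W * se_fit)
    return CL, PL, CB
```

With the simulated X and y from the previous sketch, ols_intervals(X, y, np.array([1.0, 4.0])) returns the three 95% intervals at x = 4.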
When to Use CL, CB, and PL

• Apply CL when your goal is to predict the expected value of y_i for one specific vector of predictors x_i within the range of the data.
• Apply CB when your goal is to predict the expected value of y_i for all vectors of predictors x_i within the range of the data.
• Apply PL when your goal is to predict the value of y_i for one specific vector of predictors x_i within the range of the data.

VI) Overview of the Model-Building Process

Data Collection → Data Reduction → Model Development → Model Diagnostics and Refinement → Model Validation → Interpretation

a) Data Collection

• Controlled experiments. Covariate information consists of explanatory variables that are under the experimenter's control.
• Controlled experiments with supplemental variables. Covariate information includes supplemental variables related to the characteristics of the experimental units, in addition to the variables that are under the experimenter's control.
• Exploratory observational studies. Covariate information may include a large number of variables related to the characteristics of the observational unit, none of which are under the investigator's control.