
San José State University
Math 261A: Regression Theory & Methods

Simple Linear Regression

Dr. Guangliang Chen

This lecture is based on the following textbook sections:

• Chapter 2: 2.1 - 2.6

Outline of this presentation:

• The problem
• Least-squares estimation
• Inference

The simple linear regression problem

Consider the following (population) regression model

$$y = \beta_0 + \beta_1 x + \varepsilon,$$
where

• $x$: predictor (fixed)
• $y$: response (random)
• $\beta_0$: intercept, $\beta_1$: slope
• $\varepsilon$: random error/noise

Sample regression model

Given a set of locations $x_1, \ldots, x_n$, let the corresponding responses be

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the errors $\varepsilon_i$ have mean 0 and variance $\sigma^2$:
$$E(\varepsilon_i) = 0, \quad \mathrm{Var}(\varepsilon_i) = \sigma^2,$$
and additionally are uncorrelated:

$$\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j.$$
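As a quick illustration, here is a minimal R sketch that simulates one sample from this model; the true values $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$ and the grid of locations are arbitrary choices for the demonstration, not from the text:

```r
set.seed(261)                          # for reproducibility
n <- 50
x <- seq(0, 10, length.out = n)        # fixed predictor locations
beta0 <- 1; beta1 <- 2; sigma <- 0.5   # hypothetical true parameters
eps <- rnorm(n, mean = 0, sd = sigma)  # iid errors (uncorrelated by construction)
y <- beta0 + beta1 * x + eps           # responses from the model
plot(x, y)                             # scatter plot of the simulated sample
```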


At those same locations, let the observations of the responses also be $y_1, \ldots, y_n$ (this is an abuse of notation), so that we have a data set $\{(x_i, y_i) \mid 1 \le i \le n\}$.

The goal is to use the sample to estimate $\beta_0, \beta_1$ in some way (so as to fit a line to the data).

Remark. Depending on the context, the notation $y_i$ can denote either a random variable or an observed value of it.

Least-squares (LS) estimation

To estimate the regression coeffi- y y = βˆ0 + βˆ1x cients β0, β1, here we adopt the least b

b squares criterion: yi b b

b n yˆi b b def X 2 min S(βˆ0, βˆ1) = (yi−(βˆ0 + βˆ1xi)) b ˆ ˆ b β0,β1 i=1 | {z } yˆi

The corresponding minimizers are xi x called .

yi: observation, yˆi: fitted value Remark. Another way is to maximize the likelihood of the sample (Sec 2.11). Dr. Guangliang Chen | Mathematics & Statistics, San Jos´e State University6/70 Simple Linear Regression

Notation: To solve the problem, we need to define some quantities first:

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i$$
and
$$S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2, \qquad S_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y).$$


It can be shown that
$$S_{xx} = \sum_{i=1}^n x_i^2 - n\bar x^2, \qquad S_{xy} = \sum_{i=1}^n x_i y_i - n\bar x\bar y.$$
Verify:
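A quick numerical check of these identities in R, reusing the simulated x and y from the earlier sketch (any sample would do):

```r
Sxx_def <- sum((x - mean(x))^2)                # definition of Sxx
Sxx_alt <- sum(x^2) - length(x) * mean(x)^2    # shortcut formula
Sxy_def <- sum((x - mean(x)) * (y - mean(y)))  # definition of Sxy
Sxy_alt <- sum(x * y) - length(x) * mean(x) * mean(y)
all.equal(Sxx_def, Sxx_alt)  # TRUE (up to floating-point tolerance)
all.equal(Sxy_def, Sxy_alt)  # TRUE
```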


Theorem 0.1. The LS estimators of the intercept and slope in the simple linear regression model are
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{S_{xy}}{S_{xx}}.$$

Proof. Taking partial derivatives of

$$S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$

and setting them to zero gives
$$\frac{\partial S}{\partial \hat\beta_0} = -2\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0, \qquad \frac{\partial S}{\partial \hat\beta_1} = -2\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)\,x_i = 0,$$
which can then be simplified to
$$\sum y_i = n\hat\beta_0 + \hat\beta_1 \sum x_i, \qquad \sum x_i y_i = \hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2.$$
The first equation can be rewritten as

$$\bar y = \hat\beta_0 + \hat\beta_1 \bar x,$$
from which we obtain that
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$

Plugging it into the second equation yields that

$$\sum x_i y_i = (\bar y - \hat\beta_1 \bar x)\, n\bar x + \hat\beta_1 \sum x_i^2,$$
and further that
$$\underbrace{\sum x_i y_i - n\bar x\bar y}_{S_{xy}} = \hat\beta_1 \underbrace{\Big(\sum x_i^2 - n\bar x^2\Big)}_{S_{xx}}.$$
This thus completes the proof.


Remark. We make the following observations:

• The LS regression line always passes through the centroid $(\bar x, \bar y)$ of the data: $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$.


• Alternative forms of the equation of the LS regression line are
$$y = \underbrace{(\bar y - \hat\beta_1 \bar x)}_{\hat\beta_0} + \hat\beta_1 x = \bar y + \hat\beta_1 (x - \bar x).$$

• To study the effect of different samples on the regression coefficients, we regard the $y_i$ as random variables (in this case $\bar y, \hat\beta_0, \hat\beta_1$ are also random variables). It can be shown that (homework problem: 2.25)
$$\mathrm{Cov}(\bar y, \hat\beta_1) = 0, \qquad \mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\sigma^2 \frac{\bar x}{S_{xx}}.$$
That is, $\bar y, \hat\beta_1$ are uncorrelated, but $\hat\beta_0, \hat\beta_1$ are not.


• The residuals of the model are
$$e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) = y_i - \big(\bar y + \hat\beta_1 (x_i - \bar x)\big).$$

• $\sum e_i = 0$. This implies that $\sum y_i = \sum \hat y_i$, and thus $\{\hat y_i\}$ and $\{y_i\}$ have the same mean. Proof:


• $\sum x_i e_i = 0$, and $\sum \hat y_i e_i = 0$. Proof:


Example 0.1 (Toy data). Given a data set of 3 points: (0, 1), (1, 0), (2, 2), find the least-squares regression line.

Solution. First, $\bar x = 1 = \bar y$, and
$$S_{xx} = \sum x_i^2 - n\bar x^2 = 5 - 3 = 2, \qquad S_{xy} = \sum x_i y_i - n\bar x\bar y = 4 - 3 = 1.$$
It follows that
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{1}{2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{1}{2}.$$
Thus, the regression line is given by
$$y = \hat\beta_0 + \hat\beta_1 x = \frac{1}{2} + \frac{1}{2} x.$$
The fitted values and their residuals are
$$\hat y_1 = \frac{1}{2},\ \hat y_2 = 1,\ \hat y_3 = \frac{3}{2} \qquad \text{and} \qquad e_1 = \frac{1}{2},\ e_2 = -1,\ e_3 = \frac{1}{2}.$$
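The same fit can be reproduced in R as a sanity check (a minimal sketch; the variable names are mine):

```r
x_toy <- c(0, 1, 2); y_toy <- c(1, 0, 2)
fit_toy <- lm(y_toy ~ x_toy)
coef(fit_toy)       # intercept 0.5, slope 0.5, matching the hand computation
fitted(fit_toy)     # 0.5, 1.0, 1.5
residuals(fit_toy)  # 0.5, -1.0, 0.5
```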


Example 0.2 (R demonstration). Consider the dataset that contains weights and heights of 507 physically active individuals (247 men and 260 women).¹ We fit a regression line of weight (y) versus height (x) in R.

¹ http://jse.amstat.org/v11n2/datasets.heinz.html
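A sketch of how such a fit might be obtained, assuming the heights and weights have already been loaded into vectors height and weight (the variable names are assumptions, not from the original source):

```r
fit <- lm(weight ~ height)  # LS fit of weight on height
summary(fit)                # coefficients, standard errors, t tests, R^2
plot(height, weight)
abline(fit)                 # overlay the fitted LS line, as in the figure below
```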


[Scatter plot of weight (kg) versus height (cm) with the fitted LS line $y = -105.01125 + 1.01762x$.]


Inference in simple linear regression

• Model parameters: $\beta_0$ (intercept), $\beta_1$ (slope), $\sigma^2$ (noise variance)

• Inference tasks (for each parameter above): point estimation, interval estimation*, hypothesis testing*

• Inference of the mean response at any location $x_0$:
$$E(y \mid x_0) = \beta_0 + \beta_1 x_0$$

*To perform the last two inference tasks, we will additionally assume that the model errors $\varepsilon_i$ are normally and independently distributed with mean 0 and variance $\sigma^2$, i.e., $\varepsilon_1, \ldots, \varepsilon_n \stackrel{\text{iid}}{\sim} N(0, \sigma^2)$.


Point estimation in regression

Theorem 0.2. The LS estimators $\hat\beta_0, \hat\beta_1$ are unbiased linear estimators of the model parameters $\beta_0, \beta_1$, that is,
$$E(\hat\beta_0) = \beta_0, \qquad E(\hat\beta_1) = \beta_1.$$
Furthermore,
$$\mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right), \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}.$$

Remark. The Gauss-Markov Theorem states that the LS estimators $\hat\beta_0, \hat\beta_1$ are the best linear unbiased estimators, in that they have the smallest possible variance among all linear unbiased estimators of $\beta_0, \beta_1$.

Proof. Write
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar x)\, y_i}{S_{xx}} = \sum c_i y_i, \qquad c_i = \frac{x_i - \bar x}{S_{xx}}.$$
It follows that
$$E(\hat\beta_1) = \sum c_i E(y_i) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_0 \underbrace{\sum c_i}_{=0} + \beta_1 \underbrace{\sum c_i x_i}_{=1} = \beta_1$$
and
$$\mathrm{Var}(\hat\beta_1) = \sum c_i^2 \underbrace{\mathrm{Var}(y_i)}_{=\sigma^2} = \sigma^2 \sum c_i^2 = \sigma^2 \cdot \frac{1}{S_{xx}} = \frac{\sigma^2}{S_{xx}}.$$


For $\hat\beta_0$, it is unbiased for estimating $\beta_0$ because
$$E(\hat\beta_0) = E(\bar y - \hat\beta_1 \bar x) = E(\bar y) - E(\hat\beta_1)\bar x = (\beta_0 + \beta_1 \bar x) - \beta_1 \bar x = \beta_0.$$
Using the formula $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$, we obtain that
$$\begin{aligned} \mathrm{Var}(\hat\beta_0) &= \mathrm{Var}(\bar y) + \mathrm{Var}(\hat\beta_1 \bar x) - 2\,\mathrm{Cov}(\bar y, \hat\beta_1 \bar x) \\ &= \frac{1}{n^2}\sum \mathrm{Var}(y_i) + \bar x^2\,\mathrm{Var}(\hat\beta_1) - 2\bar x \underbrace{\mathrm{Cov}(\bar y, \hat\beta_1)}_{=0} \\ &= \frac{1}{n^2}\, n\sigma^2 + \bar x^2 \frac{\sigma^2}{S_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right). \end{aligned}$$


To estimate the noise variance $\sigma^2$, we need to define

• Total Sum of Squares:
$$SS_T = \sum_{i=1}^n (y_i - \bar y)^2$$

• Regression Sum of Squares:
$$SS_R = \sum_{i=1}^n (\hat y_i - \bar y)^2$$

• Residual Sum of Squares:
$$SS_{Res} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2$$


It can be shown that

$$SS_T = SS_R + SS_{Res}.$$
Proof:
$$\begin{aligned} SS_T &= \sum (y_i - \bar y)^2 \\ &= \sum (y_i - \hat y_i + \hat y_i - \bar y)^2 \\ &= \sum (y_i - \hat y_i)^2 + \sum (\hat y_i - \bar y)^2 + 2\sum (y_i - \hat y_i)(\hat y_i - \bar y) \\ &= SS_{Res} + SS_R + 2\underbrace{\sum e_i \hat y_i}_{=0}. \end{aligned}$$


Another useful result is
$$SS_R = \hat\beta_1^2 S_{xx}.$$

Proof.

$$\begin{aligned} SS_R &= \sum (\hat y_i - \bar y)^2 \\ &= \sum \big(\underbrace{(\hat\beta_0 + \hat\beta_1 x_i)}_{\hat y_i} - \underbrace{(\hat\beta_0 + \hat\beta_1 \bar x)}_{\bar y}\big)^2 \\ &= \hat\beta_1^2 \sum (x_i - \bar x)^2 = \hat\beta_1^2 S_{xx}. \end{aligned}$$


The following theorem indicates how to use the residual sum of squares to estimate the error variance $\sigma^2$ when it is unknown.

Theorem 0.3. We have
$$E(SS_{Res}) = (n - 2)\sigma^2.$$
This implies that the residual mean square
$$MS_{Res} = \frac{SS_{Res}}{n - 2}$$
is an unbiased estimator for $\sigma^2$.


Proof. Write

$$SS_{Res} = SS_T - SS_R = \sum y_i^2 - n\bar y^2 - \hat\beta_1^2 S_{xx}.$$
Using the formula $E(X^2) = E(X)^2 + \mathrm{Var}(X)$, we have
$$\begin{aligned} E(SS_{Res}) &= \sum E(y_i^2) - n\,E(\bar y^2) - E(\hat\beta_1^2)\,S_{xx} \\ &= \sum \left[(\beta_0 + \beta_1 x_i)^2 + \sigma^2\right] - n\left[(\beta_0 + \beta_1 \bar x)^2 + \frac{\sigma^2}{n}\right] - \left(\beta_1^2 + \frac{\sigma^2}{S_{xx}}\right) S_{xx} \\ &= (n - 2)\sigma^2. \end{aligned}$$

This implies that $E(MS_{Res}) = E(SS_{Res})/(n - 2) = \sigma^2$.


Another way to use the sums of squares is to define a measure of the goodness of fit of the regression line.

Def 0.1 (Coefficient of determination).

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}$$

Remark. The quantity $0 \le R^2 \le 1$ indicates the proportion of variation of the response that is explained by the regression line.


Example 0.3 (Toy data). Consider again the toy data set that consists of 3 points: (0, 1), (1, 0), (2, 2). We have fitted the LS regression line earlier.

It is straightforward to obtain that
$$SS_{Res} = \sum e_i^2 = \left(\frac{1}{2}\right)^2 + (-1)^2 + \left(\frac{1}{2}\right)^2 = \frac{3}{2}.$$
Accordingly, a point estimate of $\sigma^2$ is
$$MS_{Res} = SS_{Res}/(n - 2) = 1.5.$$
To compute the coefficient of determination, we also need to compute $SS_T = \sum (y_i - \bar y)^2 = 2$. It follows that
$$R^2 = 1 - \frac{SS_{Res}}{SS_T} = 1 - \frac{1.5}{2} = 0.25.$$
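Again as a sanity check in R, continuing with the fit_toy object from the sketch in Example 0.1:

```r
sum(residuals(fit_toy)^2)                        # SS_Res = 1.5
sum(residuals(fit_toy)^2) / fit_toy$df.residual  # MS_Res = 1.5 (here n - 2 = 1)
summary(fit_toy)$r.squared                       # R^2 = 0.25
```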

Example 0.4 (weight-height). From the R output:

• The residual standard error is $\hat\sigma = 9.308$;

• The residual mean square is $MS_{Res} = 9.308^2 = 86.639$;

• The coefficient of determination is $R^2 = 0.5145$ (meaning that the LS regression line only captures 51.45% of the total variation).

Summary: Point estimation in simple linear regression

Model parameter | Point estimator | Bias | Variance
$\beta_0$ | $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ | unbiased | $\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)$
$\beta_1$ | $\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$ | unbiased | $\frac{\sigma^2}{S_{xx}}$
$\sigma^2$ | $MS_{Res} = \frac{SS_{Res}}{n-2}$ | unbiased | 

Remark. For the mean response at $x_0$,
$$E(y \mid x_0) = \beta_0 + \beta_1 x_0,$$
it is easy to see that $\hat\beta_0 + \hat\beta_1 x_0$ is an unbiased point estimator.


Next

We consider the following inference tasks in regression:

• Hypothesis testing

• Interval estimation


The $\chi^2$, $t$ and $F$ distributions

First, we need to review/introduce the following distributions:

• $\chi^2$
• $t$
• $F$


The $\chi^2$ distribution

$\chi^2$ is a special instance of the Gamma distribution: $\chi^2_k = \mathrm{Gamma}(\alpha = \frac{k}{2}, \lambda = \frac{1}{2})$, where $k$ is a positive integer, commonly referred to as the degrees of freedom of the distribution. It can be shown that $\chi^2_k$ is the distribution of $X = Z_1^2 + \cdots + Z_k^2$ for $Z_1, \ldots, Z_k \stackrel{\text{iid}}{\sim} N(0, 1)$.

Below are some known results about $X \sim \chi^2_k$ (inferred from the Gamma distribution):

• Density: $f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}, \quad x > 0$
• Properties: $E(X) = k$, $\mathrm{Var}(X) = 2k$



Student's t distribution

This is the distribution of a random variable of the form
$$T = \frac{Z}{\sqrt{X/\nu}}, \quad \text{where } Z \sim N(0, 1),\ X \sim \chi^2_\nu \text{ are independent.}$$

Similarly, $\nu$ is referred to as the degrees of freedom of the $t$ distribution.

Density curves of the $t$-family are all unimodal, symmetric and bell-shaped, like those of the normal distributions. Below are some results about $T \sim t(\nu)$:

• Density: $f(x) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\Gamma(\frac{\nu}{2})}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < x < \infty$
• Properties: $E(T) = 0$, $\mathrm{Var}(T) = \frac{\nu}{\nu-2}$ (when $\nu > 2$).


Snedecor's F distribution

This is the distribution of a random variable of the form
$$X = \frac{X_1/d_1}{X_2/d_2}, \quad \text{where } X_1 \sim \chi^2_{d_1},\ X_2 \sim \chi^2_{d_2} \text{ are independent.}$$

What we know about $X \sim F(d_1, d_2)$:

• Density: $f_X(x) = \frac{1}{B(\frac{d_1}{2}, \frac{d_2}{2})}\left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1}\left(1 + \frac{d_1}{d_2}x\right)^{-\frac{d_1+d_2}{2}}, \quad x > 0$
• $E(X) = \frac{d_2}{d_2-2}$ (if $d_2 > 2$), and $\mathrm{Var}(X) = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$ (if $d_2 > 4$)
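These three distributions enter the inference procedures below only through their critical values, which R provides directly; a quick sketch (the level $\alpha = 0.05$ and the degrees of freedom are arbitrary examples):

```r
alpha <- 0.05; n <- 507
qt(1 - alpha/2, df = n - 2)          # t_{alpha/2, n-2}, for t tests and CIs
qf(1 - alpha, df1 = 1, df2 = n - 2)  # F_{alpha, 1, n-2}, for the ANOVA F test
qchisq(1 - alpha/2, df = n - 2)      # chi^2_{alpha/2, n-2} (upper critical value)
qchisq(alpha/2, df = n - 2)          # chi^2_{1-alpha/2, n-2} (lower critical value)
```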



Additional normality assumption on the errors

To perform the hypothesis testing and interval estimation tasks in regression, we need to assume additionally that the errors $\varepsilon_i$ are iid $N(0, \sigma^2)$. This implies that
$$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \quad i = 1, \ldots, n,$$
and they are independent (but not identically distributed).

Since $\hat\beta_1$ is a linear combination of the random variables $y_i$, under the additional assumption we have
$$\hat\beta_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right).$$


Hypothesis testing in regression

Consider first the following hypothesis test about the slope parameter:

$$H_0: \beta_1 = \beta_{10} \quad \text{vs} \quad H_1: \beta_1 \neq \beta_{10},$$
where $\beta_{10}$ represents a particular value (e.g., 0) that $\beta_1$ might take. Under the normality assumption on the errors, we have the following result.

Theorem 0.4. At level $\alpha$, a rejection region of the above test is
$$\frac{|\hat\beta_1 - \beta_{10}|}{\sqrt{\sigma^2/S_{xx}}} > z_{\alpha/2} \ \text{ if } \sigma^2 \text{ is known}; \qquad \frac{|\hat\beta_1 - \beta_{10}|}{\sqrt{MS_{Res}/S_{xx}}} > t_{\alpha/2,\,n-2} \ \text{ if } \sigma^2 \text{ is unknown}.$$

Proof. When $H_0$ is true, the distribution of $\hat\beta_1$ is
$$\hat\beta_1 \sim N\!\left(\beta_{10}, \frac{\sigma^2}{S_{xx}}\right).$$

Therefore, we can write down the following decision rule (at level $\alpha$):
$$\frac{|\hat\beta_1 - \beta_{10}|}{\sqrt{\sigma^2/S_{xx}}} > z_{\alpha/2}.$$

When $\sigma^2$ is unknown, we need to use its estimator $MS_{Res}$ instead. This leads to a $t$ test:
$$\frac{|\hat\beta_1 - \beta_{10}|}{\sqrt{MS_{Res}/S_{xx}}} > t_{\alpha/2,\,n-2}.$$


Remark. $\sqrt{\sigma^2/S_{xx}}$ is the standard deviation of $\hat\beta_1$, while $\sqrt{MS_{Res}/S_{xx}}$ is called the standard error of $\hat\beta_1$:
$$\mathrm{Std}(\hat\beta_1) = \sqrt{\sigma^2/S_{xx}}, \qquad \mathrm{se}(\hat\beta_1) = \sqrt{MS_{Res}/S_{xx}}.$$

Depending on whether $\sigma^2$ is given, the test statistic needed is
$$Z_0 = \frac{\hat\beta_1 - \beta_{10}}{\mathrm{Std}(\hat\beta_1)} \ (\sigma^2 \text{ known}), \qquad t_0 = \frac{\hat\beta_1 - \beta_{10}}{\mathrm{se}(\hat\beta_1)} \ (\sigma^2 \text{ unknown}),$$
with corresponding decision rule:
$$|Z_0| > z_{\alpha/2} \ (\sigma^2 \text{ known}), \qquad |t_0| > t_{\alpha/2,\,n-2} \ (\sigma^2 \text{ unknown}).$$


Remark. An important special case of the above hypothesis test is when

$\beta_{10} = 0$, which concerns the significance of regression:

$H_0: \beta_1 = 0$ (there is no linear relationship between $y$ and $x$)

$H_1: \beta_1 \neq 0$ (there is a linear relationship between $y$ and $x$)



Example 0.5 (weight-height). From the R output, we see that

• The value of the $t$ statistic for testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$ is $t_0 = 23.14$;

• The p-value of the test is less than 2e-16.

Thus, we can reject $H_0$ (at level 1%) and correspondingly conclude that there is a significant linear relationship between $x$ and $y$.
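These numbers can be read off (or recomputed) from the fitted model; a sketch reusing the hypothetical fit object from Example 0.2 (the row name height is an assumption):

```r
tab <- coef(summary(fit))  # estimates, standard errors, t values, p-values
tab["height", "t value"]   # t0 for the slope (about 23.14 for these data)
tab["height", "Estimate"] / tab["height", "Std. Error"]  # the same, by hand
```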

Another approach to testing the significance of regression is through the analysis of variance (ANOVA):
$$SS_T = SS_R + SS_{Res}, \quad \text{with d.o.f.:} \quad n - 1 = 1 + (n - 2).$$
We have previously defined the residual mean square
$$MS_{Res} = \frac{SS_{Res}}{n - 2}, \quad \text{with } E(MS_{Res}) = \sigma^2.$$
Define also the regression mean square
$$MS_R = SS_R / 1.$$
It can be shown that
$$E(MS_R) = \sigma^2 + \beta_1^2 S_{xx}.$$


Observation: $MS_R$ contains information about $\beta_1$.

• $E(MS_R) = E(MS_{Res})$ if $\beta_1 = 0$;
• $E(MS_R) > E(MS_{Res})$ if $\beta_1 \neq 0$.

As a result, large values of their ratio
$$F_0 = \frac{MS_R}{MS_{Res}} = \frac{SS_R/1}{SS_{Res}/(n-2)} \overset{H_0 \text{ true}}{\sim} F_{1,\,n-2}$$
are evidence against $H_0: \beta_1 = 0$. Therefore, we have the following significance-of-regression test:
$$\text{Reject } H_0: \beta_1 = 0 \text{ if and only if } F_0 > F_{\alpha,\,1,\,n-2}.$$


The ANOVA procedure is summarized in the following table.

Source of variation | Sum of squares | Degrees of freedom | Mean square | Test statistic
Regression | $SS_R = \hat\beta_1^2 S_{xx}$ | $1$ | $MS_R$ | $F_0 = \frac{MS_R}{MS_{Res}}$
Residual | $SS_{Res}$ | $n - 2$ | $MS_{Res}$ | 
Total | $SS_T$ | $n - 1$ |  | 


Example 0.6 (weight-height). From the R output, we see that

• The $F$ statistic for testing $H_0: \beta_1 = 0$ against a two-sided alternative is $F_0 = 535.2$ with 1 and 505 degrees of freedom;

• The p-value of the test is less than 2.2e-16.

Thus, we can conclude that $\beta_1 \neq 0$, i.e., there is a significant linear relationship between $x$ and $y$.

A more direct way of performing ANOVA in R is to use the anova function:
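A sketch of the call (again with the hypothetical fit from Example 0.2):

```r
anova(fit)  # Df, Sum Sq, Mean Sq, F value and Pr(>F) for the regression
            # (height) row and the Residuals row, as in the table above
```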


Remark. The ANOVA $F$ test is equivalent to the (two-sided) $t$ test regarding whether $\beta_1 = 0$ or not:
$$t_0^2 = \frac{\hat\beta_1^2}{MS_{Res}/S_{xx}} = \frac{\hat\beta_1^2 S_{xx}}{MS_{Res}} = \frac{SS_R/1}{SS_{Res}/(n-2)} = F_0.$$
However, when one-sided alternatives such as
$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 > 0$$
are used, only the $t$ test can be used:
$$t_0 = \frac{\hat\beta_1 - 0}{\sqrt{MS_{Res}/S_{xx}}} > t_{\alpha,\,n-2} \quad (\sigma^2 \text{ unknown}).$$


For the hypothesis test about the intercept parameter $\beta_0$,
$$H_0: \beta_0 = \beta_{00} \quad \text{vs} \quad H_1: \beta_0 \neq \beta_{00},$$
we have the following result.

Theorem 0.5. At level $\alpha$, a rejection region of the test is
$$\frac{|\hat\beta_0 - \beta_{00}|}{\sqrt{\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}} > z_{\alpha/2} \ \text{ if } \sigma^2 \text{ is known}; \qquad \frac{|\hat\beta_0 - \beta_{00}|}{\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}} > t_{\alpha/2,\,n-2} \ \text{ if } \sigma^2 \text{ is unknown}.$$


Remark. The previous R output also contains the results of the corresponding $t$-test for

$H_0: \beta_0 = 0$ (the regression line passes through the origin)

$H_1: \beta_0 \neq 0$ (the regression line does not pass through the origin)



Summary: hypothesis testing in regression

We covered the following tests with corresponding decision rules:

• $H_0: \beta_1 = \beta_{10}$ vs $H_1: \beta_1 \neq \beta_{10}$: reject if $\frac{|\hat\beta_1 - \beta_{10}|}{\sqrt{MS_{Res}/S_{xx}}} > t_{\alpha/2,\,n-2}$

• Significance of regression test ($H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$)
  – $t$ test: $\frac{|\hat\beta_1|}{\sqrt{MS_{Res}/S_{xx}}} > t_{\alpha/2,\,n-2}$
  – ANOVA $F$ test: $\frac{MS_R}{MS_{Res}} > F_{\alpha,\,1,\,n-2}$

• $H_0: \beta_0 = \beta_{00}$ vs $H_1: \beta_0 \neq \beta_{00}$: reject if $\frac{|\hat\beta_0 - \beta_{00}|}{\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}} > t_{\alpha/2,\,n-2}$


Interval estimation in regression

Under the normality assumptions, the $1 - \alpha$ CIs for $\beta_0, \beta_1$ are

• $\hat\beta_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}$

• $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}/S_{xx}}$

This is implemented in R through the confint function:
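A sketch of the call, using the hypothetical fit from Example 0.2:

```r
confint(fit, level = 0.95)  # 95% CIs for the intercept and the slope
```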

We next construct a $1 - \alpha$ confidence interval for the noise variance $\sigma^2$.


Theorem 0.6. Under the normality assumptions, a level $1 - \alpha$ confidence interval for $\sigma^2$ is
$$\left(\frac{(n-2)\,MS_{Res}}{\chi^2_{\frac{\alpha}{2},\,n-2}},\ \frac{(n-2)\,MS_{Res}}{\chi^2_{1-\frac{\alpha}{2},\,n-2}}\right).$$

Proof. It can be shown that
$$\frac{SS_{Res}}{\sigma^2} = \frac{(n-2)\,MS_{Res}}{\sigma^2} \sim \chi^2_{n-2}.$$
Thus,
$$1 - \alpha = P\!\left(\chi^2_{1-\frac{\alpha}{2},\,n-2} < \frac{(n-2)\,MS_{Res}}{\sigma^2} < \chi^2_{\frac{\alpha}{2},\,n-2}\right).$$
Solving the inequalities for $\sigma^2$ yields the desired result.

Example 0.7 (weight-height). A 95% confidence interval for $\sigma^2$ is
$$\left(\frac{505\,MS_{Res}}{\chi^2_{.025,\,505}},\ \frac{505\,MS_{Res}}{\chi^2_{.975,\,505}}\right) = \left(\frac{505 \cdot 9.308^2}{569.1608},\ \frac{505 \cdot 9.308^2}{444.6268}\right) = (76.87,\ 98.40).$$

R commands:
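A sketch of the computation (the quantile calls reproduce the two critical values above):

```r
MSRes <- 9.308^2
n <- 507
(n - 2) * MSRes / qchisq(c(0.975, 0.025), df = n - 2)
# qchisq(0.975, 505) = 569.1608 and qchisq(0.025, 505) = 444.6268,
# giving the interval (76.87, 98.40)
```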


The mean response

A major use of a regression model is to estimate the mean response at a particular location $x = x_0$:
$$E(y \mid x_0) = \beta_0 + \beta_1 x_0.$$
Under the normality assumption, we obtain the following result.

Theorem 0.7. A $1 - \alpha$ confidence interval for $E(y \mid x_0)$ is
$$(\hat\beta_0 + \hat\beta_1 x_0) \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}.$$


Proof. The point estimator of the mean response, $\hat\beta_0 + \hat\beta_1 x_0$, is a linear combination of the responses $y_i$, thus having a normal distribution with mean
$$E(\hat\beta_0 + \hat\beta_1 x_0) = \beta_0 + \beta_1 x_0$$
and variance (using $\mathrm{Cov}(\bar y, \hat\beta_1) = 0$ from earlier)
$$\begin{aligned} \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_0) &= \mathrm{Var}\big(\bar y + \hat\beta_1 (x_0 - \bar x)\big) \\ &= \mathrm{Var}(\bar y) + \mathrm{Var}(\hat\beta_1)(x_0 - \bar x)^2 \\ &= \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{xx}}(x_0 - \bar x)^2 \\ &= \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right). \end{aligned}$$


It follows that
$$\frac{(\hat\beta_0 + \hat\beta_1 x_0) - (\beta_0 + \beta_1 x_0)}{\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}} \sim t_{n-2},$$
and consequently we can use the following equality
$$1 - \alpha = P\!\left(-t_{\frac{\alpha}{2},\,n-2} < \frac{(\hat\beta_0 + \hat\beta_1 x_0) - (\beta_0 + \beta_1 x_0)}{\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}} < t_{\frac{\alpha}{2},\,n-2}\right)$$
to construct a level $1 - \alpha$ confidence interval for $\beta_0 + \beta_1 x_0$.


Remark. The confidence interval for the mean response is the shortest at the location $x_0 = \bar x$ and becomes wider as $x_0$ moves away from $\bar x$ in either direction.
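In R, such intervals come from predict with interval = "confidence" (a sketch with the hypothetical fit from Example 0.2; the locations in new_x are arbitrary):

```r
new_x <- data.frame(height = c(160, 170, 180))  # hypothetical locations x0
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
# columns: fit (point estimate), lwr and upr (CI endpoints) at each x0
```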


Prediction of new observations

Another way of using a regression model is to develop a prediction interval for a future observation at some specified location $x = x_0$:
$$y_0 = \beta_0 + \beta_1 x_0 + \varepsilon_0, \qquad \varepsilon_0 \sim N(0, \sigma^2).$$

Theorem 0.8. A $1 - \alpha$ prediction interval for the response $y_0$ at $x = x_0$ is
$$(\hat\beta_0 + \hat\beta_1 x_0) \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}.$$


Proof. First, note that a point estimator for the fixed component of $y_0$ (i.e., $\beta_0 + \beta_1 x_0$) is
$$\hat y_0 = \hat\beta_0 + \hat\beta_1 x_0.$$

Let $\Psi = y_0 - \hat y_0$ be the difference between the true response and the point estimator for its fixed part. Then $\Psi$ (as a linear combination of $y_0, y_1, \ldots, y_n$) is normally distributed with mean
$$E(\Psi) = E(y_0) - E(\hat y_0) = (\beta_0 + \beta_1 x_0) - (\beta_0 + \beta_1 x_0) = 0$$
and variance (since the future observation $y_0$ is independent of $\hat y_0$)
$$\mathrm{Var}(\Psi) = \mathrm{Var}(y_0) + \mathrm{Var}(\hat y_0) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right).$$


We then have
$$\frac{y_0 - \hat y_0}{\sqrt{\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}} \sim N(0, 1),$$
and correspondingly,
$$\frac{y_0 - \hat y_0}{\sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}} \sim t_{n-2}.$$

Accordingly, a $1 - \alpha$ prediction interval on a future observation $y_0$ at $x_0$ is
$$(\hat\beta_0 + \hat\beta_1 x_0) \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}.$$


Remark. The prediction interval for the response at all locations has a similar pattern to the confidence interval for the mean response but is much wider.
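The corresponding R call just swaps the interval type (same assumed fit and new_x as before):

```r
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
# same point estimates, but noticeably wider lwr/upr limits
```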


Summary: interval estimation in regression

• $\beta_0$ (intercept): $\hat\beta_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}$

• $\beta_1$ (slope): $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}/S_{xx}}$

• $\sigma^2$ (error variance): $\left(\frac{(n-2)\,MS_{Res}}{\chi^2_{\frac{\alpha}{2},\,n-2}},\ \frac{(n-2)\,MS_{Res}}{\chi^2_{1-\frac{\alpha}{2},\,n-2}}\right)$

• $E(y \mid x_0)$ (mean response): $(\hat\beta_0 + \hat\beta_1 x_0) \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}$

• $y_0$ (response): $(\hat\beta_0 + \hat\beta_1 x_0) \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}$


Some considerations in the use of regression

Read Section 2.9 to understand the following issues (they will be covered in more depth later in this course):

• Extrapolation

• Influential points

• Outliers

• Correlation does not imply causation

Further learning

• 2.10 Regression Through the Origin
• 2.11 Maximum Likelihood Estimation
• 2.12 Case Where the Regressor x Is Random
• Linear regression via gradient descent
• Weighted least squares (a short R sketch follows below):
$$S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n w_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$
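A minimal sketch of the weighted criterion in R, with weights chosen arbitrarily for illustration; lm accepts them through its weights argument:

```r
w <- 1 / (1 + x_toy)^2                  # hypothetical positive weights
wfit <- lm(y_toy ~ x_toy, weights = w)  # minimizes sum_i w_i * (residual_i)^2
coef(wfit)
```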
