
00:17 Wednesday 16th September, 2015 See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/

Lecture 4: Simple Models, with Hints at Their Estimation

36-401, Fall 2015, Section B 10 September 2015

1 The Model

Let's recall the simple linear regression model from last time. This is a model with two variables X and Y, where we try to predict Y from X. The assumptions of the model are as follows:

1. The distribution of X is arbitrary (and perhaps X is even non-random).

2. If X = x, then Y = β0 + β1x + ε, for some constants ("coefficients", "parameters") β0 and β1, and some random noise variable ε.

3. E[ε|X = x] = 0 (no matter what x is) and Var[ε|X = x] = σ2 (no matter what x is).

4. ε is uncorrelated across observations.

To elaborate: with multiple data points (X1, Y1), (X2, Y2), ..., (Xn, Yn), the model says that, for each i ∈ 1 : n,

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i    (1)

where the noise variables εi all have the same expectation (0) and the same variance (σ2), and Cov[εi, εj] = 0 (unless i = j, of course).
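To make these assumptions concrete, here is a small R sketch (my addition, not part of the original notes) which simulates data obeying them; the particular coefficient and noise values are arbitrary, and the noise is deliberately non-Gaussian, since the simple model does not require Gaussianity.

# Simulate from the simple linear regression model (illustrative values only)
n <- 100
beta.0 <- 5; beta.1 <- -2; sigma <- 3
x <- runif(n, 0, 10)                    # the distribution of X is arbitrary
epsilon <- sigma * (rexp(n) - 1)        # mean 0, variance sigma^2, not Gaussian
y <- beta.0 + beta.1 * x + epsilon
plot(x, y)
abline(beta.0, beta.1, lty = "dashed")  # the true regression line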

1.1 "Plug-In" Estimates

In lecture 1, we saw that the optimal linear predictor of Y from X has slope β1 = Cov[X, Y]/Var[X] and intercept β0 = E[Y] − β1 E[X]. A common tactic in devising estimators is to use what's sometimes called the "plug-in principle", where we find equations for the parameters which would hold if we knew the full distribution, and "plug in" the sample versions of the population quantities. We saw this in the last lecture, where we estimated β1 by the ratio of the sample covariance to the sample variance:

\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}    (2)


We also saw, in the notes to the last lecture, that so long as the law of large numbers holds,

\hat{\beta}_1 \rightarrow \beta_1    (3)

as n → ∞. It follows easily that

\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X}    (4)

will also converge on β0.
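As a quick illustration (again my addition, using the hypothetical simulated x and y from the sketch above), the plug-in estimates take one line each in R; note that cov() and var() both use an n − 1 denominator, which cancels in the ratio.

# Plug-in estimates of the slope and intercept
beta.1.hat <- cov(x, y) / var(x)              # c_XY / s^2_X; the (n-1)'s cancel
beta.0.hat <- mean(y) - beta.1.hat * mean(x)
c(beta.0.hat, beta.1.hat)                     # compare to the true (beta.0, beta.1)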

1.2 Least Squares Estimates

An alternative way of estimating the simple linear regression model starts from the objective we are trying to reach, rather than from the formula for the slope. Recall, from lecture 1, that the true optimal slope and intercept are the ones which minimize the expected squared error:

(\beta_0, \beta_1) = \mathrm{argmin}_{(b_0, b_1)} E\left[ (Y - (b_0 + b_1 X))^2 \right]    (5)

This is a function of the complete distribution, so we can't get it from data, but we can approximate it with data. The in-sample, empirical, or training MSE is

\widehat{MSE}(b_0, b_1) \equiv \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2    (6)

Notice that this is a function of b0 and b1; it is also, of course, a function of the data, (x1, y1), (x2, y2), ..., (xn, yn), but we will generally suppress that in our notation. If our samples are all independent, then for any fixed (b0, b1), the law of large numbers tells us that \widehat{MSE}(b_0, b_1) \rightarrow MSE(b_0, b_1) as n → ∞. So it doesn't seem unreasonable to try minimizing the in-sample error, which we can compute, as a proxy for minimizing the true MSE, which we can't. Where does it lead us? Start by taking the derivatives with respect to the intercept and the slope:

\frac{\partial \widehat{MSE}}{\partial b_0} = \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2)    (7)

\frac{\partial \widehat{MSE}}{\partial b_1} = \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2 x_i)    (8)

Set these to zero at the optimum (\hat{\beta}_0, \hat{\beta}_1):

\frac{1}{n}\sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0    (9)

\frac{1}{n}\sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) x_i = 0


These are often called the normal equations for least-squares estimation, or the estimating equations: a system of two equations in two unknowns, whose solution gives the estimate. Many people would, at this point, remove the factor of 1/n, but I think it makes it easier to understand the next steps:

\bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0    (10)

\overline{xy} - \hat{\beta}_0 \bar{x} - \hat{\beta}_1 \overline{x^2} = 0    (11)

The first equation, re-written, gives

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}    (12)

Substituting this into the remaining equation,

0 = \overline{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \overline{x^2}    (13)

0 = c_{XY} - \hat{\beta}_1 s^2_X    (14)

\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}    (15)

That is, the least-squares estimate of the slope is our old friend the plug-in estimate of the slope, and thus the least-squares intercept is also the plug-in intercept.
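To see this equivalence numerically, here is a minimal sketch (mine, not from the notes, again assuming the simulated x and y from above) which minimizes the in-sample MSE with a general-purpose optimizer and compares the answer to the plug-in formulas; the two should agree up to the optimizer's tolerance.

# Least squares by direct numerical minimization of the in-sample MSE
mse.hat <- function(b, x, y) { mean((y - (b[1] + b[2] * x))^2) }
ls.fit <- optim(par = c(0, 0), fn = mse.hat, x = x, y = y)
ls.fit$par                                      # (b0, b1) minimizing the in-sample MSE
c(mean(y) - (cov(x, y)/var(x)) * mean(x),       # plug-in intercept
  cov(x, y)/var(x))                             # plug-in slope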

Going forward The equivalence between the plug-in estimator and the least-squares estimator is a bit of a special case for linear models. In some non-linear models, least squares is quite feasible (though the optimum can only be found numerically, not in closed form); in others, plug-in estimates are more useful than optimization.

1.3 Bias, Variance and Standard Error of Parameter Estimates

Whether we think of it as deriving from plugging in or from least squares, let us work out some of the properties of this estimator of the coefficients, using the model assumptions. We'll start with the slope, β̂1.


\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}    (16)

= \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{s^2_X}    (17)

= \frac{\frac{1}{n}\sum_{i=1}^{n} x_i(\beta_0 + \beta_1 x_i + \epsilon_i) - \bar{x}(\beta_0 + \beta_1 \bar{x} + \bar{\epsilon})}{s^2_X}    (18)

= \frac{\beta_0 \bar{x} + \beta_1 \overline{x^2} + \frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\beta_0 - \beta_1 \bar{x}^2 - \bar{x}\bar{\epsilon}}{s^2_X}    (19)

= \frac{\beta_1 s^2_X + \frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\bar{\epsilon}}{s^2_X}    (20)

= \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n} x_i \epsilon_i - \bar{x}\bar{\epsilon}}{s^2_X}    (21)

Since \bar{x}\bar{\epsilon} = n^{-1}\sum_i \bar{x}\epsilon_i,

\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})\epsilon_i}{s^2_X}    (22)

This representation of the slope estimate shows that it is equal to the true slope (β1) plus something which depends on the noise terms (the εi, and their sample average ε̄). We'll use this to find the expected value and the variance of the estimator β̂1. In the next couple of paragraphs, I am going to treat the xi as non-random variables. This is appropriate in "designed" or "controlled" experiments, where we get to choose their values. In randomized experiments or in observational studies, obviously the xi aren't necessarily fixed; however, these expressions will be correct for the conditional expectation E[β̂1 | x1, ..., xn] and conditional variance Var[β̂1 | x1, ..., xn], and I will come back to how we get the unconditional expectation and variance.

Expected value and bias Recall that E[εi | Xi] = 0, so

\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})\, E[\epsilon_i] = 0    (23)

Thus,

E[\hat{\beta}_1] = \beta_1    (24)

Since the bias of an estimator is the difference between its expected value and the truth, β̂1 is an unbiased estimator of the optimal slope.


(To repeat what I'm sure you remember from mathematical statistics: "bias" here is a technical term, meaning no more and no less than E[β̂1] − β1. An unbiased estimator could still make systematic mistakes; for instance, it could underestimate 99% of the time, provided that the 1% of the time it over-estimates, it does so by much more than it under-estimates. Moreover, unbiased estimators are not necessarily superior to biased ones: the total error depends on both the bias of the estimator and its variance, and there are many situations where you can remove lots of variance at the cost of adding a little bias. Least squares for simple linear regression happens not to be one of them, but you shouldn't expect that as a general rule.)

Turning to the intercept,

E[\hat{\beta}_0] = E[\overline{Y} - \hat{\beta}_1 \overline{X}]    (25)

= \beta_0 + \beta_1 \overline{X} - E[\hat{\beta}_1]\,\overline{X}    (26)

= \beta_0 + \beta_1 \overline{X} - \beta_1 \overline{X}    (27)

= \beta_0    (28)

so it, too, is unbiased.

Variance and Standard Error Using the formula for the variance of a sum from lecture 1, and the model assumption that all the εi are uncorrelated with each other,

\mathrm{Var}[\hat{\beta}_1] = \mathrm{Var}\left[ \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i}{s^2_X} \right]    (29)

= \mathrm{Var}\left[ \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i}{s^2_X} \right]    (30)

= \frac{\frac{1}{n^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\, \mathrm{Var}[\epsilon_i]}{(s^2_X)^2}    (31)

= \frac{\frac{\sigma^2}{n} s^2_X}{(s^2_X)^2}    (32)

= \frac{\sigma^2}{n s^2_X}    (33)

In words, this says that the variance of the slope estimate goes up as the noise around the regression line (σ2) gets bigger, and goes down as we have more observations (n) which are further spread out along the horizontal axis (s^2_X); it should not be surprising that it's easier to work out the slope of a line from many, well-separated points on the line than from a few points smushed together. The standard error of an estimator is just its standard deviation, or the square root of its variance:

\mathrm{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n s^2_X}}    (34)


I will leave working out the variance of β̂0 as an exercise.
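These calculations are easy to check by simulation. The sketch below (my addition) fixes a set of x values, repeatedly draws fresh noise, re-estimates the slope, and compares the Monte Carlo average and standard deviation of β̂1 with β1 and with σ/√(n s^2_X); Gaussian noise is used purely for convenience, since the bias and variance results above do not require it.

# Monte Carlo check of unbiasedness and the standard error formula
n <- 50; beta.0 <- 5; beta.1 <- -2; sigma <- 3
x <- runif(n, 0, 10)                        # held fixed across replications
s2.x <- mean((x - mean(x))^2)               # s^2_X, with the n denominator
slope.hats <- replicate(1e4, {
    y <- beta.0 + beta.1 * x + rnorm(n, 0, sigma)
    cov(x, y) / var(x)
})
mean(slope.hats) - beta.1                   # should be near 0 (unbiased)
c(sd(slope.hats), sigma / sqrt(n * s2.x))   # should nearly agree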

Unconditional-on-X Properties The last few paragraphs, as I said, have looked at the expectation and variance of β̂1 conditional on x1, ..., xn, either because the x's really are non-random (e.g., controlled by us), or because we're just interested in conditional inference. If we do care about unconditional properties, then we still need to find E[β̂1] and Var[β̂1], not just E[β̂1 | x1, ..., xn] and Var[β̂1 | x1, ..., xn]. Fortunately, this is easy, so long as the simple linear regression model holds. To get the unconditional expectation, we use the "law of total expectation":

E[\hat{\beta}_1] = E\left[ E[\hat{\beta}_1 \mid X_1, \ldots, X_n] \right]    (35)

= E[\beta_1] = \beta_1    (36)

That is, the estimator is unconditionally unbiased. To get the unconditional variance, we use the "law of total variance":

\mathrm{Var}[\hat{\beta}_1] = E\left[ \mathrm{Var}[\hat{\beta}_1 \mid X_1, \ldots, X_n] \right] + \mathrm{Var}\left[ E[\hat{\beta}_1 \mid X_1, \ldots, X_n] \right]    (37)

= E\left[ \frac{\sigma^2}{n s^2_X} \right] + \mathrm{Var}[\beta_1]    (38)

= \frac{\sigma^2}{n} E\left[ \frac{1}{s^2_X} \right]    (39)
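Continuing the simulation sketch above (still my addition), redrawing the x's on every replication checks this unconditional formula: the Monte Carlo variance of β̂1 should be close to (σ^2/n) E[1/s^2_X], with that expectation approximated by averaging 1/s^2_X over the simulated samples.

# Unconditional variance check: X is redrawn each time
sims <- replicate(1e4, {
    x <- runif(n, 0, 10)
    y <- beta.0 + beta.1 * x + rnorm(n, 0, sigma)
    c(cov(x, y) / var(x), 1 / mean((x - mean(x))^2))   # slope estimate, 1/s^2_X
})
c(var(sims[1, ]), (sigma^2 / n) * mean(sims[2, ]))     # should nearly agree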

1.4 Parameter Interpretation; Causality

Two of the parameters are easy to interpret. σ2 is the variance of the noise around the regression line; σ is a typical distance of a point from the line. ("Typical" here in a special sense: it's the root-mean-squared distance, rather than, say, the average absolute distance.)

β0 is simply the expected value of Y when X is 0, E[Y | X = 0]. The point X = 0 usually has no special significance, but this setting does ensure that the line goes through the point (E[X], E[Y]). The interpretation of the slope is both very straightforward and very tricky. Mathematically, it's easy to convince yourself that, for any x,

\beta_1 = E[Y \mid X = x] - E[Y \mid X = x - 1]    (40)

or, for any x1, x2,

\beta_1 = \frac{E[Y \mid X = x_2] - E[Y \mid X = x_1]}{x_2 - x_1}    (41)

This is just saying that the slope of a line is "rise/run".


The tricky part is that we have a very strong, natural tendency to interpret this as telling us something about causation: "If we change X by 1, then on average Y will change by β1". This interpretation is usually completely unsupported by the analysis. If I use an old-fashioned mercury thermometer, the height of mercury in the tube usually has a nice linear relationship with the temperature of the room the thermometer is in. This linear relationship goes both ways, so we could regress temperature (Y) on mercury height (X). But if I manipulate the height of the mercury (say, by changing the ambient pressure, or shining a laser into the tube, etc.), changing the height X will not, in fact, change the temperature outside.

The right way to interpret β1 is not as the result of a change, but as an expected difference. The correct catch-phrase would be something like "If we select two sets of cases from the un-manipulated distribution where X differs by 1, we expect Y to differ by β1." This covers the thermometer example, and every other one I can think of. It is, I admit, much more inelegant than "If X changes by 1, Y changes by β1 on average", but it has the advantage of being true, which the other does not. There are circumstances where regression can be a useful part of causal inference, but we will need a lot more tools to grasp them; that will come towards the end of 402.

2 The Gaussian-Noise Simple Linear Regression Model

We have, so far, assumed comparatively little about the noise term ε. The advantage of this is that our conclusions apply to lots of different situations; the drawback is that there's really not all that much more to say about our estimator β̂ or our predictions than we've already gone over. If we made more detailed assumptions about ε, we could make more precise inferences. There are lots of forms of distributions for ε which we might contemplate, and which are compatible with the assumptions of the simple linear regression model (Figure 1). The one which has become the most common over the last two centuries is to assume ε follows a Gaussian distribution. The result is the Gaussian-noise simple linear regression model¹:

1. The distribution of X is arbitrary (and perhaps X is even non-random).

2. If X = x, then Y = β0 + β1x + ε, for some constants ("coefficients", "parameters") β0 and β1, and some random noise variable ε.

3. ε ∼ N(0, σ2), independent of X.

¹ Our textbook, rather old-fashionedly, calls this the "normal error" model rather than "Gaussian noise". I dislike this: "normal" is an over-loaded word in math, while "Gaussian" is (comparatively) specific; "error" made sense in Gauss's original context of modeling, specifically, errors of observation, but is misleading generally; and calling Gaussian distributions "normal" suggests they are much more common than they really are.

[Figure 1 here: six panels showing candidate noise densities, each plotted over ε from −3 to 3; the code and caption follow.]
ε ε ε par(mfrow=c(2,3)) curve(dnorm(x), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) curve(exp(-abs(x))/2, from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) curve(sqrt(pmax(0,1-x^2))/(pi/2), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) curve(dt(x,3), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) curve(dgamma(x+1.5, shape=1.5, scale=1), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) curve(0.5*dgamma(x+1.5, shape=1.5, scale=1)+ 0.5*dgamma(0.5-x, shape=0.5, scale=1), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1)) par(mfrow=c(1,1))

Figure 1: Some possible noise distributions for the simple linear regression model, since all have E[ε] = 0, and could get any variance by scaling. (The model is even compatible with each observation taking ε from a different distribution.) From top left to bottom right: Gaussian; double-exponential ("Laplacian"); "circular" distribution; t with 3 degrees of freedom; a gamma distribution (shape 1.5, scale 1) shifted to have mean 0; mixture of two gammas with shape 1.5 and shape 0.5, each off-set to have expectation 0. The first three were all used as error models in the 18th and 19th centuries. (See the source file for how to get the code below the figure.)

4. ε is independent across observations.

You will notice that these assumptions are strictly stronger than those of the simple linear regression model. More exactly, the first two assumptions are the same, while the third and fourth assumptions of the Gaussian-noise model imply the corresponding assumptions of the other model. This means that everything we have done so far directly applies to the Gaussian-noise model. On the other hand, the stronger assumptions let us say more. They tell us, exactly, the conditional distribution of Y given X, and so will let us get exact distributions for predictions and for other inferential statistics.

Why the Gaussian noise model? Why should we think that the noise around the regression line would follow a Gaussian distribution, independent of X? There are two big reasons.

1. The noise might be due to adding up the effects of lots of little random causes, all nearly independent of each other and of X, where each of the effects is of roughly similar magnitude. Then the central limit theorem will take over, and the distribution of the sum of effects will indeed be pretty Gaussian (see the small simulation below). For Gauss's original context, X was (simplifying) "Where is such-and-such-a-planet in space?", Y was "Where does an astronomer record the planet as appearing in the sky?", and noise came from defects in the telescope, eye-twitches, atmospheric distortions, etc., etc., so this was pretty reasonable. It is clearly not a universal truth of nature, however, or even something we should expect to hold true as a general rule, as the name "normal" suggests.

2. Mathematical convenience. Assuming Gaussian noise lets us work out a very complete theory of inference and prediction for the model, with lots of closed-form answers to questions like "What is the optimal estimate of the variance?" or "What is the probability that we'd see a fit this good from a line with a non-zero intercept if the true line goes through the origin?", etc., etc. Answering such questions without the Gaussian-noise assumption needs somewhat more advanced techniques, and much more advanced computing; we'll get to it towards the end of the class.
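As a tiny illustration of reason 1 (my addition, not in the notes), the sketch below builds each noise term as the sum of many small, independent, decidedly non-Gaussian effects; a normal quantile plot of the resulting noise is nearly straight, as the central limit theorem predicts.

# Each "noise" value is the sum of 500 small independent uniform effects
noise <- replicate(1e4, sum(runif(500, min = -0.1, max = 0.1)))
qqnorm(noise)
qqline(noise)   # points hugging the line indicate an approximately Gaussian shape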

2.1 Visualizing the Gaussian Noise Model

The Gaussian noise model gives us not just an expected value for Y at each x, but a whole conditional distribution for Y at each x. To visualize it, then, it's not enough to just sketch a curve; we need a three-dimensional surface, showing, for each combination of x and y, the probability density of Y around that y given that x. Figure 2 illustrates.

2.2 Maximum Likelihood vs. Least Squares

As you remember from your mathematical statistics class, the likelihood of a parameter value on a data set is the probability density at the data under


[Figure 2 here: a perspective plot of the conditional density p(y | x) over x from −1 to 3 and y from −5 to 5; the code and caption follow.]

# Grid of x and y values, and the conditional density of Y given X = x
x.seq <- seq(from=-1, to=3, length.out=150)
y.seq <- seq(from=-5, to=5, length.out=150)
cond.pdf <- function(x,y) { dnorm(y, mean=10-5*x, sd=0.5) }
z <- outer(x.seq, y.seq, cond.pdf)   # density at every (x, y) combination
persp(x.seq, y.seq, z, ticktype="detailed", phi=75,
      xlab="x", ylab="y", zlab=expression(p(y|x)), cex.axis=0.8)

Figure 2: Illustrating how the conditional pdf of Y varies as a function of X, for a hypothetical Gaussian noise simple linear regression where β0 = 10, β1 = −5, and σ2 = (0.5)2. The perspective is adjusted so that we are looking nearly straight down from above on the surface. (Can you find a better viewing angle?) See help(persp) for the 3D plotting (especially the examples), and help(outer) for the outer function, which takes all combinations of elements from two vectors and pushes them through a function. How would you modify this so that the regression line went through the origin with a slope of 4/3 and a standard deviation of 5?

those parameters. We could not work with the likelihood in the simple linear regression model, because it didn't specify enough about the distribution to let us calculate a density. With the Gaussian-noise model, however, we can write down a likelihood.² By the model's assumptions, if we think the parameters are b0, b1, s2 (reserving the Greek letters for their true values), then Y | X = x ∼ N(b0 + b1x, s2), and Yi and Yj are independent given Xi and Xj, so the over-all likelihood is

\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s^2}}\, e^{-\frac{(y_i - (b_0 + b_1 x_i))^2}{2 s^2}}    (42)

As usual, we work with the log-likelihood, which gives us the same information,³ but replaces products with sums:

L(b_0, b_1, s^2) = -\frac{n}{2}\log 2\pi - n \log s - \frac{1}{2 s^2}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2    (43)

We recall from mathematical statistics that when we've got a likelihood function, we generally want to maximize it. That is, we want to find the parameter values which make the data we observed as likely, as probable, as the model will allow. (This may not be very likely; that's another issue.) We recall from calculus that one way to maximize is to take derivatives and set them to zero.

\frac{\partial L}{\partial b_0} = -\frac{1}{2 s^2}\sum_{i=1}^{n} 2 (y_i - (b_0 + b_1 x_i))(-1)    (44)

\frac{\partial L}{\partial b_1} = -\frac{1}{2 s^2}\sum_{i=1}^{n} 2 (y_i - (b_0 + b_1 x_i))(-x_i)    (45)

Notice that when we set these derivatives to zero, all the multiplicative constants, in particular the prefactor of 1/(2s2), go away. We are left with

\sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0    (46)

\sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) x_i = 0    (47)

These are, up to a factor of 1/n, exactly the equations we got from the method of least squares (Eq. 9). That means that the least squares solution is the maximum likelihood estimate under the Gaussian noise model; this is no coincidence.⁴

² Strictly speaking, this is a "conditional" (on X) likelihood; but only pedants use the adjective in this context.

³ Why is this?

⁴ It's no coincidence because, to put it somewhat anachronistically, what Gauss did was ask himself "for what distribution of the noise would least squares maximize the likelihood?".


Now let’s take the derivative with respect to s:

\frac{\partial L}{\partial s} = -\frac{n}{s} + \frac{1}{2 s^3}\sum_{i=1}^{n} 2 (y_i - (b_0 + b_1 x_i))^2    (48)

Setting this to zero at the optimum and multiplying through by σ̂³, we get

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2    (49)

Notice that the right-hand side is just the in-sample mean squared error.
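To connect the algebra to computation, here is a sketch (mine, assuming the hypothetical simulated x and y from earlier) which maximizes the Gaussian log-likelihood numerically and compares the result with lm()'s least-squares fit and with the in-sample MSE; the coefficient estimates should agree, and the MLE of σ2 should match the mean squared residual (whereas summary(lm(...))$sigma^2 divides by n − 2 instead of n).

# Maximum likelihood under Gaussian noise, by direct numerical optimization
neg.loglik <- function(theta, x, y) {
    b0 <- theta[1]; b1 <- theta[2]; s <- exp(theta[3])   # exp() keeps s > 0
    -sum(dnorm(y, mean = b0 + b1 * x, sd = s, log = TRUE))
}
mle <- optim(par = c(0, 0, 0), fn = neg.loglik, x = x, y = y)
c(mle$par[1:2], exp(mle$par[3])^2)       # MLE of (beta0, beta1, sigma^2)
fit <- lm(y ~ x)
c(coef(fit), mean(residuals(fit)^2))     # least-squares fit and in-sample MSE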

Other models Maximum likelihood estimates of the regression curve coincide with least-squares estimates when the noise around the curve is additive, Gaussian, of constant variance, and both independent of X and of other noise terms. These were all assumptions we used in setting up a log-likelihood which was, up to constants, proportional to the (negative) mean-squared error. If any of those assumptions fail, maximum likelihood and least squares estimates can diverge, though sometimes the MLE solves a "generalized" least squares problem (as we'll see later in this course).

Exercises

To think through, not to hand in.

1. Show that if E[ε | X = x] = 0 for all x, then Cov[X, ε] = 0. Would this still be true if E[ε | X = x] = a for some other constant a?

2. Find the variance of β̂0. Hint: do you need to worry about the covariance between Ȳ and β̂1?
