Least-Squares Estimation: Recall that the projection of y onto C(X), the set of all vectors of the form Xb for b ∈ R^{k+1}, yields the closest point in C(X) to y. That is, p(y|C(X)) yields the minimizer of

Q(β) = ‖y − Xβ‖²    (the least squares criterion)

This leads to the estimator β̂ given by the solution of

X^T Xβ = X^T y (the normal equations), or β̂ = (X^T X)^{-1} X^T y.

All of this has already been established back when we studied projections (see pp. 30–31). Alternatively, we could use calculus:

To find a stationary point (maximum, minimum, or saddle point) of Q(β), we set the partial derivative of Q(β) equal to zero and solve:

\[
\frac{\partial}{\partial\beta}Q(\beta)
= \frac{\partial}{\partial\beta}(y - X\beta)^T(y - X\beta)
= \frac{\partial}{\partial\beta}\left(y^T y - 2y^T X\beta + \beta^T(X^T X)\beta\right)
= 0 - 2X^T y + 2X^T X\beta
\]

Here we've used the vector differentiation formulas ∂(c^T z)/∂z = c and ∂(z^T A z)/∂z = 2Az for symmetric A (see §2.14 of our text). Setting this result equal to zero, we obtain the normal equations, which have solution β̂ = (X^T X)^{-1} X^T y. That this is a minimum, rather than a maximum or saddle point, can be verified by checking the second derivative matrix of Q(β): ∂²Q(β)/∂β∂β^T = 2X^T X, which is positive definite (result 7, p. 54); therefore β̂ is a minimizer.
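The following is a minimal numerical sketch (my own illustration with simulated data, not part of the notes) showing that solving the normal equations reproduces what a generic least-squares routine computes.

```python
# Solve the normal equations X^T X b = X^T y and compare with a generic LS solver.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1), full rank
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solve the normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # generic least-squares solver
print(np.allclose(beta_hat, beta_lstsq))            # True
```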

101

Example — Simple Linear Regression: Consider the case k = 1:

yi = β0 + β1xi + ei, i = 1, . . . , n

where e_1, . . . , e_n are i.i.d., each with mean 0 and variance σ². Then the model equation becomes
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \underbrace{\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}}_{=X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}}_{=\beta}
+ \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.
\]

It follows that
\[
X^T X = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}, \qquad
X^T y = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}, \qquad
(X^T X)^{-1} = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}.
\]

Therefore, β̂ = (X^T X)^{-1} X^T y yields
\[
\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \left(\sum_i x_i^2\right)\left(\sum_i y_i\right) - \left(\sum_i x_i\right)\left(\sum_i x_i y_i\right) \\
-\left(\sum_i x_i\right)\left(\sum_i y_i\right) + n\sum_i x_i y_i \end{pmatrix}.
\]

After a bit of algebra, these estimators simplify to
\[
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}
\qquad \text{and} \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.
\]
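A small sketch (simulated data of my own, not from the notes) checking that the closed-form simple linear regression estimates S_{xy}/S_{xx} and ȳ − β̂₁x̄ match the matrix formula:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=30)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print(np.allclose([b0, b1], b_matrix))         # True
```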

102

In the case that X is of full rank, β̂ and µ̂ are given by

\[
\hat{\beta} = (X^T X)^{-1}X^T y, \qquad
\hat{\mu} = X\hat{\beta} = X(X^T X)^{-1}X^T y = P_{C(X)}y.
\]

• Notice that both βˆ and µˆ are linear functions of y. That is, in each case the estimator is given by some matrix times y. Note also that

\[
\hat{\beta} = (X^T X)^{-1}X^T y = (X^T X)^{-1}X^T(X\beta + e) = \beta + (X^T X)^{-1}X^T e.
\]

From this representation several important properties of the least squares estimator β̂ follow easily:

1. (unbiasedness):
\[
E(\hat{\beta}) = E\left(\beta + (X^T X)^{-1}X^T e\right) = \beta + (X^T X)^{-1}X^T \underbrace{E(e)}_{=0} = \beta.
\]

2. (var-cov matrix):
\[
\operatorname{var}(\hat{\beta}) = \operatorname{var}\left(\beta + (X^T X)^{-1}X^T e\right)
= (X^T X)^{-1}X^T \underbrace{\operatorname{var}(e)}_{=\sigma^2 I}\, X(X^T X)^{-1}
= \sigma^2 (X^T X)^{-1}.
\]

3. (normality): β̂ ∼ N_{k+1}(β, σ²(X^T X)^{-1}) (if e is assumed normal).

• These three properties require increasingly strong assumptions. Property (1) holds under assumptions A1 and A2 (additive error and linearity).

• Property (2) requires, in addition, the assumption of sphericity.

• Property (3) requires assumption A5 (normality). However, later we will present a central limit theorem-like result that establishes the asymptotic normality of βˆ under certain conditions even when e is not normal.
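A Monte Carlo sketch (my own illustration, with a made-up design matrix and parameter values) of properties 1 and 2: across repeated samples the average of β̂ is close to β, and its sample var-cov matrix is close to σ²(X^T X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0, 5, size=(n, 2))])  # fixed design
beta, sigma = np.array([1.0, -0.5, 2.0]), 1.5
XtX_inv = np.linalg.inv(X.T @ X)

est = np.array([np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(0, sigma, n)))
                for _ in range(20000)])
print(est.mean(axis=0))                      # approximately beta (unbiasedness)
print(np.cov(est.T) / (sigma**2 * XtX_inv))  # entries approximately 1 (var-cov formula)
```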

103

Example — Simple Linear Regression (Continued)

Result 2 on the previous page says that for var(y) = σ²I, var(β̂) = σ²(X^T X)^{-1}. Therefore, in the simple linear regression case,
\[
\operatorname{var}\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
= \sigma^2 (X^T X)^{-1}
= \frac{\sigma^2}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
= \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}
\begin{pmatrix} n^{-1}\sum_i x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}.
\]

Thus,
\[
\operatorname{var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_i x_i^2/n}{\sum_i (x_i - \bar{x})^2}
= \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right], \qquad
\operatorname{var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2},
\]
and
\[
\operatorname{cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\sigma^2\bar{x}}{\sum_i (x_i - \bar{x})^2}.
\]

• Note that if x̄ > 0, then cov(β̂_0, β̂_1) is negative, meaning that the slope and intercept are inversely related. That is, over repeated samples from the same model, the intercept will tend to decrease when the slope increases.

104

Gauss-Markov Theorem:

We have seen that in the spherical-errors, full-rank linear model, the least-squares estimator β̂ = (X^T X)^{-1}X^T y is unbiased and is a linear estimator. The following theorem states that in the class of linear unbiased estimators, the least-squares estimator is optimal (or best) in the sense that it has minimum variance among all estimators in this class.

Gauss-Markov Theorem: Consider the linear model y = Xβ + e, where X is n × (k + 1) of rank k + 1, n > k + 1, E(e) = 0, and var(e) = σ²I. The least-squares estimators β̂_j, j = 0, 1, . . . , k (the elements of β̂ = (X^T X)^{-1}X^T y) have minimum variance among all linear unbiased estimators.

Proof: Write β̂_j as β̂_j = c^T β̂, where c is the indicator vector containing a 1 in the (j + 1)st position and 0's elsewhere. Then β̂_j = c^T (X^T X)^{-1}X^T y = a^T y, where a = X(X^T X)^{-1}c. The quantity being estimated is β_j = c^T β = c^T (X^T X)^{-1}X^T µ = a^T µ, where µ = Xβ.

Consider an arbitrary linear estimator β̃_j = d^T y of β_j. For such an estimator to be unbiased, it must satisfy E(β̃_j) = E(d^T y) = d^T µ = a^T µ for any µ ∈ C(X). I.e.,

d^T µ − a^T µ = 0 ⇒ (d − a)^T µ = 0 for all µ ∈ C(X), or (d − a) ⊥ C(X). Then

\[
\tilde{\beta}_j = d^T y = a^T y + (d - a)^T y = \hat{\beta}_j + (d - a)^T y.
\]

The random variables on the right-hand side, β̂_j and (d − a)^T y, have
\[
\operatorname{cov}(a^T y, (d - a)^T y) = a^T \operatorname{var}(y)(d - a) = \sigma^2 a^T(d - a) = \sigma^2(d^T a - a^T a).
\]

Since d^T µ = a^T µ for any µ ∈ C(X) and a = X(X^T X)^{-1}c ∈ C(X), it follows that d^T a = a^T a, so that cov(a^T y, (d − a)^T y) = σ²(d^T a − a^T a) = σ²(a^T a − a^T a) = 0.

105

It follows that
\[
\operatorname{var}(\tilde{\beta}_j) = \operatorname{var}(\hat{\beta}_j) + \operatorname{var}((d - a)^T y) = \operatorname{var}(\hat{\beta}_j) + \sigma^2\|d - a\|^2.
\]

Therefore, var(β̃_j) ≥ var(β̂_j), with equality if and only if d = a or, equivalently, if and only if β̃_j = β̂_j.

Comments:

1. Notice that nowhere in this proof did we make use of the specific form of c as an indicator for one of the elements of β. That is, we have proved a slightly more general result than that given in the statement of the theorem: c^T β̂ is the minimum variance estimator in the class of linear unbiased estimators of c^T β, for any vector of constants c.

2. The least-squares estimator c^T β̂, where β̂ = (X^T X)^{-1}X^T y, is often called the B.L.U.E. (best linear unbiased estimator) of c^T β. Sometimes it is called the Gauss-Markov estimator.

3. The variance of the BLUE is
\[
\operatorname{var}(c^T\hat{\beta}) = \sigma^2\|a\|^2
= \sigma^2\left[X(X^T X)^{-1}c\right]^T\left[X(X^T X)^{-1}c\right]
= \sigma^2 c^T(X^T X)^{-1}c.
\]

Note that this variance formula depends upon X through (X^T X)^{-1}. Two implications of this observation are:

– If the columns of the X matrix are mutually orthogonal, then (X^T X)^{-1} will be diagonal, so that the elements of β̂ are uncorrelated.

– Even for a given set of explanatory variables, the values at which the explanatory variables are observed will affect the variance (precision) of the resulting parameter estimators.

4. What is remarkable about the Gauss-Markov Theorem is its distributional generality. It does not require normality! It says that β̂ is BLUE regardless of the distribution of e (or y), as long as we have mean-zero, spherical errors.

106

An additional property of least-squares estimation is that the estimated mean µ̂ = X(X^T X)^{-1}X^T y is invariant to (doesn't change as a result of) linear changes of scale in the explanatory variables. That is, consider the linear models
\[
y = \underbrace{\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}}_{=X}\beta + e
\]
and
\[
y = \underbrace{\begin{pmatrix}
1 & c_1 x_{11} & c_2 x_{12} & \cdots & c_k x_{1k} \\
1 & c_1 x_{21} & c_2 x_{22} & \cdots & c_k x_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & c_1 x_{n1} & c_2 x_{n2} & \cdots & c_k x_{nk}
\end{pmatrix}}_{=Z}\beta^* + e.
\]
Then µ̂, the least squares estimator of E(y), is the same in both of these two models. This follows from a more general theorem:

Theorem: In the linear model y = Xβ + e, where E(e) = 0 and X is of full rank, µ̂, the least-squares estimator of E(y), is invariant to a full rank linear transformation of X.

Proof: A full rank linear transformation of X is given by Z = XH, where H is square and of full rank. In the original (untransformed) linear model, µ̂ = X(X^T X)^{-1}X^T y = P_{C(X)}y. In the transformed model y = Zβ^* + e, µ̂ = Z(Z^T Z)^{-1}Z^T y = P_{C(Z)}y = P_{C(XH)}y. So, it suffices to show that P_{C(X)} = P_{C(XH)}. This is true because if x ∈ C(XH) then x = XHb for some b ⇒ x = Xc where c = Hb ⇒ x ∈ C(X) ⇒ C(XH) ⊂ C(X). In addition, if x ∈ C(X) then x = Xd for some d ⇒ x = XHH^{-1}d = XHa where a = H^{-1}d ⇒ x ∈ C(XH) ⇒ C(X) ⊂ C(XH). Therefore, C(X) = C(XH).

• The simple case described above, where each of the x_j's is rescaled by a constant c_j, occurs when H = diag(1, c_1, c_2, . . . , c_k).
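A quick sketch (simulated data of my own) of this invariance result: with Z = XH for a full-rank H (here a rescaling of the x's), the fitted values from the two models coincide, because P_{C(X)} = P_{C(XH)}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = np.diag([1.0, 2.0, -0.5, 10.0])          # full-rank linear transformation
Z = X @ H
fit_X = X @ np.linalg.solve(X.T @ X, X.T @ y)
fit_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
print(np.allclose(fit_X, fit_Z))             # True: mu_hat is unchanged
```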

107

Maximum Likelihood Estimation:

Least squares provides a simple, intuitively reasonable criterion for estimation: if we want to estimate a parameter describing µ, the mean of y, then choose the parameter value that minimizes the squared distance between y and µ. If var(y) = σ²I, then the resulting estimator is BLUE (optimal, in some sense).

• Least squares is based only on assumptions concerning the mean and variance-covariance matrix (the first two moments) of y.

• Least squares tells us how to estimate parameters associated with the mean (e.g., β), but nothing about how to estimate parameters describing the variance (e.g., σ²) or other aspects of the distribution of y.

An alternative method of estimation is maximum likelihood estimation.

• Maximum likelihood requires the specification of the entire distribution of y (up to some unknown parameters), rather than just the mean and variance of that distribution.

• ML estimation provides a criterion of estimation for any parameter describing the distribution of y, including parameters describing the mean (e.g., β), the variance (σ²), or any other aspect of the distribution.

• Thus, ML estimation is simultaneously more general and less general than least squares in certain senses. It can provide estimators of all sorts of parameters in a broad array of model types, including models much more complex than those for which least squares is appropriate; but it requires stronger assumptions than least squares.

108 ML Estimation: Suppose we have a discrete random variable Y (possibly a vector) with observed value y. Suppose Y has probability mass function

f(y; γ) = Pr(Y = y; γ), which depends upon an unknown p × 1 parameter vector γ taking values in a parameter space Γ. The likelihood function, L(γ; y), is defined to equal the probability mass function, but viewed as a function of γ, not y:

L(γ; y) = f(y; γ)

Therefore, the likelihood at γ0, say, has the interpretation

L(γ0; y) = Pr(Y = y when γ = γ0)

= Pr(observing the obtained data when γ = γ0)

Logic of ML: choose the value of γ that makes this probability largest ⇒ γ̂, the Maximum Likelihood Estimator or MLE.

We use the same procedure when Y is continuous, except that in this context Y has a probability density function f(y; γ) rather than a p.m.f. Nevertheless, the likelihood is defined the same way, as L(γ; y) = f(y; γ), and we choose γ to maximize L.

Often, our data come from a random sample, so that we observe y corresponding to Y_{n×1}, a random vector. In this case, we either

(i) specify a multivariate distribution for Y directly, and then the likelihood is equal to that probability density function (e.g., we assume Y is multivariate normal, and then the likelihood would be equal to a multivariate normal density), or

(ii) we use an assumption of independence among the components of Y to obtain the joint density of Y as the product of the marginal densities of its components (the Yi’s).

109

Under independence,
\[
L(\gamma; y) = \prod_{i=1}^{n} f(y_i; \gamma).
\]
Since it's easier to work with sums than products, it's useful to note that in general
\[
\arg\max_{\gamma} L(\gamma; y) = \arg\max_{\gamma} \underbrace{\log L(\gamma; y)}_{\equiv \ell(\gamma; y)}.
\]

Therefore, we define an MLE of γ as a γ̂ such that

\[
\ell(\hat{\gamma}; y) \geq \ell(\gamma; y) \quad \text{for all } \gamma \in \Gamma.
\]

If Γ is an open set, then γ̂ must satisfy (if it exists)

\[
\frac{\partial\ell(\hat{\gamma})}{\partial\gamma_j} = 0, \quad j = 1, \ldots, p,
\]
or, in vector form,
\[
\frac{\partial\ell(\hat{\gamma}; y)}{\partial\gamma}
= \begin{pmatrix} \partial\ell(\hat{\gamma})/\partial\gamma_1 \\ \vdots \\ \partial\ell(\hat{\gamma})/\partial\gamma_p \end{pmatrix} = 0
\qquad \text{(the likelihood equation, a.k.a. score equation).}
\]

110

In the classical linear model, the unknown parameters of the model are β and σ², so the pair (β, σ²) plays the role of γ.

Under assumption A5 that e ∼ N_n(0, σ²I_n), it follows that y ∼ N_n(Xβ, σ²I_n), so the likelihood function is given by
\[
L(\beta, \sigma^2; y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right\}
\]
for β ∈ R^{k+1} and σ² > 0. The log-likelihood is a bit easier to work with, and has the same maximizers. It is given by

\[
\ell(\beta, \sigma^2; y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2.
\]

We can maximize this function with respect to β and σ² in two steps: first, maximize with respect to β, treating σ² as fixed; second, plug that estimator back into the log-likelihood function and maximize with respect to σ².

For fixed σ², maximizing ℓ(β, σ²; y) is equivalent to maximizing the third term, −(1/(2σ²))‖y − Xβ‖², or, equivalently, minimizing ‖y − Xβ‖². This is just what we do in least squares, and it leads to the estimator β̂ = (X^T X)^{-1}X^T y.

Next we plug this estimator back into the log-likelihood (this gives what's known as the profile log-likelihood for σ²):

\[
\ell(\hat{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\|y - X\hat{\beta}\|^2
\]
and maximize with respect to σ².

111

Since the exponent 2 on σ² can be a little confusing when taking derivatives, let's change symbols from σ² to φ. Then, taking derivatives and setting the result equal to zero, we get the (profile) likelihood equation

\[
\frac{\partial\ell}{\partial\phi} = \frac{-n/2}{\phi} + \frac{(1/2)\|y - X\hat{\beta}\|^2}{\phi^2} = 0,
\]
which has solution

\[
\hat{\phi} = \hat{\sigma}^2 = \frac{1}{n}\|y - X\hat{\beta}\|^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^T\hat{\beta}\right)^2,
\]

where x_i^T is the i-th row of X.

• Note that to be sure that the solution to this equation is a maximum (rather than a minimum or saddle point), we must check that ∂²ℓ/∂φ² is negative at the solution. I leave it as an exercise for you to check that this is indeed the case.

Therefore, the MLE of (β, σ²) in the classical linear model is (β̂, σ̂²), where

\[
\hat{\beta} = (X^T X)^{-1}X^T y \qquad \text{and} \qquad
\hat{\sigma}^2 = \frac{1}{n}\|y - X\hat{\beta}\|^2 = \frac{1}{n}\|y - \hat{\mu}\|^2,
\]
where µ̂ = Xβ̂.

• Note that
\[
\hat{\sigma}^2 = \frac{1}{n}\|y - p(y\,|\,C(X))\|^2 = \frac{1}{n}\|p(y\,|\,C(X)^{\perp})\|^2.
\]

112

Estimation of σ²:

• Maximum likelihood estimation provides a unified approach to estimating all parameters in the model, β and σ².

• In contrast, least squares estimation only provides an estimator of β.

We've seen that the LS and ML estimators of β coincide. However, the MLE of σ² is not the usually preferred estimator of σ², and it is not the estimator of σ² that is typically combined with LS estimation of β. Why not?

Because σ̂² is biased. That E(σ̂²) ≠ σ² can easily be established using our results for taking expected values of quadratic forms:
\[
E(\hat{\sigma}^2) = E\left(\frac{1}{n}\|P_{C(X)^{\perp}}y\|^2\right)
= \frac{1}{n}E\left\{(P_{C(X)^{\perp}}y)^T P_{C(X)^{\perp}}y\right\}
= \frac{1}{n}E\left(y^T P_{C(X)^{\perp}}y\right)
\]
\[
= \frac{1}{n}\Big\{\sigma^2\dim(C(X)^{\perp}) + \underbrace{\|P_{C(X)^{\perp}}X\beta\|^2}_{=0,\ \text{because } X\beta\,\in\,C(X)}\Big\}
= \frac{\sigma^2}{n}\dim(C(X)^{\perp})
= \frac{\sigma^2}{n}\{n - \underbrace{\dim(C(X))}_{=\operatorname{rank}(X)}\}
= \frac{\sigma^2}{n}\{n - (k+1)\}.
\]

Therefore, the MLE σ̂² is biased by a multiplicative factor of (n − k − 1)/n, and an alternative, unbiased estimator of σ² can easily be constructed as
\[
s^2 \equiv \frac{n}{n - k - 1}\hat{\sigma}^2 = \frac{1}{n - k - 1}\|y - X\hat{\beta}\|^2,
\]
or, more generally (that is, for X not necessarily of full rank),
\[
s^2 = \frac{1}{n - \operatorname{rank}(X)}\|y - X\hat{\beta}\|^2.
\]

113

• s², rather than σ̂², is generally the preferred estimator of σ². In fact, it can be shown that in the spherical-errors linear model, s² is the best (minimum variance) estimator of σ² in the class of quadratic (in y) unbiased estimators.

• In the special case that Xβ = µj_n (i.e., the model contains only an intercept, or constant term), so that C(X) = L(j_n), we get β̂ = µ̂ = ȳ and rank(X) = 1. Therefore, s² becomes the usual sample variance from the one-sample problem:

\[
s^2 = \frac{1}{n-1}\|y - \bar{y}j_n\|^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.
\]
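A short sketch (simulated data of my own) contrasting the MLE of σ² (SSE divided by n) with the unbiased estimator s² (SSE divided by n − rank(X)); only the divisor differs.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=2.0, size=n)

resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
sse = resid @ resid
sigma2_mle = sse / n                       # biased downward by the factor (n - k - 1)/n
s2 = sse / (n - np.linalg.matrix_rank(X))  # unbiased
print(sigma2_mle, s2)
```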

If e has a normal distribution, then by part 3 of the theorem on p. 85,

\[
\|y - X\hat{\beta}\|^2/\sigma^2 \sim \chi^2(n - \operatorname{rank}(X)),
\]
and, since the central χ²(m) distribution has mean m and variance 2m,

\[
s^2 = \frac{\sigma^2}{n - \operatorname{rank}(X)}\underbrace{\|y - X\hat{\beta}\|^2/\sigma^2}_{\sim\,\chi^2(n - \operatorname{rank}(X))}
\]
implies
\[
E(s^2) = \frac{\sigma^2}{n - \operatorname{rank}(X)}\{n - \operatorname{rank}(X)\} = \sigma^2,
\]
and

\[
\operatorname{var}(s^2) = \frac{\sigma^4}{\{n - \operatorname{rank}(X)\}^2}\,2\{n - \operatorname{rank}(X)\} = \frac{2\sigma^4}{n - \operatorname{rank}(X)}.
\]

114 Properties of βˆ and s2 — Summary:

Theorem: Under assumptions A1–A5 of the classical linear model,

i. β̂ ∼ N_{k+1}(β, σ²(X^T X)^{-1}),

ii. (n − k − 1)s²/σ² ∼ χ²(n − k − 1), and

iii. β̂ and s² are independent.

Proof: We've already shown (i) and (ii). Result (iii) follows from the fact that β̂ = (X^T X)^{-1}X^T µ̂ = (X^T X)^{-1}X^T P_{C(X)}y and s² = (n − k − 1)^{-1}‖P_{C(X)^⊥}y‖² are functions of projections onto mutually orthogonal subspaces C(X) and C(X)^⊥.

Minimum Variance Unbiased Estimation:

• The Gauss-Markov Theorem establishes that the least-squares es- timator cT βˆ for cT β in the linear model with spherical, but not- necessarily-normal, errors is the minimum variance linear unbiased estimator.

• If, in addition, we add the assumption of normal errors, then the least-squares estimator has minimum variance among all unbiased estimators.

• The general theory of minimum variance unbiased estimation is be- yond the scope of this course, but we will present the background material we need without proof or detailed discussion. Our main goal is just to establish that cT βˆ and s2 are minimum variance unbiased. A more general and complete discussion of minimum variance unbi- ased estimation can be found in STAT 6520 or STAT 6820.

115 Our model is the classical linear model with normal errors:

y = Xβ + e,    e ∼ N(0, σ²I_n).

We first need the concept of a complete sufficient statistic:

Sufficiency: Let y be random vector with p.d.f. f(y; θ) depending on an unknown k × 1 parameter θ. Let T(y) be an r × 1 vector-valued statistic that is a function of y. Then T(y) is said to be a sufficient statistic for θ if and only if the conditional distribution of y given the value of T(y) does not depend upon θ.

• If T is sufficient for θ then, loosely, T summarizes all of the informa- tion in the data y relevant to θ. Once we know T, there’s no more information in y about θ.

The property of completeness is needed as well, but it is somewhat tech- nical. Briefly, it ensures that if a function of the sufficient statistic exists that is unbiased for the quantity being estimated, then it is unique.

Completeness: A vector-valued sufficient statistic T(y) is said to be complete if and only if E{h(T(y))} = 0 for all θ implies Pr{h(T(y)) = 0} = 1 for all θ.

Theorem: If T(y) is a complete sufficient statistic, then f(T(y)) is a minimum variance unbiased estimator of E{f(T(y))}.

Proof: This theorem is known as the Lehmann–Scheffé Theorem and its proof follows easily from the Rao-Blackwell Theorem. See, e.g., Bickel and Doksum, p. 122, or Casella and Berger, p. 320.

In the linear model, the p.d.f. of y depends upon β and σ2, so the pair (β, σ2) plays the role of θ.

116 Is there a complete sufficient statistic for (β, σ2) in the classical linear model?

Yes, by the following result:

Theorem: Let θ = (θ_1, . . . , θ_r)^T and let y be a random vector with probability density function
\[
f(y) = c(\theta)\exp\left\{\sum_{i=1}^{r}\theta_i T_i(y)\right\}h(y).
\]

Then T(y) = (T_1(y), . . . , T_r(y))^T is a complete sufficient statistic, provided that neither θ nor T(y) satisfies any linear constraints.

• The density function in the above theorem describes the exponential family of distributions. For this family, which includes the normal distribution, it is easy to find a complete sufficient statistic.

Consider the classical linear model

y = Xβ + e,    e ∼ N(0, σ²I_n).

The density of y can be written as

\[
f(y; \beta, \sigma^2) = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\{-(y - X\beta)^T(y - X\beta)/(2\sigma^2)\}
= c_1(\sigma^2)\exp\{-(y^T y - 2\beta^T X^T y + \beta^T X^T X\beta)/(2\sigma^2)\}
\]
\[
= c_2(\beta, \sigma^2)\exp\{(-1/(2\sigma^2))\,y^T y + (\sigma^{-2}\beta^T)(X^T y)\}.
\]

If we reparameterize in terms of θ, where
\[
\theta_1 = -\frac{1}{2\sigma^2}, \qquad
\begin{pmatrix} \theta_2 \\ \vdots \\ \theta_{k+2} \end{pmatrix} = \frac{1}{\sigma^2}\beta,
\]
then this density can be seen to be of the exponential form, with vector-valued complete sufficient statistic
\[
\begin{pmatrix} y^T y \\ X^T y \end{pmatrix}.
\]

117

So, since c^T β̂ = c^T (X^T X)^{-1}X^T y is a function of X^T y and is an unbiased estimator of c^T β, it must be minimum variance among all unbiased estimators.

In addition, s² = (y − Xβ̂)^T(y − Xβ̂)/(n − k − 1) is an unbiased estimator of σ² and can be written as a function of the complete sufficient statistic as well:
\[
s^2 = \frac{1}{n - k - 1}[(I_n - P_{C(X)})y]^T[(I_n - P_{C(X)})y]
= \frac{1}{n - k - 1}y^T(I_n - P_{C(X)})y
= \frac{1}{n - k - 1}\{y^T y - (y^T X)(X^T X)^{-1}(X^T y)\}.
\]

Therefore, s2 is a minimum variance unbiased estimator as well.

Taken together, these results prove the following theorem:

Theorem: For the full rank, classical linear model y = Xβ + e, e ∼ N_n(0, σ²I_n), s² is a minimum variance unbiased estimator of σ², and c^T β̂ is a minimum variance unbiased estimator of c^T β, where β̂ = (X^T X)^{-1}X^T y is the least squares estimator (MLE) of β.

118 Generalized Least Squares

Up to now, we have assumed var(e) = σ²I in our linear model. There are two aspects to this assumption: (i) uncorrelatedness (var(e) is diagonal), and (ii) homoscedasticity (the diagonal elements of var(e) are all the same).

Now we relax these assumptions simultaneously by considering a more general variance-covariance structure. We now consider the linear model

y = Xβ + e, where E(e) = 0, var(e) = σ2V, where X is full rank as before, and where V is a known positive definite matrix.

• Note that we assume V is known, so there still is only one variance- covariance parameter to be estimated, σ2.

• In the context of least-squares, allowing V to be unknown compli- cates things substantially, so we postpone discussion of this case. V unknown can be handled via ML estimation and we’ll talk about that later. Of course, V unknown is the typical scenario in practice, but there are cases when V would be known.

• A good example of such a situation is the simple linear regression model with uncorrelated, but heteroscedastic errors:

yi = β0 + β1xi + ei,

where the ei’s are independent, each with mean 0, and var(ei) = 2 2 σ xi. In this case, var(e) = σ V where V = diag(x1, . . . , xn), a known matrix of constants.

119 Estimation of β and σ2 when var(e) = σ2V:

A nice feature of the model

y = Xβ + e, where var(e) = σ²V,    (1)

is that, although it is not a Gauss-Markov (spherical errors) model, it is simple to transform this model into a Gauss-Markov model. This allows us to apply what we've learned about the spherical errors case to obtain methods and results for the non-spherical case.

Since V is known and positive definite, it is possible to find a matrix Q such that V = QQ^T (e.g., Q^T could be the Cholesky factor of V).

Multiplying both sides of the model equation in (1) on the left by the known matrix Q^{-1}, it follows that the following transformed model holds as well:

Q^{-1}y = Q^{-1}Xβ + Q^{-1}e, or ỹ = X̃β + ẽ, where var(ẽ) = σ²I,    (2)

where ỹ = Q^{-1}y, X̃ = Q^{-1}X, and ẽ = Q^{-1}e.

• Notice that model (2) is a Gauss-Markov model because

E(ẽ) = Q^{-1}E(e) = Q^{-1}0 = 0

and
\[
\operatorname{var}(\tilde{e}) = Q^{-1}\operatorname{var}(e)(Q^{-1})^T = \sigma^2 Q^{-1}V(Q^{-1})^T = \sigma^2 Q^{-1}QQ^T(Q^{-1})^T = \sigma^2 I.
\]

120

The least-squares estimator based on the transformed model minimizes
\[
\tilde{e}^T\tilde{e} = e^T(Q^{-1})^T Q^{-1}e = (y - X\beta)^T(QQ^T)^{-1}(y - X\beta) = (y - X\beta)^T V^{-1}(y - X\beta) \qquad \text{(the GLS criterion)}
\]

• So the generalized least squares estimates of β from model (1) minimize a squared statistical (rather than Euclidean) distance between y and Xβ, one that takes into account the differing variances among the y_i's and the correlations among the y_i's.

• There is some variability in terminology here. Most authors refer to this approach as generalized least squares when V is an arbitrary, known, positive definite matrix, and use the term weighted least squares for the case in which V is diagonal. Others use the terms interchangeably.

Since GLS estimators for model (1) are just ordinary least squares estimators from model (2), many properties of GLS estimators follow easily from the properties of ordinary least squares.

Properties of GLS Estimators:

1. The best linear unbiased estimator of β in model (1) is

\[
\hat{\beta} = (X^T V^{-1}X)^{-1}X^T V^{-1}y.
\]

Proof: Since model (2) is a Gauss-Markov model, we know that (X̃^T X̃)^{-1}X̃^T ỹ is the BLUE of β. But this estimator simplifies to

\[
(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{y}
= [(Q^{-1}X)^T(Q^{-1}X)]^{-1}(Q^{-1}X)^T(Q^{-1}y)
= [X^T(Q^{-1})^T Q^{-1}X]^{-1}X^T(Q^{-1})^T Q^{-1}y
= [X^T(QQ^T)^{-1}X]^{-1}X^T(QQ^T)^{-1}y
= (X^T V^{-1}X)^{-1}X^T V^{-1}y.
\]

121

2. Since µ = E(y) = Xβ in model (1), the estimated mean of y is

\[
\hat{\mu} = X\hat{\beta} = X(X^T V^{-1}X)^{-1}X^T V^{-1}y.
\]

In going from the var(e) = σ²I case to the var(e) = σ²V case, we've changed our estimate of the mean from

\[
X(X^T X)^{-1}X^T y = P_{C(X)}y
\]

to X(X^T V^{-1}X)^{-1}X^T V^{-1}y. Geometrically, we've changed from using the Euclidean (or orthogonal) projection matrix P_{C(X)} to using a non-Euclidean (or oblique) projection matrix X(X^T V^{-1}X)^{-1}X^T V^{-1}. The latter accounts for correlation and heteroscedasticity among the elements of y when projecting onto C(X).

3. The var-cov matrix of βˆ is

\[
\operatorname{var}(\hat{\beta}) = \sigma^2(X^T V^{-1}X)^{-1}.
\]

Proof:

\[
\operatorname{var}(\hat{\beta}) = \operatorname{var}\{(X^T V^{-1}X)^{-1}X^T V^{-1}y\}
= (X^T V^{-1}X)^{-1}X^T V^{-1}\underbrace{\operatorname{var}(y)}_{=\sigma^2 V}V^{-1}X(X^T V^{-1}X)^{-1}
= \sigma^2(X^T V^{-1}X)^{-1}.
\]

122

4. An unbiased estimator of σ² is

\[
s^2 = \frac{(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})}{n - k - 1}
= \frac{y^T[V^{-1} - V^{-1}X(X^T V^{-1}X)^{-1}X^T V^{-1}]y}{n - k - 1},
\]

where β̂ = (X^T V^{-1}X)^{-1}X^T V^{-1}y.

Proof: Homework.

5. If e ∼ N(0, σ2V), then the MLEs of β and σ2 are

\[
\hat{\beta} = (X^T V^{-1}X)^{-1}X^T V^{-1}y, \qquad
\hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta}).
\]

Proof: We already know that β̂ is the OLS estimator in model (2) and that the OLS estimator and MLE in such a Gauss-Markov model coincide, so β̂ is the MLE of β. In addition, the MLE of σ² is the MLE of this quantity in model (2), which is

\[
\hat{\sigma}^2 = \frac{1}{n}\|(I - \tilde{X}(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T)\tilde{y}\|^2
= \frac{1}{n}(y - X\hat{\beta})^T V^{-1}(y - X\hat{\beta})
\]
after plugging in X̃ = Q^{-1}X, ỹ = Q^{-1}y and some algebra.
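A sketch (with a simulated V and data of my own) of the transformation argument used in Property 1: regressing Q^{-1}y on Q^{-1}X by ordinary least squares, where V = QQ^T is a Cholesky factorization, reproduces the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)                        # a known positive definite V
y = X @ np.array([1.0, -2.0]) + rng.multivariate_normal(np.zeros(n), V)

Q = np.linalg.cholesky(V)                          # V = Q Q^T
Xt, yt = np.linalg.solve(Q, X), np.linalg.solve(Q, y)
beta_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)   # OLS in transformed model

Vinv = np.linalg.inv(V)
beta_direct = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # direct GLS formula
print(np.allclose(beta_transformed, beta_direct))  # True
```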

123 Misspecification of the Error Structure:

Q: What happens if we use OLS when GLS is appropriate?

A: The OLS estimator is still linear and unbiased, but no longer best. In addition, we need to be careful to compute the var-cov matrix of our estimator correctly.

Suppose the true model is

y = Xβ + e, E(e) = 0, var(e) = σ2V.

The BLUE of β here is the GLS estimator β̂ = (X^T V^{-1}X)^{-1}X^T V^{-1}y, with var-cov matrix σ²(X^T V^{-1}X)^{-1}.

However, suppose we use OLS here instead of GLS. That is, suppose we use the estimator
\[
\hat{\beta}^* = (X^T X)^{-1}X^T y.
\]
Obviously, this estimator is still linear, and it is unbiased because

\[
E(\hat{\beta}^*) = E\{(X^T X)^{-1}X^T y\} = (X^T X)^{-1}X^T E(y) = (X^T X)^{-1}X^T X\beta = \beta.
\]

However, the variance formula var(β̂*) = σ²(X^T X)^{-1} is no longer correct, because it was derived under the assumption that var(e) = σ²I (see p. 103). Instead, the correct var-cov matrix of the OLS estimator here is

\[
\operatorname{var}(\hat{\beta}^*) = \operatorname{var}\{(X^T X)^{-1}X^T y\}
= (X^T X)^{-1}X^T\underbrace{\operatorname{var}(y)}_{=\sigma^2 V}X(X^T X)^{-1}
= \sigma^2(X^T X)^{-1}X^T V X(X^T X)^{-1}. \qquad (*)
\]

In contrast, if we had used the GLS estimator (the BLUE), the var-cov matrix of our estimator would have been

\[
\operatorname{var}(\hat{\beta}) = \sigma^2(X^T V^{-1}X)^{-1}. \qquad (**)
\]

124

Since β̂ is the BLUE, we know that the variances from (*) will be ≥ the variances from (**), which means that the OLS estimator here is a less efficient (less precise), though not necessarily much less efficient, estimator under the GLS model.
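A sketch (with a made-up diagonal V) comparing the correct var-cov matrix of the OLS estimator under var(e) = σ²V, formula (*), with the var-cov matrix of the GLS estimator, formula (**); the GLS variances are no larger.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = np.diag(rng.uniform(0.5, 4.0, size=n))   # heteroscedastic, known V
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv            # (*)  sandwich form
var_gls = sigma2 * np.linalg.inv(X.T @ np.linalg.inv(V) @ X)  # (**)
print(np.diag(var_ols) >= np.diag(var_gls))                   # elementwise True
```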

Misspecification of E(y):

Suppose that the true model is y = Xβ + e, where we return to the spherical errors case: var(e) = σ²I. We want to consider what happens when we omit some explanatory variables that belong in X and when we include too many x's. So, let's partition our model as
\[
y = X\beta + e = (X_1, X_2)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + e
\]

\[
= X_1\beta_1 + X_2\beta_2 + e. \qquad (\dagger)
\]

• If we leave out X2β2 when it should be included (when β2 6= 0) then we are underfitting.

• If we include X2β2 when it doesn’t belong in the true model (when β2 = 0) then we are overfitting.

• We will consider the effects of both overfitting and underfitting on the bias and variance of β̂. The book also considers effects on predicted values and on the MSE s².

125 Underfitting:

Suppose model (†) holds, but we fit the model

\[
y = X_1\beta_1^* + e^*, \qquad \operatorname{var}(e^*) = \sigma^2 I. \qquad (\clubsuit)
\]

The following theorem gives the bias and var-cov matrix of β̂_1^*, the OLS estimator from (♣).

Theorem: If we fit model (♣) when model (†) is the true model, then the mean and var-cov matrix of the OLS estimator β̂_1^* = (X_1^T X_1)^{-1}X_1^T y are as follows:

(i) E(β̂_1^*) = β_1 + Aβ_2, where A = (X_1^T X_1)^{-1}X_1^T X_2.

(ii) var(β̂_1^*) = σ²(X_1^T X_1)^{-1}.

Proof:

(i)
\[
E(\hat{\beta}_1^*) = E[(X_1^T X_1)^{-1}X_1^T y] = (X_1^T X_1)^{-1}X_1^T E(y)
= (X_1^T X_1)^{-1}X_1^T(X_1\beta_1 + X_2\beta_2)
= \beta_1 + A\beta_2.
\]

(ii)
\[
\operatorname{var}(\hat{\beta}_1^*) = \operatorname{var}[(X_1^T X_1)^{-1}X_1^T y]
= (X_1^T X_1)^{-1}X_1^T(\sigma^2 I)X_1(X_1^T X_1)^{-1}
= \sigma^2(X_1^T X_1)^{-1}.
\]

• This result says that when underfitting, β̂_1^* is biased by an amount that depends upon both the omitted and the included explanatory variables.

Corollary: If X_1^T X_2 = 0, i.e., if the columns of X_1 are orthogonal to the columns of X_2, then β̂_1^* is unbiased.

126

Note that in the above theorem the var-cov matrix of β̂_1^*, σ²(X_1^T X_1)^{-1}, is not the same as the var-cov matrix of β̂_1, the corresponding portion of the OLS estimator β̂ = (X^T X)^{-1}X^T y from the full model. How these var-cov matrices differ is established in the following theorem:

Theorem: Let β̂ = (X^T X)^{-1}X^T y from the full model (†) be partitioned as
\[
\hat{\beta} = \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
\]
and let β̂_1^* = (X_1^T X_1)^{-1}X_1^T y be the estimator from the reduced model (♣). Then
\[
\operatorname{var}(\hat{\beta}_1) - \operatorname{var}(\hat{\beta}_1^*) = \sigma^2 AB^{-1}A^T,
\]
a n.n.d. matrix. Here, A = (X_1^T X_1)^{-1}X_1^T X_2 and B = X_2^T X_2 − X_2^T X_1 A.

• Thus var(β̂_j) ≥ var(β̂_j^*), meaning that underfitting results in smaller variances of the β̂_j's and overfitting results in larger variances of the β̂_j's.

Proof: Partitioning X^T X to conform to the partitioning of X and β, we have
\[
\operatorname{var}(\hat{\beta}) = \operatorname{var}\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
= \sigma^2(X^T X)^{-1}
= \sigma^2\begin{pmatrix} X_1^T X_1 & X_1^T X_2 \\ X_2^T X_1 & X_2^T X_2 \end{pmatrix}^{-1}
= \sigma^2\begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix}^{-1}
= \sigma^2\begin{pmatrix} H^{11} & H^{12} \\ H^{21} & H^{22} \end{pmatrix},
\]

where H_{ij} = X_i^T X_j and H^{ij} is the corresponding block of the inverse matrix (X^T X)^{-1} (see p. 54).

So, var(β̂_1) = σ²H^{11}. Using the formulas for inverses of partitioned matrices,
\[
H^{11} = H_{11}^{-1} + H_{11}^{-1}H_{12}B^{-1}H_{21}H_{11}^{-1},
\qquad \text{where} \qquad
B = H_{22} - H_{21}H_{11}^{-1}H_{12}.
\]

127

In the previous theorem, we showed that var(β̂_1^*) = σ²(X_1^T X_1)^{-1} = σ²H_{11}^{-1}. Hence,

\[
\operatorname{var}(\hat{\beta}_1) - \operatorname{var}(\hat{\beta}_1^*)
= \sigma^2(H^{11} - H_{11}^{-1})
= \sigma^2(H_{11}^{-1} + H_{11}^{-1}H_{12}B^{-1}H_{21}H_{11}^{-1} - H_{11}^{-1})
= \sigma^2 H_{11}^{-1}H_{12}B^{-1}H_{21}H_{11}^{-1}
\]
\[
= \sigma^2[(X_1^T X_1)^{-1}(X_1^T X_2)B^{-1}(X_2^T X_1)(X_1^T X_1)^{-1}]
= \sigma^2 AB^{-1}A^T.
\]

We leave it as homework for you to show that AB^{-1}A^T is n.n.d.

• To summarize, we’ve seen that underfitting reduces the variances of regression parameter estimators, but introduces bias. On the other hand, overfitting produces unbiased estimators with increased vari- ances. Thus it is the task of a regression model builder to find an optimum set of explanatory variables to balance between a biased model and one with large variances.

128 The Model in Centered Form

For some purposes it is useful to write the regression model in centered form; that is, in terms of the centered explanatory variables (the explana- tory variables minus their means).

The regression model can be written

yi = β0 + β1xi1 + β2xi2 + ··· + βkxik + ei

= α + β1(xi1 − x¯1) + β2(xi2 − x¯2) + ··· + βk(xik − x¯k) + ei, for i = 1, . . . , n, where

\[
\alpha = \beta_0 + \beta_1\bar{x}_1 + \beta_2\bar{x}_2 + \cdots + \beta_k\bar{x}_k, \qquad (\heartsuit)
\]
and where x̄_j = (1/n)∑_{i=1}^n x_{ij}. In matrix form, the equivalence between the original model and the centered model that we've written above becomes
\[
y = X\beta + e = (j_n, X_c)\begin{pmatrix} \alpha \\ \beta_1 \end{pmatrix} + e,
\]

where β_1 = (β_1, . . . , β_k)^T, and
\[
X_c = \underbrace{\left(I - \frac{1}{n}J_{n,n}\right)}_{=P_{L(j_n)^{\perp}}}X_1 =
\begin{pmatrix}
x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1k} - \bar{x}_k \\
x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2k} - \bar{x}_k \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{nk} - \bar{x}_k
\end{pmatrix},
\]
where X_1 is the matrix consisting of all but the first column of X, the original model matrix.

• P_{L(j_n)^⊥} = (I − (1/n)J_{n,n}) is sometimes called the centering matrix.

Based on the centered model, the least squares estimators become:
\[
\begin{pmatrix} \hat{\alpha} \\ \hat{\beta}_1 \end{pmatrix}
= [(j_n, X_c)^T(j_n, X_c)]^{-1}(j_n, X_c)^T y
= \begin{pmatrix} n & 0^T \\ 0 & X_c^T X_c \end{pmatrix}^{-1}\begin{pmatrix} j_n^T y \\ X_c^T y \end{pmatrix}
= \begin{pmatrix} n^{-1} & 0^T \\ 0 & (X_c^T X_c)^{-1} \end{pmatrix}\begin{pmatrix} n\bar{y} \\ X_c^T y \end{pmatrix},
\]

or α̂ = ȳ and β̂_1 = (X_c^T X_c)^{-1}X_c^T y.

β̂_1 here is the same as the usual least-squares estimator. That is, it is the same as (β̂_1, . . . , β̂_k) from β̂ = (X^T X)^{-1}X^T y. However, the intercept α̂ differs from β̂_0. The relationship between α̂ and β̂ is just what you'd expect from the reparameterization (see (♥)):
\[
\hat{\alpha} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_k\bar{x}_k.
\]
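A sketch (simulated data of my own) of the centered form: fitting y on the centered x's gives α̂ = ȳ and the same slopes as the uncentered fit, with α̂ = β̂_0 + ∑_j β̂_j x̄_j.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 40, 2
X1 = rng.normal(loc=5.0, size=(n, k))
y = 3.0 + X1 @ np.array([1.0, -2.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # (b0, b1, b2) from the original model

Xc = X1 - X1.mean(axis=0)                             # centered explanatory variables
slopes_c = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)
alpha_hat = y.mean()
print(np.allclose(slopes_c, beta_hat[1:]))            # True: same slopes
print(np.isclose(alpha_hat, beta_hat[0] + X1.mean(axis=0) @ beta_hat[1:]))  # True
```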

From the expression for the estimated mean based on the centered model:

\[
\widehat{E(y_i)} = \hat{\alpha} + \hat{\beta}_1(x_{i1} - \bar{x}_1) + \hat{\beta}_2(x_{i2} - \bar{x}_2) + \cdots + \hat{\beta}_k(x_{ik} - \bar{x}_k),
\]
it is clear that the fitted regression plane passes through the point of averages (ȳ, x̄_1, x̄_2, . . . , x̄_k).

In general, we can write SSE, the error sum of squares, as

\[
SSE = (y - X\hat{\beta})^T(y - X\hat{\beta}) = (y - P_{C(X)}y)^T(y - P_{C(X)}y)
= y^T y - y^T P_{C(X)}y - y^T P_{C(X)}y + y^T P_{C(X)}y
= y^T y - y^T P_{C(X)}y = y^T y - \hat{\beta}^T X^T y.
\]
From the centered model we see that \(\widehat{E(y)} = X\hat{\beta} = (j_n, X_c)\binom{\hat{\alpha}}{\hat{\beta}_1}\), so SSE can also be written as
\[
SSE = y^T y - (\hat{\alpha}, \hat{\beta}_1^T)\begin{pmatrix} j_n^T \\ X_c^T \end{pmatrix}y
= y^T y - \bar{y}j_n^T y - \hat{\beta}_1^T X_c^T y
= (y - \bar{y}j_n)^T y - \hat{\beta}_1^T X_c^T y
= (y - \bar{y}j_n)^T(y - \bar{y}j_n) - \hat{\beta}_1^T X_c^T y
= \sum_{i=1}^{n}(y_i - \bar{y})^2 - \hat{\beta}_1^T X_c^T y. \qquad (*)
\]

130 R2, the Estimated Coefficient of Determination

Rearranging (*), we obtain a decomposition of the total variability in the data:
\[
\sum_{i=1}^{n}(y_i - \bar{y})^2 = \hat{\beta}_1^T X_c^T y + SSE,
\qquad \text{or} \qquad
SST = SSR + SSE.
\]

• Here SST is the (corrected) total sum of squares. The term "corrected" here indicates that we've taken the sum of the squared y's after correcting, or adjusting, them for the mean. The uncorrected sum of squares would be ∑_{i=1}^n y_i², but this quantity arises less frequently, and by "SST" or "total sum of squares" we will generally mean the corrected quantity unless stated otherwise.

• Note that SST quantifies the total variability in the data (if we added a 1/(n − 1) multiplier in front, SST would become the sample variance).

• The first term on the right-hand side is called the regression sum of squares. It represents the variability in the data (the portion of SST) that can be explained by the regression terms β_1x_1 + β_2x_2 + ··· + β_kx_k.

• This interpretation can be seen by writing SSR as

\[
SSR = \hat{\beta}_1^T X_c^T y = \hat{\beta}_1^T X_c^T X_c(X_c^T X_c)^{-1}X_c^T y = (X_c\hat{\beta}_1)^T(X_c\hat{\beta}_1).
\]

The proportion of the total sum of squares that is due to regression is

\[
R^2 = \frac{SSR}{SST} = \frac{\hat{\beta}_1^T X_c^T X_c\hat{\beta}_1}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\hat{\beta}^T X^T y - n\bar{y}^2}{y^T y - n\bar{y}^2}.
\]

• This quantity is called the coefficient of determination, and it is usually denoted as R2. It is the sample estimate of the squared multiple correlation coefficient we discussed earlier (see p. 77).

131 Facts about R2:

1. The range of R2 is 0 ≤ R2 ≤ 1, with 0 corresponding to the explana- tory variables x1, . . . , xk explaining none of the variability in y and 1 corresponding to x1, . . . , xk explaining all of the variability in y.

2. R, the multiple correlation coefficient or positive square root of R², is equal to the sample correlation coefficient between the observed y_i's and their fitted values, the ŷ_i's. (Here the fitted value is just the estimated mean: ŷ_i = Ê(y_i) = x_i^T β̂.)

3. R² will always stay the same or (typically) increase if an explanatory variable x_{k+1} is added to the model.

4. If β_1 = β_2 = ··· = β_k = 0, then

\[
E(R^2) = \frac{k}{n - 1}.
\]

• From properties 3 and 4, we see that R² tends to be higher for a model with many predictors than for a model with few predictors, even if those models have the same explanatory power. That is, as a measure of goodness of fit, R² rewards complexity and penalizes parsimony, which is certainly not what we would like to do.

• Therefore, a version of R² that penalizes for model complexity was developed, known as R_a² or adjusted R²:
\[
R_a^2 = \frac{\left(R^2 - \frac{k}{n-1}\right)(n - 1)}{n - k - 1} = \frac{(n - 1)R^2 - k}{n - k - 1}.
\]

132

5. Unless the x_j's, j = 1, . . . , k, are mutually orthogonal, R² cannot be written as a sum of k components uniquely attributable to x_1, . . . , x_k. (R² represents the joint explanatory power of the x_j's, not the sum of the explanatory powers of each of the individual x_j's.)

6. R2 is invariant to a full-rank linear transformation of X and to a scale change on y (but not invariant to a joint linear transformation on [y, X]).

7. Geometrically, R, the multiple correlation coefficient, is equal to R = cos(θ) where θ is the angle between y and yˆ corrected for their means, y¯jn. This is depicted in the picture below.
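A sketch (simulated data of my own) computing R² = SSR/SST and the adjusted version R_a² = ((n − 1)R² − k)/(n − k − 1) from a least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, -1.2]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((y - X @ beta_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - sse / sst                         # coefficient of determination
r2_adj = ((n - 1) * r2 - k) / (n - k - 1)    # adjusted R^2
print(r2, r2_adj)                            # r2_adj <= r2
```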

133 Inference in the Multiple Regression Model

Testing a Subset of β: Testing Nested Models

All testing of linear hypotheses (nonlinear hypotheses are rarely encoun- tered in practice) in linear models reduces essentially to putting linear constraints on the model space. The test amounts to comparing the re- sulting constrained model against the original unconstrained model.

We start with a model we know (assume, really) to be valid:

\[
y = \mu + e, \quad \text{where } \mu = X\beta \in C(X) \equiv V, \quad e \sim N_n(0, \sigma^2 I_n),
\]
and then ask the question of whether or not a simpler model holds, corresponding to µ ∈ V_0, where V_0 is a proper subspace of V. (E.g., V_0 = C(X_0), where X_0 is a matrix consisting of a subset of the columns of X.)

For example, consider the second order response surface model

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}^2 + \beta_4 x_{i2}^2 + \beta_5 x_{i1}x_{i2} + e_i, \quad i = 1, \ldots, n. \qquad (\dagger)
\]

This model says that E(y) is a quadratic function of x1 and x2.

A hypothesis we might be interested in here is that the second-order terms are unnecessary; i.e., we might be interested in H0 : β3 = β4 = β5 = 0, under which the model is linear in x1 and x2:

\[
y_i = \beta_0^* + \beta_1^* x_{i1} + \beta_2^* x_{i2} + e_i^*, \quad i = 1, \ldots, n. \qquad (\ddagger)
\]

• Testing H0 : β3 = β4 = β5 = 0 is equivalent to testing H0 :model (‡) holds versus H1 :model (†) holds but (‡) does not.

• I.e., we test H0 : µ ∈ C([ jn, x1, x2]) versus

H1 : µ ∈ C([ jn, x1, x2, x1∗x1, x2∗x2, x1∗x2]) and µ ∈/ C([ jn, x1, x2]).

Here ∗ denotes the element-wise product and µ = E(y).

134

Without loss of generality, we can always arrange the linear model so that the terms we want to test appear last in the linear predictor. So, we write our model as
\[
y = X\beta + e = (X_1, X_2)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + e
= \underbrace{X_1}_{n\times(k+1-h)}\beta_1 + \underbrace{X_2}_{n\times h}\beta_2 + e,
\qquad e \sim N(0, \sigma^2 I), \qquad \text{(FM)}
\]
where we are interested in the hypothesis H_0: β_2 = 0.

Under H_0: β_2 = 0 the model becomes

\[
y = X_1\beta_1^* + e^*, \qquad e^* \sim N(0, \sigma^2 I). \qquad \text{(RM)}
\]

The problem is to test

H0 : µ ∈ C(X1) (RM) versus H1 : µ ∈/ C(X1) under the maintained hypothesis that µ ∈ C(X) = C([X1, X2]) (FM).

We’d like to find a test statistic whose size measures the strength of the evidence against H0. If that evidence is overwhelming (the test statistic is large enough) then we reject H0.

The test statistic should be large, but large relative to what?

Large relative to its distribution under the null hypothesis.

How large?

That’s up to the user, but an α−level test rejects H0 if, assuming H0 is true, the probability of getting a test statistic at least as far from expected as the one obtained (the p−value) is less than α.

135 • E.g., suppose we compute a test statistic and obtain a p−value of p = 0.02. This says that assuming H0 is true, the results that we obtained were very unlikely (results this extreme should happen only 2% of the time). If these results are so unlikely assuming H0 is true, perhaps H0 is not true. The cut-off for how unlikely our results must be before we’re willing to reject H0 is the significance level α. (We reject if p < α.)

So, we want a test statistic that measures the strength of the evidence against H0 : µ ∈ C(X1) (i.e., one that is small for µ ∈ C(X1) and large for µ ∈/ C(X1)) whose distribution is available.

• This will lead to an F test which is equivalent to the likelihood ratio test, and which has some optimality properties.

Note that under RM, µ ∈ C(X1) ⊂ C(X) = C([X1, X2]). Therefore, if RM is true, then FM must be true as well. So, if RM is true, then the least squares estimates of the mean µ: PC(X1)y and PC(X)y are estimates of the same thing.

This suggests that the difference between the two estimates

\[
P_{C(X)}y - P_{C(X_1)}y = (P_{C(X)} - P_{C(X_1)})y
\]
should be small under H_0: µ ∈ C(X_1).

• Note that P_{C(X)} − P_{C(X_1)} is the projection matrix onto C(X_1)^⊥ ∩ C(X), the orthogonal complement of C(X_1) with respect to C(X), and C(X_1) ⊕ [C(X_1)^⊥ ∩ C(X)] = C(X). (See bottom of p. 43 of these notes.)

So, under H_0, (P_{C(X)} − P_{C(X_1)})y should be "small". A measure of the "smallness" of this vector is its squared length:

\[
\|(P_{C(X)} - P_{C(X_1)})y\|^2 = y^T(P_{C(X)} - P_{C(X_1)})y.
\]

136

By our result on expected values of quadratic forms,

\[
E[y^T(P_{C(X)} - P_{C(X_1)})y] = \sigma^2\dim[C(X_1)^{\perp}\cap C(X)] + \mu^T(P_{C(X)} - P_{C(X_1)})\mu
\]
\[
= \sigma^2 h + [(P_{C(X)} - P_{C(X_1)})\mu]^T[(P_{C(X)} - P_{C(X_1)})\mu]
= \sigma^2 h + (P_{C(X)}\mu - P_{C(X_1)}\mu)^T(P_{C(X)}\mu - P_{C(X_1)}\mu).
\]

Under H0, µ ∈ C(X1) and µ ∈ C(X), so

(PC(X)µ − PC(X1)µ) = µ − µ = 0.

Under H1,

PC(X)µ = µ, but PC(X1)µ 6= µ.

I.e., letting µ_0 denote p(µ|C(X_1)),
\[
E[y^T(P_{C(X)} - P_{C(X_1)})y] =
\begin{cases}
\sigma^2 h, & \text{under } H_0; \\
\sigma^2 h + \|\mu - \mu_0\|^2, & \text{under } H_1.
\end{cases}
\]

• That is, under H0 we expect the squared length of

\[
P_{C(X)}y - P_{C(X_1)}y \equiv \hat{y} - \hat{y}_0
\]

to be small, on the order of σ²h. If H_0 is not true, then the squared length of ŷ − ŷ_0 will be larger, with expected value σ²h + ‖µ − µ_0‖².

Therefore, if σ² is known,

\[
\frac{\|\hat{y} - \hat{y}_0\|^2}{\sigma^2 h} = \frac{\|\hat{y} - \hat{y}_0\|^2/h}{\sigma^2}
\;\begin{cases}
\approx 1, & \text{under } H_0 \\
> 1, & \text{under } H_1
\end{cases}
\]
is an appropriate test statistic for testing H_0.

137

Typically, σ² will not be known, so it must be estimated. The appropriate estimator is s² = ‖y − ŷ‖²/(n − k − 1), the mean squared error from FM, the model which is valid under both H_0 and H_1. Our test statistic then becomes

\[
F = \frac{\|\hat{y} - \hat{y}_0\|^2/h}{s^2} = \frac{\|\hat{y} - \hat{y}_0\|^2/h}{\|y - \hat{y}\|^2/(n - k - 1)}
\;\begin{cases}
\approx 1, & \text{under } H_0 \\
> 1, & \text{under } H_1.
\end{cases}
\]

By the theorems on pp. 84–85, the following results on the numerator and denominator of F hold:

Theorem: Suppose y ∼ N(Xβ, σ²I), where X is n × (k + 1) of full rank, Xβ = X_1β_1 + X_2β_2, and X_2 is n × h. Let ŷ = p(y|C(X)) = P_{C(X)}y, ŷ_0 = p(y|C(X_1)) = P_{C(X_1)}y, and µ_0 = p(µ|C(X_1)) = P_{C(X_1)}µ. Then

(i) (1/σ²)‖y − ŷ‖² = (1/σ²) y^T(I − P_{C(X)})y ∼ χ²(n − k − 1);

(ii) (1/σ²)‖ŷ − ŷ_0‖² = (1/σ²) y^T(P_{C(X)} − P_{C(X_1)})y ∼ χ²(h, λ_1), where

\[
\lambda_1 = \frac{1}{2\sigma^2}\|(P_{C(X)} - P_{C(X_1)})\mu\|^2 = \frac{1}{2\sigma^2}\|\mu - \mu_0\|^2;
\]
and

(iii) (1/σ²)‖y − ŷ‖² and (1/σ²)‖ŷ − ŷ_0‖² are independent.

Proof: Parts (i) and (ii) follow immediately from part (3) of the theorem on p. 84. Part (iii) follows because

\[
\|y - \hat{y}\|^2 = \|p(y\,|\,C(X)^{\perp})\|^2
\qquad \text{and} \qquad
\|\hat{y} - \hat{y}_0\|^2 = \|p(y\,|\,\underbrace{C(X_1)^{\perp}\cap C(X)}_{\subset C(X)})\|^2
\]
are squared lengths of projections onto orthogonal subspaces, so they are independent according to the theorem on p. 85.

138 From this result, the distribution of our test statistic F follows easily:

Theorem: Under the conditions of the previous theorem,
\[
F = \frac{\|\hat{y} - \hat{y}_0\|^2/h}{s^2} = \frac{y^T(P_{C(X)} - P_{C(X_1)})y/h}{y^T(I - P_{C(X)})y/(n - k - 1)}
\sim \begin{cases}
F(h, n - k - 1), & \text{under } H_0; \\
F(h, n - k - 1, \lambda_1), & \text{under } H_1,
\end{cases}
\]
where λ_1 is as given in the previous theorem.

Proof: Follows the previous theorem and the definition of the F distribu- tion.

Therefore, the α−level F −test for H0 : β2 = 0 versus H1 : β2 6= 0 (equivalently, of RM vs. FM) is:

reject H0 if F > F1−α(h, n − k − 1).

• It is worth noting that the numerator of this F test can be obtained as the difference in the SSE's under FM and RM divided by the difference in the dfE (degrees of freedom for error) for the two models. This is so because the Pythagorean Theorem yields
\[
\|\hat{y} - \hat{y}_0\|^2 = \|y - \hat{y}_0\|^2 - \|y - \hat{y}\|^2 = SSE(RM) - SSE(FM).
\]

The difference in the dfE's is dfE(RM) − dfE(FM) = [n − (k + 1 − h)] − [n − (k + 1)] = h. Therefore,
\[
F = \frac{[SSE(RM) - SSE(FM)]/[dfE(RM) - dfE(FM)]}{SSE(FM)/dfE(FM)}.
\]

• In addition, because SSE = SST − SSR,
\[
\|\hat{y} - \hat{y}_0\|^2 = SSE(RM) - SSE(FM) = SST - SSR(RM) - [SST - SSR(FM)]
= SSR(FM) - SSR(RM) \equiv SS(\beta_2|\beta_1),
\]

which we denote by SS(β_2|β_1) and which is known as the "extra" regression sum of squares due to β_2 after accounting for β_1.
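A sketch (simulated data of my own) of the full-versus-reduced-model F test of H_0: β_2 = 0, computed exactly as above from the two SSE's:

```python
import numpy as np
from scipy import stats

def sse(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    return r @ r

rng = np.random.default_rng(10)
n, k, h = 60, 4, 2                       # the last h = 2 coefficients are being tested
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(size=n)  # H0 true here

sse_fm = sse(X, y)                       # full model
sse_rm = sse(X[:, : k + 1 - h], y)       # reduced model drops the last h columns
F = ((sse_rm - sse_fm) / h) / (sse_fm / (n - k - 1))
p_value = stats.f.sf(F, h, n - k - 1)
print(F, p_value)
```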

139 The results leading to the F -test for H0 : β2 = 0 that we have just developed can be summarized in an ANOVA table:

Source of Variation          Sum of Squares                                 df           Mean Square        F

Due to β_2 adjusted for β_1  SS(β_2|β_1) = y^T(P_{C(X)} − P_{C(X_1)})y      h            SS(β_2|β_1)/h      MS(β_2|β_1)/MSE
Error                        SSE = y^T(I − P_{C(X)})y                       n − k − 1    SSE/(n − k − 1)
Total (Corr.)                SST = y^T y − nȳ²

An additional column is sometimes added to the ANOVA table for E(MS), or expected mean squares. The expected mean squares here are

\[
E\{MS(\beta_2|\beta_1)\} = \frac{1}{h}E\{SS(\beta_2|\beta_1)\} = \frac{\sigma^2}{h}E\{SS(\beta_2|\beta_1)/\sigma^2\}
= \frac{\sigma^2}{h}\{h + 2\lambda_1\} = \sigma^2 + \frac{1}{h}\|\mu - \mu_0\|^2
\]
and
\[
E(MSE) = \frac{1}{n - k - 1}E(SSE) = \frac{1}{n - k - 1}(n - k - 1)\sigma^2 = \sigma^2.
\]

These expected mean squares give additional insight into why F is an appropriate test statistic for H_0: β_2 = 0. Any mean square can be thought of as an estimate of its expectation. Therefore, MSE estimates σ² (always), and MS(β_2|β_1) estimates σ² under H_0 and estimates σ² plus a positive quantity under H_1. Therefore, our test statistic F will behave as
\[
F \;\begin{cases} \approx 1, & \text{under } H_0 \\ > 1, & \text{under } H_1, \end{cases}
\]

where how much larger F is than 1 depends upon “how false” H0 is.

140 Overall Regression Test:

An important special case of the test of H0 : β2 = 0 that we have just developed is when we partition β so that β1 contains just the intercept and when β2 contains all of the regression coefficients. That is, if we write the model as

\[
y = X_1\beta_1 + X_2\beta_2 + e
= \beta_0 j_n + \underbrace{\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}}_{=X_2}
\underbrace{\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}}_{=\beta_2} + e,
\]
then our hypothesis H_0: β_2 = 0 is equivalent to

\[
H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0,
\]
which says that the collection of explanatory variables x_1, . . . , x_k has no linear effect on (does not predict) y.

The test of this hypothesis is called the overall regression test and occurs as a special case of the test of β_2 = 0 that we've developed. Under H_0, ŷ_0 = p(y|C(X_1)) = p(y|L(j_n)) = ȳj_n and h = k, so the numerator of our F-test statistic becomes

\[
\frac{1}{k}y^T(P_{C(X)} - P_{L(j_n)})y = \frac{1}{k}\left(y^T P_{C(X)}y - y^T P_{L(j_n)}y\right)
= \frac{1}{k}\{(P_{C(X)}y)^T y - y^T P_{L(j_n)}^T\underbrace{P_{L(j_n)}y}_{=\bar{y}j_n}\}
= \frac{1}{k}\left(\hat{\beta}^T X^T y - n\bar{y}^2\right) = SSR/k \equiv MSR.
\]

141

Thus, the test statistic for overall regression is given by

\[
F = \frac{SSR/k}{SSE/(n - k - 1)} = \frac{MSR}{MSE}
\sim \begin{cases}
F(k, n - k - 1), & \text{under } H_0: \beta_1 = \cdots = \beta_k = 0 \\
F\!\left(k, n - k - 1, \frac{1}{2\sigma^2}\beta_2^T X_2^T P_{L(j_n)^{\perp}}X_2\beta_2\right), & \text{otherwise.}
\end{cases}
\]

The ANOVA table for this test is given below. This ANOVA table is typically part of the output of regression software (e.g., PROC REG in SAS).

Source of Variation   Sum of Squares                 df           Mean Square        F

Regression            SSR = β̂^T X^T y − nȳ²         k            SSR/k              MSR/MSE
Error                 SSE = y^T(I − P_{C(X)})y       n − k − 1    SSE/(n − k − 1)
Total (Corr.)         SST = y^T y − nȳ²

142 F test in terms of R2:

The F test we have just developed can be written in terms of R2, the coefficient of determination. This relationship is given by the following theorem.

Theorem: The F statistic for testing H_0: β_2 = 0 in the full rank model y = X_1β_1 + X_2β_2 + e (top of p. 138) can be written in terms of R² as
\[
F = \frac{(R^2_{FM} - R^2_{RM})/h}{(1 - R^2_{FM})/(n - k - 1)},
\]
where R²_{FM} corresponds to the full model y = X_1β_1 + X_2β_2 + e, and R²_{RM} corresponds to the reduced model y = X_1β_1^* + e^*.

Proof: Homework.

Corollary: The F statistic for overall regression (for testing H_0: β_1 = β_2 = ··· = β_k = 0) in the full rank model y_i = β_0 + β_1x_{i1} + ··· + β_kx_{ik} + e_i, i = 1, . . . , n, with e_1, . . . , e_n iid N(0, σ²), can be written in terms of R², the coefficient of determination from this model, as follows:
\[
F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}.
\]

Proof: For this hypothesis h, the dimension of the regression parameter being tested, is k. In addition, the reduced model here is

\[
y = j_n\beta_0 + e,
\]
so (Xβ̂)_{RM}, the estimated mean of y under the reduced model, is (Xβ̂)_{RM} = j_nȳ. So, R²_{RM} in the previous theorem is (cf. p. 131):
\[
R^2_{RM} = \frac{(X\hat{\beta})_{RM}^T y - n\bar{y}^2}{y^T y - n\bar{y}^2}
= \frac{\bar{y}\,\overbrace{j_n^T y}^{=n\bar{y}} - n\bar{y}^2}{y^T y - n\bar{y}^2} = 0.
\]
The result now follows from the previous theorem.

143 The General Linear Hypothesis H0 : Cβ = t

The hypothesis H0 : Cβ = t is called the general linear hypothesis. Here C is a q × (k + 1) matrix of (known) coefficients with rank(C) = q. We will consider the slightly simpler case H : Cβ = 0 (i.e., t = 0) first.

Most of the questions that are typically asked about the coefficients of a linear model can be formulated as hypotheses that can be written in the form H_0: Cβ = 0, for some C. For example, the hypothesis H_0: β_2 = 0 in the model
\[
y = X_1\beta_1 + X_2\beta_2 + e, \qquad e \sim N(0, \sigma^2 I),
\]
can be written as
\[
H_0: C\beta = (\underbrace{0}_{h\times(k+1-h)},\, I_h)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \beta_2 = 0.
\]
The test of overall regression can be written as
\[
H_0: C\beta = (\underbrace{0}_{k\times 1},\, I_k)\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
= \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} = 0.
\]

Hypotheses encompassed by H_0: Cβ = 0 are not limited to ones in which certain regression coefficients are set equal to zero. Another example that can be handled is the hypothesis H_0: β_1 = β_2 = ··· = β_k. For example, suppose k = 4; then this hypothesis can be written as
\[
H_0: C\beta = \begin{pmatrix}
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix}
= \begin{pmatrix} \beta_1 - \beta_2 \\ \beta_2 - \beta_3 \\ \beta_3 - \beta_4 \end{pmatrix} = 0.
\]
Another, equally good, choice for C in this example is
\[
C = \begin{pmatrix}
0 & 1 & -1 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & -1
\end{pmatrix}.
\]

144

The test statistic for H_0: Cβ = 0 is based on comparing Cβ̂ to its null value 0, using a squared statistical distance (quadratic form) of the form

\[
Q = \{C\hat{\beta} - \underbrace{E_0(C\hat{\beta})}_{=0}\}^T\{\widehat{\operatorname{var}}_0(C\hat{\beta})\}^{-1}\{C\hat{\beta} - E_0(C\hat{\beta})\}
= (C\hat{\beta})^T\{\widehat{\operatorname{var}}_0(C\hat{\beta})\}^{-1}(C\hat{\beta}).
\]

• Here, the 0 subscript is there to indicate that the expected value and variance are computed under H0.

Recall that β̂ ∼ N_{k+1}(β, σ²(X^T X)^{-1}). Therefore,

\[
C\hat{\beta} \sim N_q(C\beta, \sigma^2 C(X^T X)^{-1}C^T).
\]

We estimate σ² using s² = MSE = SSE/(n − k − 1), so

\[
\widehat{\operatorname{var}}_0(C\hat{\beta}) = s^2 C(X^T X)^{-1}C^T
\]
and Q becomes

\[
Q = (C\hat{\beta})^T\{s^2 C(X^T X)^{-1}C^T\}^{-1}C\hat{\beta}
= \frac{(C\hat{\beta})^T\{C(X^T X)^{-1}C^T\}^{-1}C\hat{\beta}}{SSE/(n - k - 1)}.
\]

145 To use Q to form a test statistic, we need its distribution, which is given by the following theorem:

Theorem: If y ∼ N_n(Xβ, σ²I_n), where X is n × (k + 1) of full rank and C is q × (k + 1) of rank q ≤ k + 1, then

(i) Cβ̂ ∼ N_q[Cβ, σ²C(X^T X)^{-1}C^T];

(ii) (Cβ̂)^T[C(X^T X)^{-1}C^T]^{-1}Cβ̂/σ² ∼ χ²(q, λ), where

\[
\lambda = (C\beta)^T[C(X^T X)^{-1}C^T]^{-1}C\beta/(2\sigma^2);
\]

(iii) SSE/σ² ∼ χ²(n − k − 1); and

(iv) (Cβ̂)^T[C(X^T X)^{-1}C^T]^{-1}Cβ̂ and SSE are independent.

Proof: Part (i) follows from the normality of βˆ and that Cβˆ is an affine transformation of a normal. Part (iii) has been proved previously (p. 138).

(ii) Recall the theorem on the bottom of p. 82 (Thm 5.5A in our text). This theorem says that if y ∼ N_n(µ, Σ) and A is n × n of rank r, then y^T Ay ∼ χ²(r, ½µ^T Aµ) iff AΣ is idempotent. Here Cβ̂ plays the role of y, Cβ plays the role of µ, σ²C(X^T X)^{-1}C^T plays the role of Σ, and {σ²C(X^T X)^{-1}C^T}^{-1} plays the role of A. The result then follows because AΣ = {σ²C(X^T X)^{-1}C^T}^{-1}σ²C(X^T X)^{-1}C^T = I is obviously idempotent.

(iv) Since β̂ and SSE are independent (p. 115), (Cβ̂)^T[C(X^T X)^{-1}C^T]^{-1}Cβ̂ (a function of β̂) and SSE must be independent.

Therefore,

\[
F = Q/q = \frac{(C\hat{\beta})^T\{C(X^T X)^{-1}C^T\}^{-1}C\hat{\beta}/q}{SSE/(n - k - 1)} = \frac{SSH/q}{SSE/(n - k - 1)}
\]
has the form of a ratio of independent χ²'s, each divided by its d.f.

• Here, SSH denotes (Cβ̂)^T{C(X^T X)^{-1}C^T}^{-1}Cβ̂, the sum of squares due to the hypothesis H_0.

146

Theorem: If y ∼ N_n(Xβ, σ²I_n), where X is n × (k + 1) of full rank and C is q × (k + 1) of rank q ≤ k + 1, then

\[
F = \frac{(C\hat{\beta})^T\{C(X^T X)^{-1}C^T\}^{-1}C\hat{\beta}/q}{SSE/(n - k - 1)} = \frac{SSH/q}{SSE/(n - k - 1)}
\sim \begin{cases}
F(q, n - k - 1), & \text{if } H_0: C\beta = 0 \text{ is true;} \\
F(q, n - k - 1, \lambda), & \text{if } H_0: C\beta = 0 \text{ is false,}
\end{cases}
\]
where λ is as in the previous theorem.

Proof: Follows from the previous theorem and the definition of the F distribution.

So, to conduct a hypothesis test of H_0: Cβ = 0, we compute F and reject at level α if F > F_{1−α}(q, n − k − 1) (here F_{1−α} denotes the (1 − α)th quantile, i.e., the upper-α quantile, of the F distribution).

147 The general linear hypothesis as a test of nested models:

We have seen that the test of β2 = 0 in the model y = X1β1 + X2β2 + e can be formulated as a test of Cβ = 0. Therefore, special cases of the general linear hypothesis correspond to tests of nested (full and reduced) models. In fact, all F tests of the general linear hypothesis H0 : Cβ = 0 can be formulated as tests of nested models.

Theorem: The F test for the general linear hypothesis H0 : Cβ = 0 is a full-and-reduced-model test.

Proof: The book, in combination with a homework problem, provides a proof based on Lagrange multipliers. Here we offer a different proof based on geometry.

Under H0,

y = Xβ + e and Cβ = 0 ⇒ C(X^T X)^{-1}X^T Xβ = 0 ⇒ C(X^T X)^{-1}X^T µ = 0 ⇒ T^T µ = 0, where T = X(X^T X)^{-1}C^T.

That is, under H_0, µ = Xβ ∈ C(X) = V and µ ⊥ C(T), or

\[
\mu \in [C(T)^{\perp}\cap C(X)] = V_0,
\]

where V_0 = C(T)^⊥ ∩ C(X) is the orthogonal complement of C(T) with respect to C(X).

• Thus, under H0 : Cβ = 0, µ ∈ V0 ⊂ V = C(X), and under H1 : Cβ 6= 0, µ ∈ V but µ ∈/ V0. That is, these hypotheses correspond to nested models. It just remains to establish that the F test for these nested models is the F test for the general linear hypothesis H0 : Cβ = 0 given on p. 147.

148 The F test statistic for nested models given on p. 139 is

\[
F = \frac{y^T(P_{C(X)} - P_{C(X_1)})y/h}{SSE/(n - k - 1)}.
\]

Here, we replace P_{C(X_1)} by the projection matrix onto V_0:

\[
P_{V_0} = P_{C(X)} - P_{C(T)},
\]
and replace h with dim(V) − dim(V_0), the reduction in dimension of the model space when we go from the full to the reduced model.

Since V0 is the orthogonal complement of C(T) with respect to C(X), dim(V0) is given by

dim(V0) = dim(C(X)) − dim(C(T)) = rank(X) − rank(T) = k + 1 − q

Here, rank(T) = q by the following argument:
\[
\operatorname{rank}(T) = \operatorname{rank}(T^T) \geq \operatorname{rank}(T^T X) = \operatorname{rank}(C(X^T X)^{-1}X^T X) = \operatorname{rank}(C) = q
\]
and

\[
\operatorname{rank}(T) = \operatorname{rank}(T^T T) = \operatorname{rank}(C(X^T X)^{-1}X^T X(X^T X)^{-1}C^T)
= \operatorname{rank}(\underbrace{C(X^T X)^{-1}C^T}_{q\times q}) \leq q.
\]

Therefore, q ≤ rank(T) ≤ q ⇒ rank(T) = q.

So, h = dim(V ) − dim(V0) = (k + 1) − [(k + 1) − q] = q.

149 Thus the full vs. reduced model F statistic becomes

\[
F = \frac{y^T[P_{C(X)} - P_{V_0}]y/q}{SSE/(n - k - 1)}
= \frac{y^T[P_{C(X)} - (P_{C(X)} - P_{C(T)})]y/q}{SSE/(n - k - 1)}
= \frac{y^T P_{C(T)}y/q}{SSE/(n - k - 1)},
\]
where

\[
y^T P_{C(T)}y = y^T T(T^T T)^{-1}T^T y
= y^T X(X^T X)^{-1}C^T\{C(X^T X)^{-1}X^T X(X^T X)^{-1}C^T\}^{-1}C(X^T X)^{-1}X^T y
\]
\[
= \underbrace{y^T X(X^T X)^{-1}}_{=\hat{\beta}^T}C^T\{C(X^T X)^{-1}C^T\}^{-1}C\underbrace{(X^T X)^{-1}X^T y}_{=\hat{\beta}}
= \hat{\beta}^T C^T\{C(X^T X)^{-1}C^T\}^{-1}C\hat{\beta},
\]
which is our test statistic for the general linear hypothesis H_0: Cβ = 0 from p. 147.

The case H0 : Cβ = t where t 6= 0:

Extension to this case is straightforward. The only requirement is that the system of equations Cβ = t be consistent, which is ensured by C having full row rank q.

Then the F test statistic for H_0: Cβ = t is given by
\[
F = \frac{(C\hat{\beta} - t)^T[C(X^T X)^{-1}C^T]^{-1}(C\hat{\beta} - t)/q}{SSE/(n - k - 1)}
\sim \begin{cases}
F(q, n - k - 1), & \text{under } H_0 \\
F(q, n - k - 1, \lambda), & \text{otherwise,}
\end{cases}
\]
where λ = (Cβ − t)^T[C(X^T X)^{-1}C^T]^{-1}(Cβ − t)/(2σ²).
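A sketch (simulated data and a hypothesis of my own choosing) of the general linear hypothesis test of H_0: Cβ = t, computed directly from the formula above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 2.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)
XtX_inv = np.linalg.inv(X.T @ X)

# H0: beta_1 = beta_2 = beta_3 (true for the data generated above), i.e. C beta = 0
C = np.array([[0, 1, -1, 0],
              [0, 0, 1, -1]], dtype=float)
t = np.zeros(2)
q = C.shape[0]
d = C @ beta_hat - t
F = d @ np.linalg.solve(C @ XtX_inv @ C.T, d) / (q * s2)
print(F, stats.f.sf(F, q, n - k - 1))        # F statistic and its p-value
```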

150

Tests on β_j and on a^T β:

Tests of H_0: β_j = 0 or H_0: a^T β = 0 occur as special cases of the tests we have already considered. To test H_0: a^T β = 0, we use a^T in place of C in our test of the general linear hypothesis Cβ = 0. In this case q = 1 and the test statistic becomes

\[
F = \frac{(a^T\hat{\beta})^T[a^T(X^T X)^{-1}a]^{-1}a^T\hat{\beta}}{SSE/(n - k - 1)} = \frac{(a^T\hat{\beta})^2}{s^2 a^T(X^T X)^{-1}a}
\sim F(1, n - k - 1) \quad \text{under } H_0: a^T\beta = 0.
\]

• Note that since t²(ν) = F(1, ν), an equivalent test of H_0: a^T β = 0 is given by the t-test with test statistic

\[
t = \frac{a^T\hat{\beta}}{s\sqrt{a^T(X^T X)^{-1}a}} \sim t(n - k - 1) \quad \text{under } H_0.
\]

An important special case of the hypothesis H_0: a^T β = 0 occurs when a = (0, . . . , 0, 1, 0, . . . , 0)^T, where the 1 appears in the (j + 1)st position. This is the hypothesis H_0: β_j = 0, and it says that the jth explanatory variable x_j has no partial regression effect on y (no effect above and beyond the effects of the other explanatory variables in the model).

The test statistic for this hypothesis simplifies from that given above to yield
\[
F = \frac{\hat{\beta}_j^2}{s^2 g_{jj}} \sim F(1, n - k - 1) \quad \text{under } H_0: \beta_j = 0,
\]

where g_{jj} is the jth diagonal element of (X^T X)^{-1}. Equivalently, we could use the t test statistic

\[
t = \frac{\hat{\beta}_j}{s\sqrt{g_{jj}}} = \frac{\hat{\beta}_j}{\widehat{s.e.}(\hat{\beta}_j)} \sim t(n - k - 1) \quad \text{under } H_0: \beta_j = 0.
\]
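A sketch (simulated data of my own) of the t statistics for H_0: β_j = 0, using t_j = β̂_j/(s√g_{jj}) with g_{jj} the corresponding diagonal entry of (X^T X)^{-1}:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.0, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - k - 1))
g = np.diag(np.linalg.inv(X.T @ X))
t_stats = beta_hat / (s * np.sqrt(g))                 # one t statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(t_stats), n - k - 1) # two-sided p-values
print(t_stats, p_values)
```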

151 Confidence and Prediction Intervals

Hypothesis tests and confidence regions (e.g., intervals) are really two dif- ferent ways to look at the same problem.

• For an α-level test of a hypothesis of the form H0 : θ = θ0, a 100(1 − α)% confidence region for θ is given by all those values of θ0 such that the hypothesis would not be rejected. That is, the acceptance region of the α-level test is the 100(1 − α)% confidence region for θ.

• Conversely, θ0 falls outside of a 100(1 − α)% confidence region for θ iff an α level test of H0 : θ = θ0 is rejected.

• That is, we can invert the statistical tests that we have derived to obtain confidence regions for parameters of the linear model.

Confidence Region for β:

If we set C = Ik+1 and t = β in the F statistic on the bottom of p. 150, we obtain

\[
\frac{(\hat{\beta} - \beta)^T X^T X(\hat{\beta} - \beta)/(k + 1)}{s^2} \sim F(k + 1, n - k - 1).
\]

From this distributional result, we can make the probability statement
\[
\Pr\left\{\frac{(\hat{\beta} - \beta)^T X^T X(\hat{\beta} - \beta)}{s^2(k + 1)} \leq F_{1-\alpha}(k + 1, n - k - 1)\right\} = 1 - \alpha.
\]

Therefore, the set of all vectors β that satisfy

\[
(\hat{\beta} - \beta)^T X^T X(\hat{\beta} - \beta) \leq (k + 1)s^2 F_{1-\alpha}(k + 1, n - k - 1)
\]
forms a 100(1 − α)% confidence region for β.

152

• Such a region is an ellipsoid, and it is easy to draw and interpret only for k = 1 (e.g., simple linear regression).

• If one can't plot the region and then plot a point to see whether it's in or out of the region (i.e., for k > 1), then this region isn't any more informative than the test of H_0: β = β_0. To decide whether β_0 is in the region, we essentially have to perform the test!

• More useful are confidence intervals for the individual βj’s and for linear combinations of the form aT β.

Confidence Interval for aT β:

If we set C = a^T and t = a^T β in the F statistic on the bottom of p. 150, we obtain

(a^T βˆ − a^T β)² / [s² a^T (X^T X)^{-1} a] ~ F(1, n − k − 1),

which implies

(a^T βˆ − a^T β) / [s √(a^T (X^T X)^{-1} a)] ~ t(n − k − 1).

From this distributional result, we can make the probability statement

Pr{ t_{α/2}(n − k − 1) ≤ (a^T βˆ − a^T β) / [s √(a^T (X^T X)^{-1} a)] ≤ t_{1−α/2}(n − k − 1) } = 1 − α,

where t_{α/2}(n − k − 1) = −t_{1−α/2}(n − k − 1).

Rearranging this inequality so that a^T β falls in the middle, we get

Pr{ a^T βˆ − t_{1−α/2}(n − k − 1) s √(a^T (X^T X)^{-1} a) ≤ a^T β ≤ a^T βˆ + t_{1−α/2}(n − k − 1) s √(a^T (X^T X)^{-1} a) } = 1 − α.

Therefore, a 100(1 − α)% CI for a^T β is given by

a^T βˆ ± t_{1−α/2}(n − k − 1) s √(a^T (X^T X)^{-1} a).
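A corresponding computational sketch for this interval (Python/numpy/scipy; the function name is mine, and the formula is just the one displayed above):

import numpy as np
from scipy import stats

def ci_for_atb(X, y, a, alpha=0.05):
    # 100(1-alpha)% confidence interval for a'beta in the full-rank linear model.
    n, p = X.shape                          # p = k + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s = np.sqrt((y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p))
    half = stats.t.ppf(1 - alpha / 2, n - p) * s * np.sqrt(a @ XtX_inv @ a)
    return a @ beta_hat - half, a @ beta_hat + half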

153 Confidence Interval for βj:

A special case of this interval occurs when a = (0, ..., 0, 1, 0, ..., 0)^T, where the 1 is in the (j + 1)th position. In this case a^T β = βj, a^T βˆ = βˆj, and a^T (X^T X)^{-1} a = {(X^T X)^{-1}}jj ≡ gjj. The confidence interval for βj is then given by

βˆj ± t_{1−α/2}(n − k − 1) s √gjj.

Confidence Interval for E(y):

Let x0 = (1, x01, x02, ..., x0k)^T denote a particular choice of the vector of explanatory variables x = (1, x1, x2, ..., xk)^T and let y0 denote the corresponding response.

We assume that the model y = Xβ + e, e ~ N(0, σ²I), applies to (y0, x0) as well. This may be because (y0, x0) were in the original sample to which the model was fit (i.e., x0^T is a row of X), or because we believe that (y0, x0) will behave similarly to the data (y, X) in the sample. Then

y0 = x0^T β + e0,  e0 ~ N(0, σ²),

where β and σ² are the same parameters as in the fitted model y = Xβ + e.

Suppose we wish to find a CI for

E(y0) = x0^T β.

This quantity is of the form a^T β where a = x0, so the BLUE of E(y0) is x0^T βˆ, and a 100(1 − α)% CI for E(y0) is given by

x0^T βˆ ± t_{1−α/2}(n − k − 1) s √(x0^T (X^T X)^{-1} x0).

154 • This confidence interval holds for a particular value x0^T β. Sometimes it is of interest to form simultaneous confidence intervals around each and every point x0^T β for all x0 in the range of x. That is, we sometimes desire a simultaneous confidence band for the entire regression line (or plane, for k > 1). The confidence interval given above, if plotted for each value of x0, does not give such a simultaneous band; instead it gives a “point-wise” band. For discussion of simultaneous intervals see §8.6.7 of our text.

• The confidence interval given above is for E(y0), not for y0 itself. E(y0) is a parameter, y0 is a random variable. Therefore, we can’t estimate y0 or form a confidence interval for it. However, we can predict its value, and an interval around that prediction that quantifies the uncertainty associated with that prediction is called a prediction interval.

Prediction Interval for an Unobserved y-value:

For an unobserved value y0 with known explanatory vector x0 assumed to follow our linear model y = Xβ + e, we predict y0 by

ŷ0 = x0^T βˆ.

• Note that this predictor of y0 coincides with our estimator of E(y0). However, the uncertainty associated with the quantity x0^T βˆ as a predictor of y0 is different from (greater than) its uncertainty as an estimator of E(y0). Why? Because observations (e.g., y0) are more variable than their means (e.g., E(y0)).

155 To form a CI for the estimator x0^T βˆ of E(y0) we examine the variance of the error of estimation:

var{E(y0) − x0^T βˆ} = var(x0^T βˆ).

In contrast, to form a PI for the predictor x0^T βˆ of y0, we examine the variance of the error of prediction:

var(y0 − x0^T βˆ) = var(y0) + var(x0^T βˆ) − 2 cov(y0, x0^T βˆ)   (the covariance is 0, since y0 is independent of y)
= var(x0^T β + e0) + var(x0^T βˆ)
= var(e0) + var(x0^T βˆ) = σ² + σ² x0^T (X^T X)^{-1} x0.

Since σ² is unknown, we must estimate this quantity with s², yielding

var̂(y0 − ŷ0) = s²{1 + x0^T (X^T X)^{-1} x0}.

It’s not hard to show that

(y0 − ŷ0) / [s √(1 + x0^T (X^T X)^{-1} x0)] ~ t(n − k − 1),

therefore

Pr{ −t_{1−α/2}(n − k − 1) ≤ (y0 − ŷ0) / [s √(1 + x0^T (X^T X)^{-1} x0)] ≤ t_{1−α/2}(n − k − 1) } = 1 − α.

Rearranging,

Pr{ ŷ0 − t_{1−α/2}(n − k − 1) s √(1 + x0^T (X^T X)^{-1} x0) ≤ y0 ≤ ŷ0 + t_{1−α/2}(n − k − 1) s √(1 + x0^T (X^T X)^{-1} x0) } = 1 − α.

Therefore, a 100(1 − α)% prediction interval for y0 is given by

ŷ0 ± t_{1−α/2}(n − k − 1) s √(1 + x0^T (X^T X)^{-1} x0).

• Once again, this is a point-wise interval. Simultaneous prediction intervals for predicting multiple y-values with given coverage proba- bility are discussed in §8.6.7.
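The point-wise interval formulas above translate directly into code; here is a hedged Python/numpy/scipy sketch (function and variable names are mine) returning both the CI for E(y0) and the PI for y0 at a given x0.

import numpy as np
from scipy import stats

def mean_ci_and_pi(X, y, x0, alpha=0.05):
    # Point-wise CI for E(y0) = x0'beta and prediction interval for a new y0.
    n, p = X.shape                          # p = k + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s = np.sqrt((y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p))
    y0_hat = x0 @ beta_hat
    t_crit = stats.t.ppf(1 - alpha / 2, n - p)
    h = x0 @ XtX_inv @ x0                   # x0'(X'X)^{-1} x0
    ci = (y0_hat - t_crit * s * np.sqrt(h), y0_hat + t_crit * s * np.sqrt(h))
    pi = (y0_hat - t_crit * s * np.sqrt(1 + h), y0_hat + t_crit * s * np.sqrt(1 + h))
    return ci, pi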

156 Equivalence of the F-test and Likelihood Ratio Test:

Recall that for the classical linear model y = Xβ + e, with normal, homoscedastic errors, the likelihood function is given by

L(β, σ²; y) = (2πσ²)^{-n/2} exp{−||y − Xβ||²/(2σ²)},

or, expressing the likelihood as a function of µ = Xβ instead of β:

L(µ, σ²; y) = (2πσ²)^{-n/2} exp{−||y − µ||²/(2σ²)}.

• L(µ, σ2; y) gives the probability of observing y for specified values of the parameters µ and σ2 (to be more precise, the probability that the response vector is “close to” the observed value y).

– or, roughly, it measures how likely the data are for given values of the parameters.

The idea behind a likelihood ratio test (LRT) for some hypothesis H0 is to compare the likelihood function maximized over the parameters subject to the restriction imposed by H0 (the constrained maximum likelihood) with the likelihood function maximized over the parameters without assuming H0 is true (the unconstrained maximum likelihood).

• That is, we compare how probable the data are under the most favor- able values of the parameters subject to H0 (the constrained MLEs), with how probable the data are under the most favorable values of the parameters under the maintained hypothesis (the unconstrained MLEs).

• If assuming H0 makes the data substantially less probable than not assuming H0, then we reject H0.

157 Consider testing H0 : µ ∈ V0 versus H1 : µ ∈/ V0 under the maintained hypothesis that µ is in V . Here V0 ⊂ V and dim(V0) = k +1−h ≤ k +1 = dim(V ).

Let ŷ = p(y|V) and ŷ0 = p(y|V0). Then the unconstrained MLEs of (µ, σ²) are µˆ = ŷ and σˆ² = ||y − ŷ||²/n, and the constrained MLEs are µˆ0 = ŷ0 and σˆ0² = ||y − ŷ0||²/n. Therefore, the likelihood ratio statistic is

LR = sup_{µ∈V0} L(µ, σ²; y) / sup_{µ∈V} L(µ, σ²; y) = L(ŷ0, σˆ0²) / L(ŷ, σˆ²)
= [(2πσˆ0²)^{-n/2} exp{−||y − ŷ0||²/(2σˆ0²)}] / [(2πσˆ²)^{-n/2} exp{−||y − ŷ||²/(2σˆ²)}]
= (σˆ0²/σˆ²)^{-n/2} exp(−n/2)/exp(−n/2) = (σˆ0²/σˆ²)^{-n/2}

• We reject for small values of LR. Typically in LRTs, we work with λ = −2 log(LR) so that we can reject for large values of λ. In this case, λ = n log(σˆ0²/σˆ²).

• Equivalently, we reject for large values of σˆ0²/σˆ², where

σˆ0²/σˆ² = ||y − ŷ0||²/||y − ŷ||² = (||y − ŷ||² + ||ŷ − ŷ0||²)/||y − ŷ||²
= 1 + ||ŷ − ŷ0||²/||y − ŷ||² = 1 + [h/(n − k − 1)] F,

a monotone function of F (cf. the F statistic on the top of p. 138).

Therefore, large values of λ correspond to large values of F and the decision rules based on LR and on F are the same.

• Therefore, the LRT and the F −test are equivalent.
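The equivalence is easy to confirm numerically. The following Python/numpy sketch (simulated data; all names mine) computes F for a full-vs-reduced comparison and checks that λ = −2 log(LR) equals n log(1 + hF/(n − k − 1)).

import numpy as np

rng = np.random.default_rng(1)
n = 40
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # k + 1 = 4 columns
X_red = X_full[:, :2]                                             # reduced model (h = 2)
y = X_full[:, :2] @ np.array([1.0, 0.5]) + rng.normal(size=n)

def proj(M):
    return M @ np.linalg.inv(M.T @ M) @ M.T       # projection matrix onto C(M)

sse_full = np.sum((y - proj(X_full) @ y) ** 2)
sse_red = np.sum((y - proj(X_red) @ y) ** 2)
h, df = X_full.shape[1] - X_red.shape[1], n - X_full.shape[1]     # df = n - k - 1
F = ((sse_red - sse_full) / h) / (sse_full / df)
lam = n * np.log(sse_red / sse_full)              # -2 log(LR) = n log(sigma0_hat^2 / sigma_hat^2)
print(F, lam, n * np.log(1 + h * F / df))         # the last two numbers agree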

158 Analysis of Variance Models: The Non-Full Rank Linear Model

• To this point, we have focused exclusively on the case when the model matrix X of the linear model is of full rank. We now consider the case when X is n × p with rank(X) = k < p.

• The basic ideas behind estimation and inference in this case are the same as in the full rank case, but the fact that (XT X)−1 doesn’t exist and therefore the normal equations have no unique solution causes a number of technical complications.

• We wouldn’t bother to dwell on these technicalities if it weren’t for the fact that the non-full rank case does arise frequently in applications in the form of analysis of variance models.

The One-way Model:

Consider the balanced one-way layout model for yij, a response on the jth unit in the ith treatment group. Suppose that there are a treatments and n units in each treatment group. The cell-means model for this situation is

yij = µi + eij,  i = 1, ..., a, j = 1, ..., n,

where the eij's are i.i.d. N(0, σ²).

An alternative, but equivalent, linear model is the effects model for the one-way layout:

yij = µ + αi + eij, i = 1, . . . , a, j = 1, . . . , n, with the same assumptions on the errors.

159 The cell means model can be written in vector notation as

y = µ1 x1 + µ2 x2 + ··· + µa xa + e,  e ~ N(0, σ²I),

and the effects model can be written as

y = µ jN + α1 x1 + α2 x2 + ··· + αa xa + e,  e ~ N(0, σ²I),

where xi is an indicator for treatment i, jN is the N × 1 vector of ones, and N = an is the total sample size.

• That is, the effects model has the same model matrix as the cell-means model, but with one extra column, a column of ones, in the first position.

• Notice that Σ_i xi = jN. Therefore, the columns of the model matrix for the effects model are linearly dependent.

Let X1 denote the model matrix in the cell-means model, X2 = (jN , X1) denote the model matrix in the effects model.

• Note that C(X1) = C(X2).

In general, two linear models y = X1β1 +e1, y = X2β2 +e2 with the same assumptions on e1 and e2 are equivalent linear models if C(X1) = C(X2).

Why?

Because the mean vectors µ1 = X1β1 and µ2 = X2β2 in the two cases are both restricted to fall in the same subspace C(X1) = C(X2).

In addition, µˆ 1 = p(y|C(X1)) = p(y|C(X2)) = µˆ 2 is the same in both models, and

s² = ||y − µˆ1||²/(n − dim(C(X1))) = ||y − µˆ2||²/(n − dim(C(X2)))

is the same in both models.

• The cell-means and effects models are simply reparameterizations of one another. The relationship between the parameters in this case is very simple: µi = µ + αi, i = 1, ..., a.

• Let V = C(X1) = C(X2). In the case of the cell-means model, rank(X1) = a = dim(V ) and β1 is a × 1. In the case of the effects model, rank(X2) = a = dim(V ) but β2 is (a + 1) × 1. The effects model is overparameterized.

To understand overparameterization, consider the model

yij = µ + αi + eij, i = 1, . . . , a, j = 1, . . . , n.

This model says that E(yij) = µ + αi = µi, or

E(y1j) = µ + α1 = µ1 j = 1, . . . , n,

E(y2j) = µ + α2 = µ2 j = 1, . . . , n, . .

Suppose the true treatment means are µ1 = 10 and µ2 = 8. In terms of the parameters of the effects model, µ and the αi’s, these means can be represented in an infinity of possible ways,

E(y1j) = 10 + 0 j = 1, . . . , n,

E(y2j) = 10 + (−2) j = 1, . . . , n,

(µ = 10, α1 = 0, and α2 = −2), or

E(y1j) = 8 + 2 j = 1, . . . , n,

E(y2j) = 8 + 0 j = 1, . . . , n,

(µ = 8, α1 = 2, and α2 = 0), or

E(y1j) = 1 + 9 j = 1, . . . , n,

E(y2j) = 1 + 7 j = 1, . . . , n,

(µ = 1, α1 = 9, and α2 = 7), etc.

161 Why would we want to consider an overparameterized model like the effects model?

In a simple case like the one-way layout, I would argue that we wouldn’t.

The most important criterion for choice of parameterization of a model is interpretability. Without imposing any constraints, the parameters of the effects model do not have clear interpretations. However, subject to the constraint Σ_i αi = 0, the parameters of the effects model have the following interpretations:

µ = grand mean response across all treatments
αi = deviation from the grand mean placing µi (the ith treatment mean) up or down from the grand mean; i.e., the effect of the ith treatment.

Without the constraint, though, µ is not constrained to fall in the center of the µi's. µ is in no sense the grand mean; it is just an arbitrary baseline value. In addition, adding the constraint Σ_i αi = 0 has essentially the effect of reparameterizing from the overparameterized (non-full rank) effects model to a just-parameterized (full rank) model that is equivalent (in the sense of having the same model space) to the cell-means model.

To see this, consider the one-way effects model with a = 3, n = 2. Then Σ_{i=1}^a αi = 0 implies α1 + α2 + α3 = 0, or α3 = −(α1 + α2). Subject to the constraint, the effects model is

y = µjN + α1x1 + α2x2 + α3x3 + e, where α3 = −(α1 + α2),

162 or y = µjN + α1x1 + α2x2 + (−α1 − α2)x3 + e

= µ jN + α1(x1 − x3) + α2(x2 − x3) + e
= µ (1, 1, 1, 1, 1, 1)^T + α1 (1, 1, 0, 0, −1, −1)^T + α2 (0, 0, 1, 1, −1, −1)^T + e,

which has the same model space as the cell-means model.
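A quick numerical check of this claim (a Python/numpy sketch for a = 3, n = 2; the variable names are mine): the constrained effects-model matrix and the cell-means matrix have the same column space.

import numpy as np

X1 = np.kron(np.eye(3), np.ones((2, 1)))          # cell-means model matrix (columns x1, x2, x3)
jN = np.ones((6, 1))
# Effects model under the sum-to-zero constraint: columns jN, x1 - x3, x2 - x3
Xc = np.hstack([jN, X1[:, [0]] - X1[:, [2]], X1[:, [1]] - X1[:, [2]]])

# Same column space <=> each matrix and their concatenation all have the same rank
print(np.linalg.matrix_rank(X1),
      np.linalg.matrix_rank(Xc),
      np.linalg.matrix_rank(np.hstack([X1, Xc])))  # all equal 3, so C(X1) = C(Xc)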

Thus, when faced with a non-full rank model like the one-way effects model, we have three ways to proceed:

(1) Reparameterize to a full rank model.

– E.g., switch from the effects model to the cell-means model.

(2) Add constraints to the model parameters to remove the overparam- eterization.

– E.g., add a constraint such as Σ_{i=1}^a αi = 0 to the one-way effects model.

– Such constraints are usually called side-conditions.

– Adding side conditions essentially accomplishes a reparameterization to a full rank model as in (1).

(3) Analyze the model as a non-full rank model, but limit estimation and inference to those functions of the (overparameterized) parameters that can be uniquely estimated.

– Such functions of the parameters are called estimable.

– It is only in this case that we are actually using an overparameterized model, for which some new theory is necessary. (In cases (1) and (2) we remove the overparameterization somehow.)

163 Why would we choose option (3)?

Three reasons:

i. We can. Although we may lose nice parameter interpretations in using an unconstrained effects model or other unconstrained, non-full-rank model, there is no theoretical or methodological reason to avoid them (they can be handled with a little extra trouble).

ii. It is often easier to formulate an appropriate (and possibly overparameterized) model without worrying about whether or not it’s of full rank than to specify that model’s “full-rank version” or to identify and impose the appropriate constraints on the model to make it full rank. This is especially true in modelling complex experimental data that are not balanced.

iii. Non-full rank model matrices may arise for reasons other than the structure of the model that’s been specified. E.g., in an observational study, several explanatory variables may be collinear.

So, let’s consider the overparameterized (non-full-rank) case.

• In the non-full-rank case, it is not possible to obtain linear unbiased estimators of all of the components of β.

To illustrate this consider the effects version of the one-way layout model with no parameter constraints.

Can we find an unbiased linear estimator of α1?

To be linear, such an estimator (call it T) would be of the form T = Σ_i Σ_j dij yij for some coefficients {dij}. For T to be unbiased we require E(T) = α1. However,

E(T) = E(Σ_i Σ_j dij yij) = Σ_i Σ_j dij (µ + αi) = µ d·· + Σ_i di· αi.

Thus, the unbiasedness requirement E(T) = α1 implies d·· = 0, d1· = 1, d2· = ··· = da· = 0. This is impossible, because these conditions require d·· = Σ_i di· = 1 and d·· = 0 simultaneously.

• So, α1 is non-estimable. In fact, all of the parameters of the unconstrained one-way effects model are non-estimable. More generally, in any non-full rank linear model, at least one of the individual parameters of the model is not estimable.

If the parameters of a non-full rank linear model are non-estimable, what does least-squares yield?

Even if X is not of full rank, the least-squares criterion is still a reasonable one for estimation, and it still leads to the normal equation:

XT Xβˆ = XT y. (♣)

Theorem: For X an n × p matrix of rank k < p ≤ n, (♣) is a consistent system of equations.

Proof: By the Theorem on p. 60 of these notes, (♣) is consistent iff

XT X(XT X)−XT y = XT y.

But this equation holds by result 3, on p. 57.

So (♣) is consistent, and therefore has a non-unique (for X not of full rank) solution given by βˆ = (X^T X)^− X^T y, where (X^T X)^− is some (any) generalized inverse of X^T X.
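In practice, one convenient generalized inverse is the Moore–Penrose pseudoinverse. A minimal Python/numpy sketch (simulated data for the unconstrained one-way effects model with a = 3, n = 2; names mine):

import numpy as np

rng = np.random.default_rng(2)
X1 = np.kron(np.eye(3), np.ones((2, 1)))          # treatment indicators x1, x2, x3
X = np.hstack([np.ones((6, 1)), X1])              # effects-model matrix: 4 columns, rank 3
y = X1 @ np.array([10.0, 8.0, 12.0]) + rng.normal(scale=0.5, size=6)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y      # one (non-unique) solution of the normal equations
mu_hat = X @ beta_hat                             # fitted values: unique = projection of y onto C(X)
print(beta_hat)
print(np.allclose(mu_hat, X @ np.linalg.lstsq(X, y, rcond=None)[0]))   # same fitted values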

What does βˆ estimate in the non-full rank case?

Well, in general a statistic estimates its expectation, so for a particular generalized inverse (XT X)−, βˆ estimates

E(βˆ) = E{(X^T X)^− X^T y} = (X^T X)^− X^T E(y) = (X^T X)^− X^T Xβ ≠ β.

• That is, in the non-full rank case, βˆ = (XT X)−XT y is not unbiased for β. This is not surprising given that we said earlier that β is not estimable.

165 • Note that E(βˆ) = (XT X)−XT Xβ depends upon which (of many possible) generalized inverses (XT X)− is used in βˆ = (XT X)−XT y. That is, βˆ, a solution of the normal equations, is not unique, and each possible choice estimates something different.

• This is all to reiterate that β is not estimable, and βˆ is not an estimator of β in the not-full rank model. However, certain linear combinations of β are estimable, and we will see that the corresponding linear combinations of βˆ are BLUEs of these estimable quantities.

Estimability: Let λ = (λ1, ..., λp)^T be a vector of constants. The parameter λ^T β = Σ_j λj βj is said to be estimable if there exists a vector a in R^n such that

E(a^T y) = λ^T β, for all β ∈ R^p.   (†)

Since (†) is equivalent to aT Xβ = λT β for all β, it follows that λT β is estimable if and only if there exists a such that XT a = λ (i.e., iff λ lies in the row space of X).

This and two other necessary and sufficient conditions for estimability of λT β are given in the following theorem:

Theorem: In the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n, the linear function λT β is estimable if and only if any one of the following conditions hold:

(i) λ lies in the row space of X. I.e., λ ∈ C(XT ), or, equivalently, if there exists a vector a such that λ = XT a.

(ii) λ ∈ C(XT X). I.e., if there exists a vector r such that λ = (XT X)r.

(iii) λ satisfies XT X(XT X)−λ = λ, where (XT X)− is any symmetric generalized inverse of XT X.

166 Proof: Part (i) follows from the comment directly following the definition of estimability. That is, (†), the definition of estimability, is equivalent to aT Xβ = λT β for all β, which happens iff aT X = λT ⇔ λ = XT a, ⇔ λ ∈ C(XT ).

Now, condition (iii) is equivalent to condition (i) because (i) implies (iii): (i) implies

λ^T (X^T X)^− X^T X = a^T X(X^T X)^− X^T X = a^T X = λ^T   (using X(X^T X)^− X^T X = X),

which, taking transposes of both sides, implies (iii); and (iii) implies (i): (iii) says λ = X^T [X(X^T X)^− λ], which is of the form X^T a for a = X(X^T X)^− λ.

Finally, condition (ii) is equivalent to condition (iii) because (ii) implies (iii): (ii) implies

XT X(XT X)−λ = XT X(XT X)−XT Xr = XT Xr = λ; and (iii) implies (ii): (iii) says

λ = X^T X [(X^T X)^− λ],

which is of the form X^T X a for a = (X^T X)^− λ.

167 Example: Let x1 = (1, 1, 1, 1)^T, x2 = (1, 1, 1, 0)^T and x3 = 3x1 − 2x2 = (1, 1, 1, 3)^T. Then X = (x1, x2, x3) is 4 × 3 but has rank of only 2.

Consider the linear combination η = 5β1 + 3β2 + 9β3 = λ^T β, where λ = (5, 3, 9)^T. η is estimable because λ is in the row space of X:

λ = (5, 3, 9)^T = X^T a for a = (1, 1, 1, 2)^T, where the rows of X^T are (1, 1, 1, 1), (1, 1, 1, 0), and (1, 1, 1, 3).

The parameters β1 and β1 − β2 are not estimable. Why? Because for λ^T β to be estimable, there must exist an a so that

X^T a = λ, i.e.,

x1^T a = λ1,  x2^T a = λ2,  and  3x1^T a − 2x2^T a = λ3,

which implies 3λ1 − 2λ2 = λ3 must hold for λ^T β to be estimable. This equality does not hold for β1 = 1β1 + 0β2 + 0β3 or for β1 − β2 = 1β1 + (−1)β2 + 0β3. It does hold for λ = (5, 3, 9)^T because 3(5) − 2(3) = 9.
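Condition (iii) of the theorem gives a direct numerical estimability check. A Python/numpy sketch for the 4 × 3 example above (the helper name is mine):

import numpy as np

x1 = np.array([1.0, 1, 1, 1])
x2 = np.array([1.0, 1, 1, 0])
X = np.column_stack([x1, x2, 3 * x1 - 2 * x2])    # 4 x 3 matrix of rank 2

def is_estimable(X, lam, tol=1e-10):
    # lambda'beta is estimable iff X'X (X'X)^- lambda = lambda (symmetric g-inverse).
    XtX = X.T @ X
    return np.allclose(XtX @ np.linalg.pinv(XtX) @ lam, lam, atol=tol)

print(is_estimable(X, np.array([5.0, 3, 9])))     # True:  3(5) - 2(3) = 9
print(is_estimable(X, np.array([1.0, 0, 0])))     # False: beta1 alone is not estimable
print(is_estimable(X, np.array([1.0, -1, 0])))    # False: beta1 - beta2 is not estimable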

168 Theorem: In the non-full rank linear model y = Xβ + e, the number of linearly independent estimable functions of β is the rank of X.

Proof: This follows from the fact that estimable functions λ^T β must satisfy λ ∈ C(X^T) and dim{C(X^T)} = rank(X^T) = rank(X).

• Let xi^T be the ith row of X. Since each xi is in the row space of X, it follows that every xi^T β (every element of µ = Xβ) is estimable, i = 1, ..., n.

• Similarly, from the theorem on p. 165, every row (element) of XT Xβ is estimable, and therefore XT Xβ itself is estimable.

• In fact, all estimable functions can be obtained from Xβ or XT Xβ.

Theorem: In the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n, any estimable function λT β can be obtained by taking a linear combination of the elements of Xβ or of the elements of XT Xβ.

Proof: Follows directly from the theorem on p. 166.

Example: The one-way layout model (effects version).

Consider again the effects version of the (balanced) one way layout model:

yij = µ + αi + eij, i = 1, . . . , a, j = 1, . . . , n.

Suppose that a = 3 and n = 2. Then, in matrix notation, this model is

(y11, y12, y21, y22, y31, y32)^T = X (µ, α1, α2, α3)^T + e,  where X =
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1

169 The previous theorem says that any estimable function of β can be obtained as a linear combination of the elements of Xβ. In addition, by the theorem on p. 166, vice versa (any linear combination of the elements of Xβ is estimable).

So, any linear combination aT Xβ for some a is estimable.

Examples:

a^T = (1, 0, −1, 0, 0, 0) ⇒ a^T Xβ = (0, 1, −1, 0)β = α1 − α2
a^T = (0, 0, 1, 0, −1, 0) ⇒ a^T Xβ = (0, 0, 1, −1)β = α2 − α3
a^T = (1, 0, 0, 0, −1, 0) ⇒ a^T Xβ = (0, 1, 0, −1)β = α1 − α3
a^T = (1, 0, 0, 0, 0, 0) ⇒ a^T Xβ = (1, 1, 0, 0)β = µ + α1
a^T = (0, 0, 1, 0, 0, 0) ⇒ a^T Xβ = (1, 0, 1, 0)β = µ + α2
a^T = (0, 0, 0, 0, 1, 0) ⇒ a^T Xβ = (1, 0, 0, 1)β = µ + α3

• So, all treatment means (quantities of the form µ + αi) are estimable, and all pairwise differences in the treatment effects (quantities of the form αi − αj) are estimable in the one-way layout model. Actually, any contrast in the treatment effects is estimable. A contrast is a linear combination whose coefficients sum to zero.

• Thus, even though the individual parameters (µ, α1, α2,...) of the one-way layout model are non-estimable, it is still useful, because all of the quantities of interest in the model (treatment means and contrasts) are estimable.

170 Estimation in the non-full rank linear model:

A natural candidate for an estimator of an estimable function λT β is λT βˆ, where βˆ is a solution of the least squares normal equation XT Xβˆ = XT y (that is, where βˆ = (XT X)−XT y for some generalized inverse (XT X)−).

The following theorem shows that this estimator is unbiased, and even though βˆ is not unique, λT βˆ is.

Theorem: Let λT β be an estimable function of β in the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n. Let βˆ be any solution of the normal equation XT Xβˆ = XT y. Then the estimator λT βˆ has the following properties:

(i) (unbiasedness) E(λT βˆ) = λT β; and

(ii) (uniqueness) λ^T βˆ is invariant to the choice of βˆ (to the choice of generalized inverse (X^T X)^− in the formula βˆ = (X^T X)^− X^T y).

Proof: Part (i):

E(λT βˆ) = λT E(βˆ) = λT (XT X)−XT Xβ = λT β where the last equality follows from part (iii) of the theorem on p. 165.

Part (ii): Because λT β is estimable, λ = XT a for some a. Therefore,

λ^T βˆ = a^T X(X^T X)^− X^T y = a^T P_{C(X)} y.

The result now follows from the fact that projection matrices are unique (see pp. 57–58).

• Note that λT βˆ can be written as λT βˆ = rT XT y for r a solution of XT Xr = λ. (This fact is used quite a bit in our book).
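The invariance in part (ii) is easy to see numerically by comparing two different generalized inverses of X^T X. A Python/numpy sketch for the one-way effects model (simulated data; names mine):

import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])   # a = 3, n = 2
y = rng.normal(loc=[10, 10, 8, 8, 12, 12], scale=0.5)

XtX = X.T @ X
G1 = np.linalg.pinv(XtX)                                  # Moore-Penrose generalized inverse
G2 = np.zeros((4, 4))
G2[1:, 1:] = np.linalg.inv(XtX[1:, 1:])                   # another generalized inverse of X'X
assert np.allclose(XtX @ G2 @ XtX, XtX)                   # confirms the defining property

b1, b2 = G1 @ X.T @ y, G2 @ X.T @ y                       # two different solutions of the normal equations
lam = np.array([0.0, 1, -1, 0])                           # estimable contrast alpha1 - alpha2
print(np.allclose(lam @ b1, lam @ b2))                    # True: lambda'beta_hat is invariant
print(np.allclose(b1, b2))                                # False: beta_hat itself is not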

171 Theorem: Under the conditions of the previous theorem, and where var(e) = var(y) = σ2I, the variance of λT βˆ is unique, and is given by

var(λT βˆ) = σ2λT (XT X)−λ, where (XT X)− is any generalized inverse of XT X.

Proof:

var(λ^T βˆ) = λ^T var((X^T X)^− X^T y) λ
= λ^T (X^T X)^− X^T (σ²I) X {(X^T X)^−}^T λ
= σ² [λ^T (X^T X)^− X^T X] {(X^T X)^−}^T λ   (the bracketed factor equals λ^T)
= σ² λ^T {(X^T X)^−}^T λ
= σ² a^T X {(X^T X)^−}^T X^T a   (for some a, since λ = X^T a)
= σ² a^T X (X^T X)^− X^T a = σ² λ^T (X^T X)^− λ.

Uniqueness: since λT β is estimable λ = XT a for some a. Therefore,

var(λ^T βˆ) = σ² λ^T (X^T X)^− λ = σ² a^T X(X^T X)^− X^T a = σ² a^T P_{C(X)} a.

Again, the result follows from the fact that projection matrices are unique.

Theorem: Let λ1^T β and λ2^T β be two estimable functions in the model considered in the previous theorem (the spherical errors, non-full-rank linear model). Then the covariance of the least-squares estimators of these quantities is

cov(λ1^T βˆ, λ2^T βˆ) = σ² λ1^T (X^T X)^− λ2.

Proof: Homework.

172 In the full rank linear model, the Gauss-Markov theorem established that λ^T βˆ = λ^T (X^T X)^{-1} X^T y was the BLUE of its mean λ^T β. This result holds in the non-full rank linear model as well, as long as λ^T β is estimable.

Theorem: (Gauss-Markov in the non-full rank case) If λT β is estimable in the spherical errors non-full rank linear model y = Xβ + e, then λT βˆ is its BLUE.

Proof: Since λT β is estimable, λ = XT a for some a. λT βˆ = aT Xβˆ is a linear estimator because it is of the form

λ^T βˆ = a^T X(X^T X)^− X^T y = a^T P_{C(X)} y = c^T y,

where c = P_{C(X)} a. We have already seen that λ^T βˆ is unbiased. Consider any other linear estimator d^T y of λ^T β. For d^T y to be unbiased, its mean E(d^T y) = d^T Xβ must satisfy E(d^T y) = λ^T β for all β, or equivalently, d^T Xβ = λ^T β for all β, which implies

dT X = λT .

The covariance between λT βˆ and dT y is

cov(λT βˆ, dT y) = cov(cT y, dT y) = σ2cT d = σ2λT (XT X)−XT d = σ2λT (XT X)−λ.

Now

0 ≤ var(λ^T βˆ − d^T y) = var(λ^T βˆ) + var(d^T y) − 2 cov(λ^T βˆ, d^T y)
= σ² λ^T (X^T X)^− λ + var(d^T y) − 2σ² λ^T (X^T X)^− λ
= var(d^T y) − σ² λ^T (X^T X)^− λ = var(d^T y) − var(λ^T βˆ).

Therefore, var(dT y) ≥ var(λT βˆ) with equality holding iff dT y = λT βˆ. I.e., an arbitrary linear unbiased estimator dT y has variance ≥ to that of the least squares estimator with equality iff the arbitrary estimator is the least squares estimator.

173 ML Estimation:

In the not-necessarily-full-rank model with normal errors:

y = Xβ + e,  e ~ N_n(0, σ² I_n),   (∗)

where X is n × p with rank k ≤ p ≤ n, the ML estimators of β, σ² change as expected from their values in the full rank case. That is, we replace inverses with generalized inverses in the formulas for the MLEs βˆ and σˆ², and the MLE of β coincides with the OLS estimator, which is BLUE.

Theorem: In model (*) MLEs of β and σ2 are given by

βˆ = (X^T X)^− X^T y,
σˆ² = (1/n)(y − Xβˆ)^T (y − Xβˆ) = (1/n)||y − Xβˆ||².

Proof: As in the full rank case, the loglikelihood is

ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²))||y − Xβ||²

(cf. p. 110). By inspection, it is clear that the maximizer of ℓ(β, σ²; y) with respect to β is the same as the minimizer of ||y − Xβ||², which is the least-squares criterion. Differentiating the LS criterion w.r.t. β gives the normal equations, which we know have the solution βˆ = (X^T X)^− X^T y. Plugging βˆ back into ℓ(β, σ²; y) gives the profile loglikelihood for σ², which we then maximize w.r.t. σ². These steps follow exactly as in the full rank case, leading to σˆ² = (1/n)||y − Xβˆ||².

• Note that βˆ is not the (unique) MLE, but is an MLE of β corresponding to one particular choice of generalized inverse (X^T X)^−.

• σˆ² is the unique MLE of σ², though, because σˆ² is a function of Xβˆ = X(X^T X)^− X^T y = p(y|C(X)), which is invariant to the choice of (X^T X)^−.

174 s2 = MSE is an unbiased estimator of σ2:

As in the full rank case, the MLE σˆ² is biased as an estimator of σ², and is therefore not the preferred estimator. The bias of σˆ² can be seen as follows:

E(σˆ²) = (1/n) E{(y − Xβˆ)^T (y − Xβˆ)}
= (1/n) E{[(I − X(X^T X)^− X^T) y]^T (I − X(X^T X)^− X^T) y}   (here I − X(X^T X)^− X^T = P_{C(X)⊥})
= (1/n) E{y^T P_{C(X)⊥} y}
= (1/n) {σ² dim[C(X)⊥] + (Xβ)^T P_{C(X)⊥} (Xβ)}   (the second term is 0, since Xβ ∈ C(X))
= (1/n) σ²(n − dim[C(X)]) + 0 = (1/n) σ²(n − rank(X)) = σ² (n − k)/n.

Therefore, an unbiased estimator of σ2 is

s² = [n/(n − k)] σˆ² = (y − Xβˆ)^T (y − Xβˆ)/(n − k) = SSE/dfE = MSE.

Theorem: In the model y = Xβ + e, E(e) = 0, var(e) = σ2I, and where X is n × p of rank k ≤ p ≤ n, we have the following properties of s2:

(i) (unbiasedness) E(s2) = σ2. (ii) (uniqueness) s2 is invariant to the choice of βˆ (i.e., to the choice of generalized inverse (XT X)−).

Proof: (i) follows from the construction of s2 as nσˆ2/(n − k) and the bias ofσ ˆ2. (ii) follows from the uniqueness (invariance) ofσ ˆ2.

175 Distributions of βˆ and s2:

In the normal-errors, not-necessarily full rank model (*), the distribution of βˆ and s2 can be obtained. These distributional results are essentially the same as in the full rank case, except for the mean and variance of βˆ:

Theorem: In model (*),

(i) For any given choice of (XT X)−,

βˆ ~ N_p[(X^T X)^− X^T Xβ, σ² (X^T X)^− X^T X {(X^T X)^−}^T],

(ii) (n − k)s2/σ2 ∼ χ2(n − k), and

(iii) For any given choice of (XT X)−, βˆ and s2 are independent.

Proof: Homework. Proof is essentially the same as in the full rank case. Adapt the proof on p. 115.

• In the full rank case we saw that, with normal errors and spherical var-cov structure, βˆ and s² were minimum variance unbiased estimators. This result continues to hold in the not-full-rank case.

176 Reparameterization:

The idea in reparameterization is to transform from the vector of non-estimable parameters β in the model y = Xβ + e, where X is n × p with rank k < p ≤ n, to a vector of linearly independent estimable functions of β:

(u1^T β, u2^T β, ..., uk^T β)^T = Uβ ≡ γ.

Here U is the k × p matrix with rows u1^T, ..., uk^T, so that the elements of γ = Uβ are a “full set” of linearly independent estimable functions of β.

The new full-rank model is

y = Zγ + e, (∗) where Z is n × k of full rank, and Zγ = Xβ (the mean under the non-full rank model is the same as under the full rank model, we’ve just changed the parameterization; i.e., switched from β to γ.)

To find the new (full rank) model matrix Z, note that Zγ = Xβ and γ = Uβ for all β imply ZUβ = Xβ, for all β, ⇒ ZU = X ⇒ ZUUT = XUT ⇒ Z = XUT (UUT )−1.

• Note that U is of full rank, so (UUT )−1 exists.

• Note also that we have constructed Z to be of full rank:

rank(Z) ≥ rank(ZU) = rank(X) = k

but rank(Z) ≤ k, because Z is n × k. Therefore, rank(Z) = k.

177 Thus the reparameterized model (*) is a full rank model, and we can obtain the BLUE of E(y) as

µˆ = P_{C(Z)} y = Z(Z^T Z)^{-1} Z^T y,

and the BLUE of γ as γˆ = (Z^T Z)^{-1} Z^T y.

If we are interested in any other estimable function of the original parameter β than those given by γ = Uβ, such quantities are easily estimated from γˆ. Any estimable function λ^T β must satisfy λ^T = a^T X for some a. So λ^T β = a^T Xβ = a^T Zγ = b^T γ for b = Z^T a. Therefore, any estimable function λ^T β can be written as b^T γ for some b and the BLUE of λ^T β is given by

\widehat{λ^T β} = b^T γˆ.

• Note that the choice of a “full set” of linearly independent estimable functions Uβ = γ is not unique. We could choose another set of LIN estimable functions Vβ = δ, and then reparameterize to a different full rank linear model y = Wδ + e where Wδ = Zγ = Xβ. Any reparameterization leads to the same estimator of λT β.
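Here is a sketch of the construction Z = XU^T(UU^T)^{-1} for the one-way effects model, taking γ to be the treatment means µ + αi (Python/numpy; names mine):

import numpy as np

X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])   # effects model, a = 3, n = 2
U = np.array([[1.0, 1, 0, 0],      # rows pick out mu + alpha_i, a full set of
              [1.0, 0, 1, 0],      # linearly independent estimable functions
              [1.0, 0, 0, 1]])
Z = X @ U.T @ np.linalg.inv(U @ U.T)       # full-rank model matrix of the reparameterized model

print(np.linalg.matrix_rank(Z))            # 3 = rank(X)
print(np.allclose(Z @ U, X))               # ZU = X, so Z gamma = X beta whenever gamma = U beta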

178 Example: The Two-way ANOVA Model

In a two-way layout, observations are taken at all combinations of the levels of two treatment factors. Suppose factor A has a levels and factor B has b levels, then in a balanced two-way layout n observations are obtained in each of the ab treatments (combinations of A and B).

Let yijk = the kth observation at the ith level of A combined with the jth level of B.

One way to think about the analysis of a two-way layout is that if we ignore factors A and B, then what we have is really just a one-way experiment with ab treatments. Therefore, a one-way layout-type model, with a mean for each treatment, can be used. This leads to the cell-means model for the two-way layout, which is a full-rank model:

yijk = µij + eijk, i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , n.

Often, an effects model is used instead for the two-way layout. In the effects model, the (i, j)th treatment mean is decomposed into a constant term plus additive effects for factor A, factor B, and factor A combined with factor B: µij = µ + αi + βj + (αβ)ij This leads to the effects model for the two-way layout: yijk = µ + αi + βj + (αβ)ij + eijk, i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , n.

179 Suppose a = 2, b = 3 and n = 2. Then the effects model can be written as y = Xβ + e, where

y = (y111, y112, y121, y122, y131, y132, y211, y212, y221, y222, y231, y232)^T,
β = (µ, α1, α2, β1, β2, β3, (αβ)11, (αβ)12, (αβ)13, (αβ)21, (αβ)22, (αβ)23)^T,

and, writing the six interaction columns compactly as I6 ⊗ (1, 1)^T,

X =
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 0 1 0 0 0 0 0 1
1 0 1 0 0 1 0 0 0 0 0 1

180 Obviously, the effects model is overparameterized and X is not of full rank. In fact, rank(X) = ab = 6.

One way to reparameterize the effects model is to choose γ to be the vector of treatment means. That is, take

γ = (γ1, ..., γ6)^T = (µ + α1 + β1 + (αβ)11, µ + α1 + β2 + (αβ)12, µ + α1 + β3 + (αβ)13, µ + α2 + β1 + (αβ)21, µ + α2 + β2 + (αβ)22, µ + α2 + β3 + (αβ)23)^T = Uβ,

where

U =
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 0 1 0 0 0 0 0 1

To reparameterize in terms of γ, we can use

Z = I6 ⊗ (1, 1)^T =
1 0 0 0 0 0
1 0 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 0 1
0 0 0 0 0 1

181 I leave it to you to verify that Zγ = Xβ, and that ZU = X.

• Note that this choice of γ and Z amounts to a reparameterization from the effects model y = Xβ + e to the cell-means model y = Zγ + e. That is, γ = (γ1, ..., γ6)^T is just a relabelling of (µ11, µ12, µ13, µ21, µ22, µ23)^T.

• Any estimable function λT β can be obtained as bT γ for some b. For example, the main effect of A corresponds to

λ^T β = {α1 + (1/3)[(αβ)11 + (αβ)12 + (αβ)13]} − {α2 + (1/3)[(αβ)21 + (αβ)22 + (αβ)23]}
= {α1 + \overline{(αβ)}_{1·}} − {α2 + \overline{(αβ)}_{2·}}

(that is, λ = (0, 1, −1, 0, 0, 0, 1/3, 1/3, 1/3, −1/3, −1/3, −1/3)^T), which can be written as b^T γ for b = (1/3, 1/3, 1/3, −1/3, −1/3, −1/3)^T.

Side Conditions:

Another approach for removing the rank deficiency of X in the non-full rank case is to impose linear constraints on the parameters, called side conditions. We have already seen one example (pp. 162–163): the one-way effects model with effects that sum to zero,

yij = µ + αi + eij,  Σ_i αi = 0,

for i = 1, ..., a, j = 1, ..., n.

Consider the case a = 3 and n = 2. Then the model can be written as

(y11, y12, y21, y22, y31, y32)^T = X (µ, α1, α2, α3)^T + e,  where α1 + α2 + α3 = 0 and

X =
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1

182 Imposing the constraint α3 = −(α1 + α2) on the model equation, we can rewrite it as

y = X̃ (µ, α1, α2)^T + e,  where X̃ =
1  1  0
1  1  0
1  0  1
1  0  1
1 −1 −1
1 −1 −1

and now we are back in the full rank case with the same model space, since C(X) = C(X̃).

Another Example - The Two-way Layout Model w/o Interaction:

We have seen that there are two equivalent models for the two-way layout with interaction: the cell-means model,

yijk = µij + eijk,  i = 1, ..., a, j = 1, ..., b, k = 1, ..., nij,   (∗)

yijk = µ + αi + βj + γij + eijk. (∗∗)

• Model (**) is overparameterized and has a non-full rank model ma- trix. Model (*) is just-parameterized and has a full rank model matrix. However, they both have the same model space, and are therefore equivalent.

• We’ve now developed theory that allows us to use model (**) “as is” by restricting attention to estimable functions, using generalized inverses, etc.

• We’ve also seen that we can reparameterize from model (**) to model (*) by identifying the µij’s where µij = µ + αi + βj + γij as a full set of LIN estimable functions of the parameters in (**).

183 Another way to get around the overparameterization of (**) is to impose side conditions. Side conditions are not unique; there are lots of valid choices. But in this model, the set of side conditions that is most commonly used is

Σ_i αi = 0,  Σ_j βj = 0,  Σ_i γij = 0 for each j,  and  Σ_j γij = 0 for each i.

As in the one-way effects model example, substituting these constraints into the model equation leads to an equivalent full rank model.

• These side conditions, and those considered in the one-way model, are often called the “sum-to-zero constraints”, or the “usual constraints”, or sometimes “the anova constraints”.

• The sum-to-zero constraints remove the rank deficiency in the model matrix, but they also give the parameters nice interpretations. E.g., in the one-way layout model, the constraint Σ_i αi = 0 forces µ to fall in the middle of all of the µ + αi's, rather than being some arbitrary constant. Thus, µ is the overall mean response averaged over all of the treatment groups, and the αi's are deviations from this “grand mean” associated with the ith treatment.

In both the one-way model and the two-way model with interaction, there’s an obvious alternative (the cell-means model) to reparameterize to, so perhaps these aren’t the best examples to motivate the use of side conditions.

A better example is the two-way model with no interaction. That is, suppose we want to assume that there is no interaction between the two treatment factors. That is, we want to assume that the difference between any two levels of factor A is the same across levels of factor B.

184 How could we formulate such a model?

The easiest way is just to set the interaction effects γij in the effects model (**) to 0, yielding the (still overparameterized) model

yijk = µ + αi + βj + eijk (∗ ∗ ∗)

• In contrast, it is not so easy to see how a no-interaction, full-rank version of the cell-means model can be formed. And, therefore, reparameterization from (***) is not an easy option, since it’s not so obvious what the model is that we would like to reparameterize to.

Side conditions are a much easier option to remove the overparameterization in (***). Again, the “sum-to-zero” constraints are convenient because they remove the rank deficiency and give the parameters nice interpretations.

Under the sum-to-zero constraints, model (***) becomes

yijk = µ + αi + βj + eijk,  Σ_i αi = 0,  Σ_j βj = 0.   (†)

In model (†) we can substitute

αa = −Σ_{i=1}^{a−1} αi  and  βb = −Σ_{j=1}^{b−1} βj

into the model equation

yijk = µ + αi + βj + eijk,  i = 1, ..., a, j = 1, ..., b, k = 1, ..., nij.

185 For example, consider the following data from an unbalanced two-way layout in which rats were fed diets that differed in factor A, protein level (high and low), and factor B, food type (beef, cereal, pork). The response is weight gain.

        High Protein               Low Protein
     Beef   Cereal   Pork       Beef   Cereal   Pork
      73      98      94         90     107      49
     102      74      79         76      95      82
              56                  90

Letting αi represent the effect of the ith level of protein and βj the effect of the jth food type, the model matrix for model (***) (the unconstrained version of model (†)) based on these data is

X =
1 1 0 1 0 0
1 1 0 1 0 0
1 1 0 0 1 0
1 1 0 0 1 0
1 1 0 0 1 0
1 1 0 0 0 1
1 1 0 0 0 1
1 0 1 1 0 0
1 0 1 1 0 0
1 0 1 1 0 0
1 0 1 0 1 0
1 0 1 0 1 0
1 0 1 0 0 1
1 0 1 0 0 1

186 If we add the constraints so that we use the full-rank model (†) instead of (***) we can substitute

α2 = −α1 and β3 = −β1 − β2, so that the model mean becomes

X
1  0  0  0
0  1  0  0
0 −1  0  0
0  0  1  0
0  0  0  1
0  0 −1 −1
(µ, α1, β1, β2)^T.

Therefore, the constrained model (†) is equivalent to a model with unconstrained parameter vector (µ, α1, β1, β2)^T and full rank model matrix

X̃ =
1  1  1  0
1  1  1  0
1  1  0  1
1  1  0  1
1  1  0  1
1  1 −1 −1
1  1 −1 −1
1 −1  1  0
1 −1  1  0
1 −1  1  0
1 −1  0  1
1 −1  0  1
1 −1 −1 −1
1 −1 −1 −1

• Another valid set of side conditions in this model is

α2 = 0,  β3 = 0.

I leave it to you to derive the reduced model matrix (the model matrix under the side conditions) for these conditions, and to convince yourself that these side conditions, like the sum-to-zero constraints, leave the model space unchanged.

187 • In the previous example, note that the two sets of side conditions were

[0 1 1 0 0 0; 0 0 0 1 1 1] (µ, α1, α2, β1, β2, β3)^T = (α1 + α2, β1 + β2 + β3)^T = (0, 0)^T

and

[0 0 1 0 0 0; 0 0 0 0 0 1] (µ, α1, α2, β1, β2, β3)^T = (α2, β3)^T = (0, 0)^T.

In each case, the side condition was of the form Tβ = 0, where Tβ was a vector of non-estimable functions of β.

This result is general. Side conditions must be restrictions on non-estimable functions of β. If constraints are placed on estimable functions of β then this actually changes the model space, which is not our goal.

• Note also that in the example the rank deficiency of X was 2 (X had 6 columns but rank equal to 4). Therefore, two side conditions were necessary to remove this rank deficiency (the number of elements of Tβ was 2).

188 In general, for the linear model

y = Xβ + e,  E(e) = 0,  var(e) = σ² I_n,

where X is n × p with rank(X) = k < p ≤ n, we define side conditions to be a set of constraints of the form Tβ = 0, where T is a q × p matrix of rank q = p − k (q = the rank deficiency), and

i. rank([X; T]) = p, and
ii. rank([X; T]) = rank(T) + rank(X),

where [X; T] denotes X stacked on top of T.

• Note that (i) and (ii) imply Tβ is nonestimable.

Theorem: In the spherical errors linear model y = Xβ + e where X is n × p of rank k < p, the unique least-squares estimator of β under the side condition Tβ = 0 is given by

βˆ = (XT X + TT T)−1XT y.

Proof: As in problem 8.19 from your homework, we can use the method of Lagrange multipliers. Introducing a Lagrange multiplier λ, the constrained least squares estimator of β minimizes

u = (y − Xβ)T (y − Xβ) + λT (Tβ − 0)

Differentiating u with respect to β and λ leads to the equations

X^T Xβ + (1/2) T^T λ = X^T y,   Tβ = 0,

which can be written as the single equation

[X^T X  T^T; T  0] [β; (1/2)λ] = [X^T y; 0].

189 Under the conditions for Tβ = 0 to be a side condition, it can be shown that [X^T X  T^T; T  0] is nonsingular with inverse given by

[X^T X  T^T; T  0]^{-1} = [H^{-1} X^T X H^{-1}   H^{-1} T^T; T H^{-1}   0],

where H = X^T X + T^T T. (See Wang and Chow, Advanced Linear Models, §5.2, for details.)

Therefore, the constrained least-squares equations have a unique solution given by

[βˆ; (1/2)λˆ] = [H^{-1} X^T X H^{-1}   H^{-1} T^T; T H^{-1}   0] [X^T y; 0]
= [H^{-1} X^T X H^{-1} X^T y; T H^{-1} X^T y] = [H^{-1} X^T y; 0].

Here, the last equality follows because

i. XT XH−1XT = XT and ii. TH−1XT = 0.

190 To show (i) and (ii), let X1 be a k × p matrix containing the linearly independent rows of X and let L = [X1; T] (X1 stacked on T). Then L is a p × p nonsingular matrix.

There exists an n×k matrix C such that X = CX1 = (C, 0)L. In addition, we can write T = 0X1 + T = (0, Iq)L.

Therefore,

X^T X = L^T [C^T C  0; 0  0] L,  and  T^T T = L^T [0  0; 0  I_q] L.

Note that CT C is nonsingular (this follows from result 3 on rank, p. 15), so direct calculation gives

X^T X(X^T X + T^T T)^{-1} X^T = L^T [C^T C  0; 0  0] L {L^T [C^T C  0; 0  I_q] L}^{-1} X^T
= L^T [C^T C  0; 0  0] L L^{-1} [(C^T C)^{-1}  0; 0  I_q] (L^T)^{-1} X^T
= L^T [I_k  0; 0  0] (L^T)^{-1} X^T = L^T [I_k  0; 0  0] (L^T)^{-1} L^T [C^T; 0]
= L^T [I_k  0; 0  0] [C^T; 0] = L^T [C^T; 0] = X^T,

establishing (i), and

T(X^T X + T^T T)^{-1} X^T = (0, I_q) L {L^{-1} [(C^T C)^{-1}  0; 0  I_q] (L^T)^{-1}} L^T [C^T; 0]
= (0, I_q) [(C^T C)^{-1}  0; 0  I_q] [C^T; 0] = 0,

which establishes (ii).

191 Example - Weight Gain in Rats (Continued):

Returning to the data of p. 186, we now have two equivalent methods of obtaining the unique least-squares parameter estimates of the constrained model

yijk = µ + αi + βj + eijk,  Σ_i αi = 0,  Σ_j βj = 0.   (†)

First, we can solve the constraints to yield

α2 = −α1  and  β3 = −β1 − β2.

Substituting into the model equation gives the full rank model matrix X̃ given on p. 187. Thus the least-squares estimate for the unconstrained parameter vector δ = (µ, α1, β1, β2)^T based on model (†) is given by

δˆ = (X̃^T X̃)^{-1} X̃^T y = (82.733, −0.941, 3.278, 3.455)^T.

Alternatively, we can use the method of Lagrange multipliers to obtain the least-squares estimate of the original parameter vector β = (µ, α1, α2, β1, β2, β3)^T subject to the constraints. This estimator is given by

βˆ = (X^T X + T^T T)^{-1} X^T y,   (‡)

where T is the constraint matrix given on the top of p. 187:

T = [0 1 1 0 0 0; 0 0 0 1 1 1].

Substituting the original model matrix X from p. 186 and T into (‡), we obtain

βˆ = (82.733, −0.941, 0.941, 3.278, 3.455, −6.733)^T.

192 • Note that the solution βˆ is one (of infinitely many) valid solutions to the unconstrained model (***). It corresponds to one possible choice of generalized inverses for XT X.

• In particular it corresponds to choosing (XT X + TT T)−1 as the generalized inverse of XT X. That (XT X + TT T)−1 is a generalized inverse of XT X follows from result (i) on p.190. By (i),

X^T X(X^T X + T^T T)^{-1} X^T X = X^T X   (since X^T X(X^T X + T^T T)^{-1} X^T = X^T),

which is the defining property of a generalized inverse.

• Whichever approach we take, we obtain the same estimate of µ = Xβ and of any estimable function of β (for example, α1 − α2, the difference in treatment effects for high and low protein) in our original overparameterized model. See ratsexamp.txt.
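The computation in (‡) is easy to reproduce in general. The sketch below (Python/numpy, using simulated data rather than the rat data; all names mine) fits an unbalanced two-way additive model both with the side-condition formula and with a generalized inverse, and confirms that the fitted values and an estimable contrast such as α1 − α2 agree.

import numpy as np

rng = np.random.default_rng(4)
A = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])     # factor A: 2 levels
B = np.array([0, 0, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 2, 2])     # factor B: 3 levels (unbalanced)
n = len(A)
X = np.hstack([np.ones((n, 1)),
               (A[:, None] == np.arange(2)).astype(float),   # alpha columns
               (B[:, None] == np.arange(3)).astype(float)])  # beta columns
y = 80 + 2 * (A == 0) + 3 * (B == 0) + rng.normal(scale=5, size=n)

T = np.array([[0.0, 1, 1, 0, 0, 0],                          # sum of alphas = 0
              [0.0, 0, 0, 1, 1, 1]])                         # sum of betas = 0
beta_side = np.linalg.solve(X.T @ X + T.T @ T, X.T @ y)      # (X'X + T'T)^{-1} X'y
beta_ginv = np.linalg.pinv(X.T @ X) @ X.T @ y                # another solution of the normal equations

lam = np.array([0.0, 1, -1, 0, 0, 0])                        # estimable contrast alpha1 - alpha2
print(np.allclose(X @ beta_side, X @ beta_ginv))             # same fitted values
print(lam @ beta_side, lam @ beta_ginv)                      # same estimate of alpha1 - alpha2
print(T @ beta_side)                                         # ~ (0, 0): the side conditions hold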

Hypothesis Testing:

For inference, we need distributional assumptions, so throughout this sec- tion we assume the non-full rank linear model with normal errors:

y = Xβ + e,  e ~ N_n(0, σ² I_n),   (♠)

where X is n × p with rank k < p ≤ n.

193 Testable hypotheses:

Suppose we are interested in testing a hypothesis of the form

H0 : Cβ = 0.

• Not all such hypotheses are testable. E.g., if Cβ is nonestimable (is a vector with non-estimable elements) then H0 cannot be tested.

• This should come as no surprise. E.g., in the one-way layout model yij = µ + αi + eij without any constraints, we cannot test µ = µ0 because µ is not identifiable. µ could be any one of an infinite number of values without changing the model (as long as we change the αi's accordingly), so how could we test whether it's equal to a given null value?

A hypothesis is said to be testable when we can calculate an F-statistic that is suitable for testing it. There are three conditions that C must satisfy for H0 to be testable:

1. Cβ must be estimable (must have estimable elements). ⇒ the rows of C must lie in the row space of X. ⇒ there must exist an A so that C^T = X^T A or, equivalently, C = A^T X for some A. (It would be meaningless to test hypotheses about nonestimable quantities anyway.)

2. C must have full row rank. I.e., for C m × p, we require rank(C) = m. ⇒ this ensures that the hypothesis contains no redundant statements.

– E.g., suppose β is 3 × 1 and we wish to test the hypothesis that β1 = β2 = β3. We can express this hypothesis in the form H0 : Cβ = 0 for (infinitely) many choices of C. A valid choice of C is

C = [1 −1 0; 0 1 −1].

Notice that this 2 × 3 matrix has row rank 2.

An invalid choice of C is

C = [1 −1 0; 0 1 −1; 1 0 −1].

Notice that the last row is redundant (given that β1 = β2 and β2 = β3, it's redundant to require that β1 = β3), and rank(C) = 2.

3. C must have no more than rank(X) = k rows. This is because, in general, one can only construct up to rank(X) linearly independent estimable functions.

Subject to these conditions, H0 is testable. As in the full rank case, there are two equivalent approaches to testing H0:

1. Formulate a test statistic based on a comparison between Cβˆ and its null value 0.

2. Recast the null hypothesis H0 : Cβ = 0 in an equivalent form H0 : µ ∈ V0 for µ = E(y) = Xβ and an appropriate subspace V0 ⊂ V = C(X). That is, translate testing H0 into a full vs. reduced model testing problem.

• In the full rank case we described both of these approaches. Because of time constraints, we’ll only describe approach 1 in the non-full rank case, but approach 2 follows in much the same way as before.

195 The General Linear Hypothesis:

As in the full rank case, our F test is based on the quadratic form

{Cβˆ − E0(Cβˆ)}^T {var̂0(Cβˆ)}^{-1} {Cβˆ − E0(Cβˆ)} = (Cβˆ)^T {var̂0(Cβˆ)}^{-1} (Cβˆ).

(Here the 0 subscript indicates that the expected value and variance are computed under H0 : Cβ = 0.)

The theorem leading to the F statistic is as follows:

Theorem: In model (♠), if C is m × p of rank m ≤ k = rank(X) such that Cβ is a set of m LIN estimable functions, and if βˆ = GXT y, for some generalized inverse G of XT X, then

(i) CGCT is nonsingular and invariant to the choice of G;

(ii) Cβˆ ~ N_m(Cβ, σ² CGC^T);

(iii) SSH/σ2 = (Cβˆ)T (CGCT )−1Cβˆ/σ2 ∼ χ2(m, λ), where

λ = (Cβ)T (CGCT )−1Cβ/(2σ2);

(iv) SSE/σ2 = yT (I − XGXT )y/σ2 ∼ χ2(n − k); and

(v) SSH and SSE are independent.

Proof:

(i) Since Cβ is a vector of estimable function, there must exist an A so that C = AT X. Therefore,

CGC^T = A^T XGX^T A = A^T P_{C(X)} A,

which is unique. To show CGCT is nonsingular we show that it is of full rank.

196 In general, we have the following two results about rank: (i) for any matrix M, rank(MT M) = rank(MT ); and (ii) for any matrices M and N, rank(MN) ≤ rank(M).

In addition, G can be chosen to be a symmetric generalized inverse of XT X (this is always possible), so G can be written G = LT L for some L.

Therefore,

rank(CGCT ) = rank(CLT LCT ) = rank(CLT ) ≥ rank(CLT L) = rank(CG) = rank(AT XG) ≥ rank(AT XGXT X) = rank(AT X) = rank(C) = m

So we’ve established that rank(CGCT ) ≥ m. But, since CGCT is m × m it follows that rank(CGCT ) ≤ m. Together, these results imply rank(CGCT ) = m.

(ii) By the theorem on p. 176,

ˆ T 2 T β ∼ Np[GX Xβ, σ GX XG].

Therefore, since Cβˆ is an affine transformation of a normal random vector, and C = A^T X for some A,

Cβˆ ~ N_m[CGX^T Xβ, σ² CGX^T XGC^T]
= N_m[A^T (XGX^T X) β, σ² A^T (XGX^T X) GX^T A]   (and XGX^T X = X)
= N_m(Cβ, σ² CGC^T).

197 (iii) By part (ii), var(Cβˆ) = σ2CGCT . Therefore, SSH/σ2 is a quadratic form in a normal random vector. Since,

σ2[CGCT ]−1CGCT /σ2 = I,

the result follows from the theorem on the bottom of p. 81.

(iv) This was established in part (ii) of the theorem on p. 176.

(v) Homework.

Putting the results of this theorem together, we obtain the F statistic for testing H0 : Cβ = 0 for H0 a testable hypothesis:

Theorem: In the setup of the previous theorem, the F statistic for testing H0 : Cβ = 0 is as follows:

F = (SSH/m) / [SSE/(n − k)] = [(Cβˆ)^T [CGC^T]^{-1} Cβˆ/m] / [SSE/(n − k)] ~ F(m, n − k, λ),

where G is a generalized inverse of X^T X and

λ = (1/(2σ²)) (Cβ)^T (CGC^T)^{-1} (Cβ) in general, and λ = 0 under H0.

Proof: Follows directly from the previous theorem and the definition of the noncentral F distribution.

• As in the full rank case, this F statistic can be extended to test a hypothesis of the form H0 : Cβ = t for t a vector of constants, Cβ estimable, and C of full row rank. The resulting test statistic is identical to F above, but with Cβˆ replaced by Cβˆ − t.
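A sketch of this F statistic for a testable hypothesis in the one-way effects model (Python/numpy/scipy, simulated data; names mine). The hypothesis H0: α1 − α2 = 0 and α2 − α3 = 0 (no treatment differences) is testable because each row of C is an estimable contrast.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a, n_per = 3, 4
X = np.hstack([np.ones((a * n_per, 1)), np.kron(np.eye(a), np.ones((n_per, 1)))])
y = np.kron(np.array([10.0, 8.0, 12.0]), np.ones(n_per)) + rng.normal(size=a * n_per)

G = np.linalg.pinv(X.T @ X)                   # a symmetric generalized inverse of X'X
beta_hat = G @ X.T @ y
k = np.linalg.matrix_rank(X)                  # rank(X) = 3
SSE = y @ (np.eye(len(y)) - X @ G @ X.T) @ y

C = np.array([[0.0, 1, -1, 0],                # alpha1 - alpha2
              [0.0, 0, 1, -1]])               # alpha2 - alpha3
m = C.shape[0]
SSH = (C @ beta_hat) @ np.linalg.solve(C @ G @ C.T, C @ beta_hat)
F = (SSH / m) / (SSE / (len(y) - k))
print(F, stats.f.sf(F, m, len(y) - k))        # F statistic and its p-value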

198 Breaking up Sums of Squares:

Consider again the problem of testing a full versus reduced model. That is, suppose we are interested in testing the hypothesis H0 : β2 = 0 in the model

y = Xβ + e = (X1, X2) [β1; β2] + e = X1β1 + X2β2 + e,  e ~ N(0, σ²I).   (FM)

Under H0 : β2 = 0 the model becomes

y = X1β1 + e*,  e* ~ N(0, σ²I).   (RM)

The problem is to test

H0 : µ ∈ C(X1) (RM) versus H1 : µ ∈/ C(X1) under the maintained hypothesis that µ ∈ C(X) = C([X1, X2]) (FM).

When discussing the full rank CLM, we saw that the appropriate test statistic for this problem was

F = (||ŷ − ŷ0||²/h) / s² = [y^T (P_{C(X)} − P_{C(X1)}) y / h] / [y^T (I − P_{C(X)}) y / (n − p)]
  ~ F(h, n − p) under H0, and ~ F(h, n − p, λ1) under H1,

where

λ1 = (1/(2σ²)) ||(P_{C(X)} − P_{C(X1)}) µ||² = (1/(2σ²)) ||µ − µ0||²,

h = dim(β2) = rank(X2) = rank(X) − rank(X1), and p = dim(β) = rank(X) = the number of columns in X (which we called k + 1 previously).

199 More generally, in the not-necessarily full rank CLM where X is n×p with rank(X) = k ≤ p < n, this result generalizes:

F = (||ŷ − ŷ0||²/m) / s² = [y^T (P_{C(X)} − P_{C(X1)}) y / m] / [y^T (I − P_{C(X)}) y / (n − k)]
  ~ F(m, n − k) under H0, and ~ F(m, n − k, λ1) under H1,

where λ1 is as before and m = rank(X) − rank(X1).

Recall that the squared projection length y^T (P_{C(X)} − P_{C(X1)}) y in the numerator of this F statistic is equal to

SS(β2|β1) ≡ SSR(FM) − SSR(RM) = SSE(RM) − SSE(FM)

so that F = [SS(β2|β1)/m] / MSE.

The quantity SS(β2|β1) goes by several different names (the extra regression sum of squares, or the reduction in error sum of squares due to β2 after fitting β1) and several different notations (R(β2|β1), SSR(X2|X1), etc.).

Regardless of the notation or terminology, SS(β2|β1) is a sum of squares that quantifies the amount of variability in y accounted for by adding X2β2 to the regression model that includes only X1β1.

That is, it quantifies the contribution to the regression model of the explanatory variables in X2 above and beyond the explanatory variables in X1.
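A short Python/numpy sketch (simulated data; names mine) computing SS(β2|β1) both as y^T(P_{C(X)} − P_{C(X1)})y and as SSE(RM) − SSE(FM), together with the corresponding F statistic:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)      # here beta2 = 0, so H0 holds

def proj(M):
    return M @ np.linalg.pinv(M.T @ M) @ M.T            # projection onto C(M)

SSE_full = y @ (np.eye(n) - proj(X)) @ y
SSE_red = y @ (np.eye(n) - proj(X1)) @ y
extra_SS = y @ (proj(X) - proj(X1)) @ y                 # SS(beta2 | beta1)
print(np.isclose(extra_SS, SSE_red - SSE_full))         # the two expressions agree

k, k1 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X1)
m = k - k1
F = (extra_SS / m) / (SSE_full / (n - k))
print(F, stats.f.sf(F, m, n - k))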

200