
Lecture 13: Simple Linear Regression in Matrix Format

To move beyond simple regression we need to use matrix algebra. We'll start by re-expressing simple linear regression in matrix form. Linear algebra is a pre-requisite for this class; I strongly urge you to go back to your textbook and notes for review.

1 Expectations and Variances with Vectors and Matrices

If we have $p$ random variables, $Z_1, Z_2, \ldots, Z_p$, we can put them into a random vector $Z = [Z_1\ Z_2\ \ldots\ Z_p]^T$. This random vector can be thought of as a $p \times 1$ matrix of random variables. The expected value of $Z$ is defined to be the vector
\[
\mu \equiv E[Z] = \begin{bmatrix} E[Z_1] \\ E[Z_2] \\ \vdots \\ E[Z_p] \end{bmatrix}. \tag{1}
\]
If $a$ and $b$ are non-random scalars, then

E [aZ + bW] = aE [Z] + bE [W] . (2)

If $a$ is a non-random vector, then $E[a^T Z] = a^T E[Z]$. If $A$ is a non-random matrix, then
\[
E[AZ] = A\,E[Z]. \tag{3}
\]
Every coordinate of a random vector has some covariance with every other coordinate. The variance-covariance matrix of $Z$ is the $p \times p$ matrix which stores these values. In other words,
\[
\mathrm{Var}[Z] \equiv \begin{bmatrix}
\mathrm{Var}[Z_1] & \mathrm{Cov}[Z_1, Z_2] & \cdots & \mathrm{Cov}[Z_1, Z_p] \\
\mathrm{Cov}[Z_2, Z_1] & \mathrm{Var}[Z_2] & \cdots & \mathrm{Cov}[Z_2, Z_p] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}[Z_p, Z_1] & \mathrm{Cov}[Z_p, Z_2] & \cdots & \mathrm{Var}[Z_p]
\end{bmatrix}. \tag{4}
\]

This inherits properties of ordinary variances and covariances. Just as $\mathrm{Var}[Z] = E[Z^2] - (E[Z])^2$ for a scalar, we have
\[
\mathrm{Var}[Z] = E[ZZ^T] - E[Z]\,(E[Z])^T. \tag{5}
\]
For a non-random vector $a$ and a non-random scalar $b$,

\[
\mathrm{Var}[a + bZ] = b^2\,\mathrm{Var}[Z]. \tag{6}
\]

For a non-random matrix $C$,
\[
\mathrm{Var}[CZ] = C\,\mathrm{Var}[Z]\,C^T. \tag{7}
\]
(Check that the dimensions all conform here: if $C$ is $q \times p$, $\mathrm{Var}[CZ]$ should be $q \times q$, and so is the right-hand side.)
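If you want to convince yourself of equations (3) and (7) numerically, here is a minimal simulation sketch in Python/NumPy. It is not part of the derivation; the dimensions, the mixing matrix, and the matrix C below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random vector Z with p = 3 coordinates: correlated Gaussians built by
# linearly mixing independent noise and shifting the mean.
p, n_sim = 3, 200_000
mix = np.array([[1.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.2, 0.3, 1.0]])
Z = rng.standard_normal((n_sim, p)) @ mix.T + np.array([1.0, -2.0, 0.5])

C = np.array([[2.0, -1.0, 0.0],
              [0.0,  1.0, 3.0]])        # a non-random 2 x 3 matrix

mean_Z = Z.mean(axis=0)
var_Z = np.cov(Z, rowvar=False)
var_CZ = np.cov(Z @ C.T, rowvar=False)

# Both differences should be close to zero, up to Monte Carlo error:
print(np.abs((Z @ C.T).mean(axis=0) - C @ mean_Z).max())   # E[CZ] = C E[Z]
print(np.abs(var_CZ - C @ var_Z @ C.T).max())               # Var[CZ] = C Var[Z] C^T
```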

A random vector $Z$ has a multivariate Normal distribution with mean $\mu$ and variance $\Sigma$ if its density is
\[
f(z) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right).
\]
We write this as $Z \sim N(\mu, \Sigma)$ or $Z \sim MVN(\mu, \Sigma)$.

If $A$ is a square matrix, then the trace of $A$ — denoted by $\mathrm{tr}\,A$ — is defined to be the sum of the diagonal elements. In other words,
\[
\mathrm{tr}\,A = \sum_j A_{jj}.
\]
Recall that the trace satisfies these properties:
\[
\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B), \quad \mathrm{tr}(cA) = c\,\mathrm{tr}(A), \quad \mathrm{tr}(A^T) = \mathrm{tr}(A),
\]
and we have the cyclic property $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$.

If $C$ is non-random, then $Z^T C Z$ is called a quadratic form. We have that
\[
E[Z^T C Z] = E[Z]^T C\, E[Z] + \mathrm{tr}[C\,\mathrm{Var}[Z]]. \tag{8}
\]
To see this, notice that
\[
Z^T C Z = \mathrm{tr}(Z^T C Z) \tag{9}
\]
because it's a $1 \times 1$ matrix. But the trace of a matrix product doesn't change when we cyclically permute the matrices, so
\[
Z^T C Z = \mathrm{tr}(C Z Z^T). \tag{10}
\]
Therefore
\[
\begin{aligned}
E[Z^T C Z] &= E[\mathrm{tr}(C Z Z^T)] && (11) \\
&= \mathrm{tr}(E[C Z Z^T]) && (12) \\
&= \mathrm{tr}(C\, E[Z Z^T]) && (13) \\
&= \mathrm{tr}(C(\mathrm{Var}[Z] + E[Z]\,E[Z]^T)) && (14) \\
&= \mathrm{tr}(C\,\mathrm{Var}[Z]) + \mathrm{tr}(C\, E[Z]\, E[Z]^T) && (15) \\
&= \mathrm{tr}(C\,\mathrm{Var}[Z]) + \mathrm{tr}(E[Z]^T C\, E[Z]) && (16) \\
&= \mathrm{tr}(C\,\mathrm{Var}[Z]) + E[Z]^T C\, E[Z], && (17)
\end{aligned}
\]
using the fact that tr is a linear operation, so it commutes with taking expectations; the decomposition of $\mathrm{Var}[Z]$; the cyclic permutation trick again; and finally dropping tr from a scalar.

Unfortunately, there is generally no simple formula for the variance of a quadratic form, unless the random vector is Gaussian. If $Z \sim N(\mu, \Sigma)$, then
\[
\mathrm{Var}(Z^T C Z) = 2\,\mathrm{tr}(C \Sigma C \Sigma) + 4 \mu^T C \Sigma C \mu.
\]
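Equation (8), and the Gaussian variance formula above, are also easy to check by Monte Carlo. The sketch below is purely illustrative: the dimension, $\mu$, $\Sigma$, and $C$ are made-up values, and $C$ is taken symmetric so the variance formula applies without worrying about symmetrizing.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 3
mu = np.array([1.0, -1.0, 2.0])
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)          # a valid (positive definite) covariance matrix
B = rng.standard_normal((p, p))
C = B + B.T                          # symmetric, non-random

# Simulate many draws of Z ~ N(mu, Sigma) and form the quadratic form Z^T C Z
Z = rng.multivariate_normal(mu, Sigma, size=500_000)
q = np.einsum("ni,ij,nj->n", Z, C, Z)

# Theory: E[Z^T C Z] = mu^T C mu + tr(C Sigma)
e_theory = mu @ C @ mu + np.trace(C @ Sigma)
# Theory (Gaussian case): Var(Z^T C Z) = 2 tr(C Sigma C Sigma) + 4 mu^T C Sigma C mu
v_theory = 2 * np.trace(C @ Sigma @ C @ Sigma) + 4 * mu @ C @ Sigma @ C @ mu

print(q.mean(), e_theory)   # should agree up to Monte Carlo error
print(q.var(), v_theory)
```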

2 Least Squares in Matrix Form

Our data consists of $n$ paired observations of the predictor variable $X$ and the response variable $Y$, i.e., $(X_1, Y_1), \ldots, (X_n, Y_n)$. We wish to fit the model

\[
Y = \beta_0 + \beta_1 X + \epsilon \tag{18}
\]
where $E[\epsilon | X = x] = 0$, $\mathrm{Var}[\epsilon | X = x] = \sigma^2$, and $\epsilon$ is uncorrelated across measurements.

2.1 The Basic Matrices

\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}. \tag{19}
\]

Note that $X$ — which is called the design matrix — is an $n \times 2$ matrix, where the first column is always 1, and the second column contains the actual observations of $X$. Now
\[
X\beta = \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix}. \tag{20}
\]

So we can write the set of equations

\[
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \ldots, n
\]
in the simpler form
\[
Y = X\beta + \epsilon.
\]
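To make the notation concrete, here is a short NumPy sketch that builds the design matrix from a vector of predictor values and simulates a response $Y = X\beta + \epsilon$. The sample size, coefficients, and noise level are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
x = rng.uniform(0, 10, size=n)                  # observed predictor values
X = np.column_stack([np.ones(n), x])            # n x 2 design matrix: column of 1s, then X_i

beta = np.array([3.0, -2.0])                    # (beta_0, beta_1), chosen for illustration
sigma = 1.5
eps = rng.normal(0.0, sigma, size=n)            # noise with mean 0 and variance sigma^2

Y = X @ beta + eps                              # the model Y = X beta + epsilon
print(X.shape, Y.shape)                         # (100, 2) (100,)
```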

2.2 Mean Squared Error

Let
\[
e \equiv e(\beta) = Y - X\beta. \tag{21}
\]
The training error (or observed mean squared error) is

\[
MSE(\beta) = \frac{1}{n} \sum_{i=1}^n e_i^2(\beta) = \frac{1}{n} e^T e. \tag{22}
\]
Let us expand this a little for further use. We have:
\[
\begin{aligned}
MSE(\beta) &= \frac{1}{n} e^T e && (23) \\
&= \frac{1}{n} (Y - X\beta)^T (Y - X\beta) && (24) \\
&= \frac{1}{n} (Y^T - \beta^T X^T)(Y - X\beta) && (25) \\
&= \frac{1}{n} \left( Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \right) && (26) \\
&= \frac{1}{n} \left( Y^T Y - 2 \beta^T X^T Y + \beta^T X^T X \beta \right) && (27)
\end{aligned}
\]
where we used the fact that $\beta^T X^T Y = (Y^T X \beta)^T = Y^T X \beta$.

2.3 Minimizing the MSE

First, we find the gradient of the MSE with respect to $\beta$:
\[
\begin{aligned}
\nabla MSE(\beta) &= \frac{1}{n} \left( \nabla Y^T Y - 2 \nabla \beta^T X^T Y + \nabla \beta^T X^T X \beta \right) && (28) \\
&= \frac{1}{n} \left( 0 - 2 X^T Y + 2 X^T X \beta \right) && (29) \\
&= \frac{2}{n} \left( X^T X \beta - X^T Y \right) && (30)
\end{aligned}
\]

We now set this to zero at the optimum, βb:

XT Xβb − XT Y = 0. (31)

This equation, for the two-dimensional vector βb, corresponds to our pair of normal or estimating equations for βb0 and βb1. Thus, it, too, is called an estimating equation. Solving, we get

βb = (XT X)−1XT Y. (32)
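Equation (32) is one line of NumPy. Here is a minimal sketch on the same kind of made-up simulated data as above; in practice one would use a dedicated least-squares solver rather than forming the inverse explicitly, but all three computations below should agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
Y = X @ np.array([3.0, -2.0]) + rng.normal(0.0, 1.5, size=n)

# Direct transcription of equation (32): beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Numerically preferable alternatives that give the same answer
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)                                                   # close to (3, -2), up to noise
print(np.allclose(beta_hat, beta_solve), np.allclose(beta_hat, beta_lstsq))
```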

That is, we’ve got one matrix equation which gives us both coefficient estimates. If this is right, the equation we’ve got above should in fact reproduce the least-squares estimates we’ve already derived, which are of course

\[
\hat{\beta}_1 = \frac{c_{XY}}{s_X^2}, \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \tag{33}
\]
Let's see if that's right. As a first step, let's introduce normalizing factors of $1/n$ into both the matrix products:

\[
\hat{\beta} = (n^{-1} X^T X)^{-1} (n^{-1} X^T Y). \tag{34}
\]

Now
\[
\frac{1}{n} X^T Y = \frac{1}{n} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \tag{35}
\]
\[
= \frac{1}{n} \begin{bmatrix} \sum_i Y_i \\ \sum_i X_i Y_i \end{bmatrix} = \begin{bmatrix} \bar{Y} \\ \overline{XY} \end{bmatrix}. \tag{36}
\]
Similarly,
\[
\frac{1}{n} X^T X = \frac{1}{n} \begin{bmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{bmatrix} = \begin{bmatrix} 1 & \bar{X} \\ \bar{X} & \overline{X^2} \end{bmatrix}. \tag{37}
\]
Hence,
\[
\left( \frac{1}{n} X^T X \right)^{-1} = \frac{1}{\overline{X^2} - \bar{X}^2} \begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix} = \frac{1}{s_X^2} \begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix}.
\]

Therefore,
\[
\begin{aligned}
(X^T X)^{-1} X^T Y &= \frac{1}{s_X^2} \begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix} \begin{bmatrix} \bar{Y} \\ \overline{XY} \end{bmatrix} && (38) \\
&= \frac{1}{s_X^2} \begin{bmatrix} \overline{X^2}\,\bar{Y} - \bar{X}\,\overline{XY} \\ -\bar{X}\,\bar{Y} + \overline{XY} \end{bmatrix} && (39) \\
&= \frac{1}{s_X^2} \begin{bmatrix} (s_X^2 + \bar{X}^2)\bar{Y} - \bar{X}(c_{XY} + \bar{X}\bar{Y}) \\ c_{XY} \end{bmatrix} && (40) \\
&= \frac{1}{s_X^2} \begin{bmatrix} s_X^2 \bar{Y} + \bar{X}^2 \bar{Y} - \bar{X} c_{XY} - \bar{X}^2 \bar{Y} \\ c_{XY} \end{bmatrix} && (41) \\
&= \begin{bmatrix} \bar{Y} - \frac{c_{XY}}{s_X^2}\bar{X} \\[4pt] \frac{c_{XY}}{s_X^2} \end{bmatrix} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}. && (42)
\end{aligned}
\]

3 Fitted Values and Residuals

Remember that when the coefficient vector is β, the point predictions (fitted values) for each data point are Xβ. Thus the vector of fitted values is

\[
\hat{Y} \equiv \widehat{m}(X) \equiv \widehat{m} = X\hat{\beta}.
\]
Using our equation for $\hat{\beta}$, we then have

\[
\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY
\]
where
\[
H \equiv X(X^T X)^{-1} X^T \tag{43}
\]
is called the hat matrix or the influence matrix. Let's look at some of the properties of the hat matrix.

1. Influence. Check that $\partial \hat{Y}_i / \partial Y_j = H_{ij}$. Thus, $H_{ij}$ is the rate at which the $i$-th fitted value changes as we vary the $j$-th observation, the "influence" that observation has on that fitted value.

2. Symmetry. It’s easy to see that HT = H.

3. Idempotency. Check that H2 = H, so the matrix is idempotent.

Geometry. A symmetric, idempotent matrix is a projection matrix. This means that $H$ projects $Y$ into a lower-dimensional subspace. Specifically, $Y$ is a point in $\mathbb{R}^n$, but $\hat{Y} = HY$ is a linear combination of just two vectors, namely, the two columns of $X$. In other words:

H projects Y onto the column space of X.

The column space of X is the set of vectors that can be written as linear combinations of the columns of X.
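These properties are easy to verify numerically. Below is a minimal sketch that forms $H$ explicitly for a small simulated design and checks symmetry, idempotency, and the fact that $HY$ reproduces the fitted values $X\hat{\beta}$. (Forming $H$ is fine at this scale, but it is an $n \times n$ matrix, so this would be wasteful for large $n$; all the particular numbers are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # the n x n hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(H, H.T))                    # symmetry
print(np.allclose(H @ H, H))                  # idempotency
print(np.allclose(H @ Y, X @ beta_hat))       # HY are the fitted values
print(np.allclose(np.trace(H), 2))            # tr H = number of columns of X
```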

3.1 Residuals

The vector of residuals, $e$, is

\[
e \equiv Y - \hat{Y} = Y - HY = (I - H)Y. \tag{44}
\]
Here are some properties of $I - H$:

1. Influence. $\partial e_i / \partial Y_j = (I - H)_{ij}$.

2. Symmetry. $(I - H)^T = I - H$.

3. Idempotency. $(I - H)^2 = (I - H)(I - H) = I - H - H + H^2$. But, since $H$ is idempotent, $H^2 = H$, and thus $(I - H)^2 = (I - H)$.

Thus,
\[
MSE(\hat{\beta}) = \frac{1}{n} Y^T (I - H)^T (I - H) Y = \frac{1}{n} Y^T (I - H) Y. \tag{45}
\]

3.2 Expectations and Covariances

Remember that $Y = X\beta + \epsilon$, where $\epsilon$ is an $n \times 1$ matrix of random variables, with mean vector 0 and variance-covariance matrix $\sigma^2 I$. What can we deduce from this? First, the expectation of the fitted values:

\[
\begin{aligned}
E[\hat{Y}] &= E[HY] = H E[Y] = HX\beta + H E[\epsilon] && (46) \\
&= X(X^T X)^{-1} X^T X \beta + 0 = X\beta. && (47)
\end{aligned}
\]
Next, the variance-covariance of the fitted values:
\[
\begin{aligned}
\mathrm{Var}[\hat{Y}] &= \mathrm{Var}[HY] = \mathrm{Var}[H(X\beta + \epsilon)] && (48) \\
&= \mathrm{Var}[H\epsilon] = H\,\mathrm{Var}[\epsilon]\,H^T = \sigma^2 H I H = \sigma^2 H && (49)
\end{aligned}
\]
using the symmetry and idempotency of $H$. Similarly, the expected residual vector is zero:

\[
E[e] = (I - H)(X\beta + E[\epsilon]) = X\beta - HX\beta = X\beta - X\beta = 0. \tag{50}
\]
The variance-covariance matrix of the residuals:
\[
\begin{aligned}
\mathrm{Var}[e] &= \mathrm{Var}[(I - H)(X\beta + \epsilon)] && (51) \\
&= \mathrm{Var}[(I - H)\epsilon] && (52) \\
&= (I - H)\,\mathrm{Var}[\epsilon]\,(I - H)^T && (53) \\
&= \sigma^2 (I - H)(I - H)^T && (54) \\
&= \sigma^2 (I - H). && (55)
\end{aligned}
\]
Thus, the variance of each residual is not quite $\sigma^2$, nor are the residuals exactly uncorrelated. Finally, the expected MSE is
\[
E\left[ \frac{1}{n} e^T e \right] = \frac{1}{n} E\left[ \epsilon^T (I - H) \epsilon \right]. \tag{56}
\]
By equation (8) for the expectation of a quadratic form, this is $\sigma^2 \mathrm{tr}(I - H)/n$; since $\mathrm{tr}\,H = \mathrm{tr}[X(X^T X)^{-1} X^T] = \mathrm{tr}[(X^T X)^{-1} X^T X] = \mathrm{tr}\,I_2 = 2$, this must be $(n - 2)\sigma^2/n$.
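Here is a hedged simulation sketch of these facts: over many simulated noise vectors on a fixed design, the empirical covariance of the residuals should approach $\sigma^2(I - H)$, and the average MSE should approach $(n - 2)\sigma^2/n$. The design, $\sigma$, and the number of replications below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, reps = 10, 2.0, 200_000
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
beta = np.array([1.0, -1.0])

# Many data sets sharing the same fixed design matrix X
eps = rng.normal(0.0, sigma, size=(reps, n))
Y = X @ beta + eps                       # shape (reps, n)
resid = Y @ (np.eye(n) - H).T            # residuals e = (I - H) Y, one row per replication

emp_cov = np.cov(resid, rowvar=False)
print(np.abs(emp_cov - sigma**2 * (np.eye(n) - H)).max())   # small (Monte Carlo error)

mse = np.mean(resid**2, axis=1)
print(mse.mean(), (n - 2) * sigma**2 / n)                   # should be close
```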

4 Sampling Distribution of Estimators

Let's now assume that the $\epsilon_i \sim N(0, \sigma^2)$ and are independent of each other and of $X$. The vector of all $n$ noise terms, $\epsilon$, is an $n \times 1$ matrix. Its distribution is a multivariate Gaussian or multivariate Normal with mean vector 0 and variance-covariance matrix $\sigma^2 I$. We write this as $\epsilon \sim N(0, \sigma^2 I)$. We may use this to get the sampling distribution of the estimator $\hat{\beta}$:

\[
\begin{aligned}
\hat{\beta} &= (X^T X)^{-1} X^T Y && (57) \\
&= (X^T X)^{-1} X^T (X\beta + \epsilon) && (58) \\
&= (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon && (59) \\
&= \beta + (X^T X)^{-1} X^T \epsilon && (60)
\end{aligned}
\]

Since $\epsilon$ is Gaussian and is being multiplied by a non-random matrix, $(X^T X)^{-1} X^T \epsilon$ is also Gaussian. Its mean vector is
\[
E\left[ (X^T X)^{-1} X^T \epsilon \right] = (X^T X)^{-1} X^T E[\epsilon] = 0, \tag{61}
\]
while its variance matrix is

\[
\begin{aligned}
\mathrm{Var}\left[ (X^T X)^{-1} X^T \epsilon \right] &= (X^T X)^{-1} X^T\, \mathrm{Var}[\epsilon]\, \left( (X^T X)^{-1} X^T \right)^T && (62) \\
&= (X^T X)^{-1} X^T\, \sigma^2 I\, X (X^T X)^{-1} && (63) \\
&= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} && (64) \\
&= \sigma^2 (X^T X)^{-1}. && (65)
\end{aligned}
\]

Since $\mathrm{Var}[\hat{\beta}] = \mathrm{Var}\left[ (X^T X)^{-1} X^T \epsilon \right]$, we conclude that

βb ∼ N(β, σ2(XT X)−1). (66)

Re-writing slightly,
\[
\hat{\beta} \sim N\!\left( \beta,\; \frac{\sigma^2}{n} (n^{-1} X^T X)^{-1} \right) \tag{67}
\]
will make it easier to prove to yourself that, according to this, $\hat{\beta}_0$ and $\hat{\beta}_1$ are both unbiased, that $\mathrm{Var}[\hat{\beta}_1] = \frac{\sigma^2}{n s_X^2}$, and that $\mathrm{Var}[\hat{\beta}_0] = \frac{\sigma^2}{n}\left(1 + \frac{\bar{X}^2}{s_X^2}\right)$. This will also give us $\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$, which otherwise would be tedious to calculate. I will leave you to show, in a similar way, that the fitted values $HY$ are multivariate Gaussian, as are the residuals $e$, and to find both their mean vectors and their variance matrices.
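The sampling distribution in (66) can also be checked by simulation: generate many data sets from the same fixed design, estimate $\hat{\beta}$ on each, and compare the empirical covariance of the estimates against $\sigma^2(X^T X)^{-1}$. All the specific numbers below are, again, illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, reps = 30, 1.0, 100_000
x = rng.uniform(0, 5, size=n)           # fixed predictor values
X = np.column_stack([np.ones(n), x])
beta = np.array([0.5, 2.0])

# Many independent realizations of Y = X beta + eps, eps ~ N(0, sigma^2 I)
eps = rng.normal(0.0, sigma, size=(reps, n))
Y = X @ beta + eps

# beta_hat for every replication: solve (X^T X) beta_hat = X^T Y
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y.T).T     # shape (reps, 2)

print(beta_hats.mean(axis=0))                          # close to beta: unbiased
print(np.cov(beta_hats, rowvar=False))                 # close to ...
print(sigma**2 * np.linalg.inv(X.T @ X))               # ... sigma^2 (X^T X)^{-1}
```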

5 Derivatives with Respect to Vectors

This is a brief review of basic vector calculus. Consider some scalar function of a vector, say $f(X)$, where $X$ is represented as a $p \times 1$ matrix. (Here $X$ is just being used as a place-holder or generic variable; it's not necessarily the design matrix of a regression.) We would like to think about the derivatives of $f$ with respect to $X$. We can write $f(X) = f(x_1, \ldots, x_p)$ where $X = (x_1, \ldots, x_p)^T$.

The gradient of $f$ is the vector of partial derivatives:

\[
\nabla f \equiv \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_p} \end{bmatrix}. \tag{68}
\]

The first-order Taylor series of $f$ around $X^0$ is
\[
\begin{aligned}
f(X) &\approx f(X^0) + \sum_{i=1}^p (X - X^0)_i \left. \frac{\partial f}{\partial x_i} \right|_{X^0} && (69) \\
&= f(X^0) + (X - X^0)^T \nabla f(X^0). && (70)
\end{aligned}
\]

Here are some properties of the gradient:

1. Linearity.
\[
\nabla \left( a f(X) + b g(X) \right) = a \nabla f(X) + b \nabla g(X) \tag{71}
\]
Proof: Directly from the linearity of partial derivatives.

2. Linear forms. If $f(X) = X^T a$, with $a$ not a function of $X$, then

\[
\nabla (X^T a) = a \tag{72}
\]
Proof: $f(X) = \sum_i x_i a_i$, so $\partial f / \partial x_i = a_i$. Notice that $a$ was already a $p \times 1$ matrix, so we don't have to transpose anything to get the derivative.

3. Linear forms the other way. If f(X) = bX, with b not a function of X, then

\[
\nabla (bX) = b^T \tag{73}
\]

Proof: Once again, ∂f/∂Xi = bi, but now remember that b was a 1 × p matrix, and ∇f is p × 1, so we need to transpose.

4. Quadratic forms. Let C be a p × p matrix which is not a function of X, and consider the quadratic form XT CX. (You can check that this is scalar.) The gradient is

\[
\nabla (X^T C X) = (C + C^T) X. \tag{74}
\]

Proof: First, write out the matrix multiplications as explicit sums:

\[
X^T C X = \sum_{j=1}^p x_j \sum_{k=1}^p c_{jk} x_k = \sum_{j=1}^p \sum_{k=1}^p x_j c_{jk} x_k. \tag{75}
\]

Now take the derivative with respect to xi:

\[
\frac{\partial f}{\partial x_i} = \sum_{j=1}^p \sum_{k=1}^p \frac{\partial (x_j c_{jk} x_k)}{\partial x_i}. \tag{76}
\]

If $j = k = i$, the term in the inner sum is $2 c_{ii} x_i$. If $j = i$ but $k \neq i$, the term in the inner sum is $c_{ik} x_k$. If $j \neq i$ but $k = i$, we get $x_j c_{ji}$. Finally, if $j \neq i$ and $k \neq i$, we get zero. The $j = i$ terms add up to $(CX)_i$. The $k = i$ terms add up to $(C^T X)_i$. (This splits the $2 c_{ii} x_i$ evenly between them.) Thus,
\[
\frac{\partial f}{\partial x_i} = \left( (C + C^T) X \right)_i \tag{77}
\]
and
\[
\nabla f = (C + C^T) X. \tag{78}
\]
(Check that this has the right dimensions. A quick numerical check of this rule appears after the list.)

5. Symmetric quadratic forms. If $C = C^T$, then

\[
\nabla (X^T C X) = 2 C X. \tag{79}
\]
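To make rules 4 and 5 concrete, here is a small sketch comparing the formula $(C + C^T)X$ against a numerical finite-difference gradient; the dimension, the matrix, and the step size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 4
C = rng.standard_normal((p, p))       # not necessarily symmetric
x = rng.standard_normal(p)

f = lambda v: v @ C @ v               # the quadratic form x^T C x (a scalar)

# Central finite-difference approximation of the gradient
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(p)])

grad_formula = (C + C.T) @ x          # rule 4
print(np.allclose(grad_fd, grad_formula, atol=1e-5))   # True

# For symmetric C the formula reduces to 2 C x (rule 5)
S = C + C.T
print(np.allclose((S + S.T) @ x, 2 * S @ x))            # True
```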

5.1 Second Derivatives The p × p matrix of second partial derivatives is called the Hessian. I won’t step through its properties, except to note that they, too, follow from the basic rules for partial derivatives.

5.2 Maxima and Minima We need all the partial derivatives to be equal to zero at a minimum or maximum. This means that the gradient must be zero there. At a minimum, the Hessian must be positive-definite (so that moves away from the minimum always increase the function); at a maximum, the Hessian must be negative definite (so moves away always decrease the function). If the Hessian is neither positive nor negative definite, the point is neither a minimum nor a maximum, but a “saddle” (since moving in some directions increases the function but moving in others decreases it, as though one were at the center of a horse’s saddle).
