Lecture 13: Simple Linear Regression in Matrix Format
11:55 Wednesday 14th October, 2015
See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/

36-401, Section B, Fall 2015
13 October 2015

Contents

1 Least Squares in Matrix Form
  1.1 The Basic Matrices
  1.2 Mean Squared Error
  1.3 Minimizing the MSE
2 Fitted Values and Residuals
  2.1 Residuals
  2.2 Expectations and Covariances
3 Sampling Distribution of Estimators
4 Derivatives with Respect to Vectors
  4.1 Second Derivatives
  4.2 Maxima and Minima
5 Expectations and Variances with Vectors and Matrices
6 Further Reading

So far, we have not used any notions, or notation, that goes beyond basic algebra and calculus (and probability). This has forced us to do a fair amount of book-keeping, as it were, by hand. This is just about tolerable for the simple linear model, with one predictor variable. It will get intolerable if we have multiple predictor variables. Fortunately, a little application of linear algebra will let us abstract away from a lot of the book-keeping details, and make multiple linear regression hardly more complicated than the simple version.¹

¹ Historically, linear models with multiple predictors evolved before the use of matrix algebra for regression. You may imagine the resulting drudgery.

These notes will not remind you of how matrix algebra works. However, they will review some results about calculus with matrices, and about expectations and variances with vectors and matrices.

Throughout, bold-faced letters will denote matrices, as $\mathbf{a}$ as opposed to a scalar $a$.

1 Least Squares in Matrix Form

Our data consists of $n$ paired observations of the predictor variable $X$ and the response variable $Y$, i.e., $(x_1, y_1), \ldots, (x_n, y_n)$. We wish to fit the model
\[
Y = \beta_0 + \beta_1 X + \epsilon \tag{1}
\]
where $E[\epsilon \mid X = x] = 0$, $\mathrm{Var}[\epsilon \mid X = x] = \sigma^2$, and $\epsilon$ is uncorrelated across measurements.²

² When I need to also assume that $\epsilon$ is Gaussian, and strengthen "uncorrelated" to "independent", I'll say so.

1.1 The Basic Matrices

Group all of the observations of the response into a single column ($n \times 1$) matrix $\mathbf{y}$,
\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{2}
\]
Similarly, we group both the coefficients into a single vector (i.e., a $2 \times 1$ matrix)
\[
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \tag{3}
\]
We'd also like to group the observations of the predictor variable together, but we need something which looks a little unusual at first:
\[
\mathbf{x} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \tag{4}
\]
This is an $n \times 2$ matrix, where the first column is always 1, and the second column contains the actual observations of $X$. We have this apparently redundant first column because of what it does for us when we multiply $\mathbf{x}$ by $\beta$:
\[
\mathbf{x}\beta = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix} \tag{5}
\]
That is, $\mathbf{x}\beta$ is the $n \times 1$ matrix which contains the point predictions. The matrix $\mathbf{x}$ is sometimes called the design matrix.

1.2 Mean Squared Error

At each data point, using the coefficients $\beta$ results in some error of prediction, so we have $n$ prediction errors. These form a vector:
\[
e(\beta) = \mathbf{y} - \mathbf{x}\beta \tag{6}
\]
(You can check that this subtracts an $n \times 1$ matrix from an $n \times 1$ matrix.)

When we derived the least squares estimator, we used the mean squared error,
\[
MSE(\beta) = \frac{1}{n} \sum_{i=1}^{n} e_i^2(\beta) \tag{7}
\]
How might we express this in terms of our matrices? I claim that the correct form is
\[
MSE(\beta) = \frac{1}{n} e^T e \tag{8}
\]
To see this, look at what the matrix multiplication really involves:
\[
\begin{bmatrix} e_1 & e_2 & \ldots & e_n \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \tag{9}
\]
This, clearly, equals $\sum_i e_i^2$, so the MSE has the claimed form.
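This identity is easy to check numerically. Below is a minimal sketch in Python with NumPy (not part of the notes themselves; the simulated data and the names x1, beta, e are purely illustrative), building the design matrix with its column of ones and comparing the matrix form $\frac{1}{n}e^Te$ with the ordinary average of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n observations of one predictor and one response
n = 50
x1 = rng.normal(size=n)                    # raw predictor values
y = 2.0 + 3.0 * x1 + rng.normal(size=n)    # response with noise

# Design matrix: a column of ones, then the predictor (equation 4)
x = np.column_stack([np.ones(n), x1])

# Any candidate coefficient vector beta = (beta_0, beta_1)
beta = np.array([1.5, 2.5])

# Prediction errors e(beta) = y - x beta (equation 6)
e = y - x @ beta

# MSE two ways: the scalar average (7) and the matrix form (1/n) e^T e (8)
mse_scalar = np.mean(e**2)
mse_matrix = (e @ e) / n

assert np.isclose(mse_scalar, mse_matrix)
```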
Let us expand this a little for further use.
\[
MSE(\beta) = \frac{1}{n} e^T e \tag{10}
\]
\[
= \frac{1}{n} (\mathbf{y} - \mathbf{x}\beta)^T (\mathbf{y} - \mathbf{x}\beta) \tag{11}
\]
\[
= \frac{1}{n} (\mathbf{y}^T - \beta^T \mathbf{x}^T)(\mathbf{y} - \mathbf{x}\beta) \tag{12}
\]
\[
= \frac{1}{n} \left( \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{x}\beta - \beta^T\mathbf{x}^T\mathbf{y} + \beta^T\mathbf{x}^T\mathbf{x}\beta \right) \tag{13}
\]

1.3 Minimizing the MSE

Notice that $(\mathbf{y}^T\mathbf{x}\beta)^T = \beta^T\mathbf{x}^T\mathbf{y}$. Further notice that this is a $1 \times 1$ matrix, so $\mathbf{y}^T\mathbf{x}\beta = \beta^T\mathbf{x}^T\mathbf{y}$. Thus
\[
MSE(\beta) = \frac{1}{n} \left( \mathbf{y}^T\mathbf{y} - 2\beta^T\mathbf{x}^T\mathbf{y} + \beta^T\mathbf{x}^T\mathbf{x}\beta \right) \tag{14}
\]
First, we find the gradient of the MSE with respect to $\beta$:
\[
\nabla MSE(\beta) = \frac{1}{n} \left( \nabla \mathbf{y}^T\mathbf{y} - 2\nabla \beta^T\mathbf{x}^T\mathbf{y} + \nabla \beta^T\mathbf{x}^T\mathbf{x}\beta \right) \tag{15}
\]
\[
= \frac{1}{n} \left( 0 - 2\mathbf{x}^T\mathbf{y} + 2\mathbf{x}^T\mathbf{x}\beta \right) \tag{16}
\]
\[
= \frac{2}{n} \left( \mathbf{x}^T\mathbf{x}\beta - \mathbf{x}^T\mathbf{y} \right) \tag{17}
\]
We now set this to zero at the optimum, $\widehat{\beta}$:
\[
\mathbf{x}^T\mathbf{x}\widehat{\beta} - \mathbf{x}^T\mathbf{y} = 0 \tag{18}
\]
This equation, for the two-dimensional vector $\widehat{\beta}$, corresponds to our pair of normal or estimating equations for $\widehat{\beta}_0$ and $\widehat{\beta}_1$. Thus, it, too, is called an estimating equation. Solving,
\[
\widehat{\beta} = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} \tag{19}
\]
That is, we've got one matrix equation which gives us both coefficient estimates.

If this is right, the equation we've got above should in fact reproduce the least-squares estimates we've already derived, which are of course
\[
\widehat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2} \tag{20}
\]
and
\[
\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x} \tag{21}
\]
Let's see if that's right.

As a first step, let's introduce normalizing factors of $1/n$ into both the matrix products:
\[
\widehat{\beta} = (n^{-1}\mathbf{x}^T\mathbf{x})^{-1}(n^{-1}\mathbf{x}^T\mathbf{y}) \tag{22}
\]
Now let's look at the two factors in parentheses separately, from right to left.
\[
\frac{1}{n}\mathbf{x}^T\mathbf{y} = \frac{1}{n} \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{23}
\]
\[
= \frac{1}{n} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix} \tag{24}
\]
\[
= \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} \tag{25}
\]
Similarly for the other factor:
\[
\frac{1}{n}\mathbf{x}^T\mathbf{x} = \frac{1}{n} \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} \tag{26}
\]
\[
= \begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{bmatrix} \tag{27}
\]
Now we need to take the inverse:
\[
\left( \frac{1}{n}\mathbf{x}^T\mathbf{x} \right)^{-1} = \frac{1}{\overline{x^2} - \bar{x}^2} \begin{bmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \tag{28}
\]
\[
= \frac{1}{s_X^2} \begin{bmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \tag{29}
\]
Let's multiply together the pieces.
\[
(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} = \frac{1}{s_X^2} \begin{bmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} \bar{y} \\ \overline{xy} \end{bmatrix} \tag{30}
\]
\[
= \frac{1}{s_X^2} \begin{bmatrix} \overline{x^2}\,\bar{y} - \bar{x}\,\overline{xy} \\ -\bar{x}\bar{y} + \overline{xy} \end{bmatrix} \tag{31}
\]
\[
= \frac{1}{s_X^2} \begin{bmatrix} (s_X^2 + \bar{x}^2)\bar{y} - \bar{x}(c_{XY} + \bar{x}\bar{y}) \\ c_{XY} \end{bmatrix} \tag{32}
\]
\[
= \frac{1}{s_X^2} \begin{bmatrix} s_X^2\bar{y} + \bar{x}^2\bar{y} - \bar{x}c_{XY} - \bar{x}^2\bar{y} \\ c_{XY} \end{bmatrix} \tag{33}
\]
\[
= \begin{bmatrix} \bar{y} - \frac{c_{XY}}{s_X^2}\bar{x} \\ \frac{c_{XY}}{s_X^2} \end{bmatrix} \tag{34}
\]
which is what it should be.

So: $n^{-1}\mathbf{x}^T\mathbf{y}$ is keeping track of $\bar{y}$ and $\overline{xy}$, and $n^{-1}\mathbf{x}^T\mathbf{x}$ keeps track of $\bar{x}$ and $\overline{x^2}$. The matrix inversion and multiplication then handles all the book-keeping to put these pieces together to get the appropriate (sample) variances, covariance, and intercepts. We don't have to remember that any more; we can just remember the one matrix equation, and then trust the linear algebra to take care of the details.

2 Fitted Values and Residuals

Remember that when the coefficient vector is $\beta$, the point predictions for each data point are $\mathbf{x}\beta$. Thus the vector of fitted values, $\widehat{m}(\mathbf{x})$, or $\widehat{\mathbf{m}}$ for short, is
\[
\widehat{\mathbf{m}} = \mathbf{x}\widehat{\beta} \tag{35}
\]
Using our equation for $\widehat{\beta}$,
\[
\widehat{\mathbf{m}} = \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y} \tag{36}
\]
Notice that the fitted values are linear in $\mathbf{y}$. The matrix
\[
\mathbf{H} \equiv \mathbf{x}(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \tag{37}
\]
does not depend on $\mathbf{y}$ at all, but does control the fitted values:
\[
\widehat{\mathbf{m}} = \mathbf{H}\mathbf{y} \tag{38}
\]
If we repeat our experiment (survey, observation, ...) many times at the same $\mathbf{x}$, we get different $\mathbf{y}$ every time. But $\mathbf{H}$ does not change. The properties of the fitted values are thus largely determined by the properties of $\mathbf{H}$. It thus deserves a name; it's usually called the hat matrix, for obvious reasons, or, if we want to sound more respectable, the influence matrix.
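As a numerical sanity check on this algebra, here is a short NumPy sketch (again illustrative, with simulated data and assumed names, not part of the original notes): the single matrix equation (19) reproduces the summary-statistic formulas (20) and (21), and multiplying $\mathbf{y}$ by the hat matrix returns the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)                      # illustrative predictor values
y = 2.0 + 3.0 * x1 + rng.normal(size=n)      # illustrative response

x = np.column_stack([np.ones(n), x1])        # design matrix (equation 4)

# One matrix equation for both coefficients (equation 19)
beta_hat = np.linalg.inv(x.T @ x) @ x.T @ y

# Summary-statistic formulas (equations 20-21), using the 1/n convention
s2_X = np.mean(x1**2) - np.mean(x1)**2               # sample variance of X
c_XY = np.mean(x1 * y) - np.mean(x1) * np.mean(y)    # sample covariance of X and Y
beta1_hat = c_XY / s2_X
beta0_hat = np.mean(y) - beta1_hat * np.mean(x1)
assert np.allclose(beta_hat, [beta0_hat, beta1_hat])

# Hat matrix (equation 37) and fitted values m_hat = H y (equation 38)
H = x @ np.linalg.inv(x.T @ x) @ x.T
m_hat = H @ y
assert np.allclose(m_hat, x @ beta_hat)
```

In practice one would solve the normal equations with np.linalg.lstsq (or np.linalg.solve) rather than forming the inverse explicitly; the explicit inverse is used here only to mirror equation (19).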
Let's look at some of the properties of the hat matrix.

1. Influence. Since $\mathbf{H}$ is not a function of $\mathbf{y}$, we can easily verify that $\partial \widehat{m}_i / \partial y_j = H_{ij}$. Thus, $H_{ij}$ is the rate at which the $i^{\mathrm{th}}$ fitted value changes as we vary the $j^{\mathrm{th}}$ observation, the "influence" that observation has on that fitted value.

2. Symmetry. It's easy to see that $\mathbf{H}^T = \mathbf{H}$.

3. Idempotency. A square matrix $\mathbf{a}$ is called idempotent³ when $\mathbf{a}^2 = \mathbf{a}$ (and so $\mathbf{a}^k = \mathbf{a}$ for any higher power $k$). Again, by writing out the multiplication, $\mathbf{H}^2 = \mathbf{H}$, so it's idempotent.

Idempotency, Projection, Geometry

Idempotency seems like the most obscure of these properties, but it's actually one of the more important. $\mathbf{y}$ and $\widehat{\mathbf{m}}$ are $n$-dimensional vectors. If we project a vector $\mathbf{u}$ on to the line in the direction of the length-one vector $\mathbf{v}$, we get
\[
\mathbf{v}\mathbf{v}^T\mathbf{u} \tag{39}
\]
(Check the dimensions: $\mathbf{u}$ and $\mathbf{v}$ are both $n \times 1$, so $\mathbf{v}^T$ is $1 \times n$, and $\mathbf{v}^T\mathbf{u}$ is $1 \times 1$.) If we group the first two terms together, like so,
\[
(\mathbf{v}\mathbf{v}^T)\mathbf{u} \tag{40}
\]
then $\mathbf{v}\mathbf{v}^T$ is the $n \times n$ projection matrix or projection operator for that line. Since $\mathbf{v}$ is a unit vector, $\mathbf{v}^T\mathbf{v} = 1$, and
\[
(\mathbf{v}\mathbf{v}^T)(\mathbf{v}\mathbf{v}^T) = \mathbf{v}\mathbf{v}^T \tag{41}
\]
so the projection operator for the line is idempotent.
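These properties are also easy to confirm numerically. A brief sketch, continuing the illustrative simulated-data setup used above (assumed, not from the notes): the hat matrix built from a full-rank design matrix is symmetric and idempotent, and so is the projection operator $\mathbf{v}\mathbf{v}^T$ for a unit vector $\mathbf{v}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x = np.column_stack([np.ones(n), x1])    # design matrix (equation 4)

# Hat matrix H = x (x^T x)^{-1} x^T (equation 37)
H = x @ np.linalg.inv(x.T @ x) @ x.T

# Symmetry: H^T = H
assert np.allclose(H, H.T)

# Idempotency: H^2 = H
assert np.allclose(H @ H, H)

# Projection onto the line spanned by a unit vector v: the operator v v^T
v = rng.normal(size=n)
v = v / np.linalg.norm(v)          # make v length one
P = np.outer(v, v)                 # n x n projection matrix (equation 40)
assert np.allclose(P @ P, P)       # idempotent, as in equation 41
```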