
MAS/MASOR/MASSM - Statistical Analysis - Autumn Term

7 Multiple Linear Regression

7.1 The multiple linear regression model

In multiple linear regression, a response variable y is expressed as a linear function of k regressor (or predictor or explanatory) variables, $x_1, x_2, \dots, x_k$, with corresponding model of the form
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon . \]

Suppose that n observations have been made of the variables (where n ≥ k + 1), with $x_{ij}$ the i-th observed value of the j-th regressor variable, corresponding to the i-th observed value $y_i$ of the response variable. Thus the data could be tabulated in the following form of a data matrix.

Observation                    Variable
  number        y      x1      x2     ...     xk
     1          y1     x11     x12    ...     x1k
     2          y2     x21     x22    ...     x2k
     :          :       :       :              :
     n          yn     xn1     xn2    ...     xnk

The multiple linear regression model is

\[ y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, n, \tag{1} \]

where the $x_{ij}$, i = 1, ..., n, j = 1, ..., k, are regarded as fixed, $\beta_0, \beta_1, \dots, \beta_k$ are unknown parameters and the errors $\epsilon_i$, i = 1, ..., n, are assumed to be NID(0, σ²), with σ² unknown.

The model (1) may be written in matrix notation as
\[ y = X\beta + \epsilon, \tag{2} \]
where y is an n × 1 vector of observations,

\[ y = (y_1, y_2, \dots, y_n)', \]
β is a p × 1 vector of parameters, where p = k + 1,

\[ \beta = (\beta_0, \beta_1, \dots, \beta_k)', \]
ε is an n × 1 vector of errors,
\[ \epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)', \]
and X is an n × p matrix, the design matrix,
\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1k} \\
1 & x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix}.
\]
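As a concrete illustration of this notation (an added sketch, not part of the original notes, using invented regressor values), the following numpy snippet builds a design matrix of the above form by prepending a column of ones to the matrix of regressor values.

```python
import numpy as np

# Invented data: n = 5 observations on k = 2 regressor variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.0],
              [5.0, 4.0]])   # n x k matrix of regressor values x_ij
n, k = x.shape

# Design matrix X: a leading column of ones (for beta_0) followed by the regressors.
X = np.column_stack([np.ones(n), x])   # n x p, with p = k + 1
p = k + 1

print(X.shape)                      # (5, 3)
print(np.linalg.matrix_rank(X))     # 3, i.e. full rank p
```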

Equation (2) expresses the regression model in the form of what is known as the general linear model.

• The general linear model encompasses a wide variety of cases, including models that arise from experimental design, of which the model for the completely randomized design is one of the simplest examples. For these models the design matrix X has a rather different form than for the regression model in that each element of the matrix is either 0 or 1. The term “design matrix” is used because X expresses the structure of the underlying experimental design.

According to the method of least squares, we choose as our estimates of the vector of parameters β the vector $b = (b_0, b_1, \dots, b_k)'$ whose elements jointly minimize the error (or residual) sum of squares,

\[ L = (y - Xb)'(y - Xb), \]

i.e.,
\[ L = \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=0}^{k} x_{ij} b_j \Bigr)^2, \tag{3} \]

where $x_{i0} = 1$, i = 1, ..., n. The expression (3) is minimized by setting the partial derivatives with respect to each of the $b_r$, r = 0, ..., k, equal to zero. This yields the normal equations, a set of p = k + 1 simultaneous linear equations for the p unknowns $b_0, b_1, \dots, b_k$,
\[ \sum_{i=1}^{n} \sum_{j=0}^{k} x_{ir} x_{ij} b_j = \sum_{i=1}^{n} x_{ir} y_i, \qquad r = 0, \dots, k, \]
which may be written in matrix form as

\[ X'Xb = X'y. \tag{4} \]

Note that $X'X$ is a symmetric p × p matrix.

7.2 Rank and invertibility

The rank, rank(A), of a matrix A is the number of linearly independent columns of A.

Recall that our design matrix X is an n×p matrix with n ≥ p. It follows that rank(X) ≤ p. If rank(X) = p then X is said to be of full rank. It may be shown that

\[ \mathrm{rank}(X'X) = \mathrm{rank}(X). \tag{5} \]

A square matrix is said to be non-singular if it has an inverse. A p × p square matrix is non-singular if and only if it is of full rank p.

If $X'X$ is non-singular, which by the result (5) occurs if and only if X is of full rank p, then the normal equations (4) have a unique solution,

\[ b = (X'X)^{-1}X'y. \tag{6} \]

It will generally be the case for sensible regression models that the design matrix X is of full rank, but this is not necessarily always the case. To take an extreme example, if one of the regressor variables is a scaled version of another then the two corresponding columns of the matrix X are scalar multiples of each other and hence rank(X) < p. The normal equations do not then have a unique solution – the estimates of the parameters are not well-determined.

For a given set of data, assuming that X is of full rank p, the formal mathematical solution (6) of the normal equations (4) is translated in a statistical package such as S-PLUS into a numerical procedure for solving the normal equations.
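The notes use S-PLUS; purely as an illustration (invented data, numpy rather than the S-PLUS routines), the sketch below solves the normal equations (4) as a linear system and also via a general least squares routine, which avoids forming the inverse in (6) explicitly.

```python
import numpy as np

# Invented illustrative data (not from the notes).
x = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.0],
              [5.0, 4.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])
n, k = x.shape
X = np.column_stack([np.ones(n), x])   # n x p design matrix

# Textbook form (6): b = (X'X)^{-1} X'y, computed by solving X'Xb = X'y
# rather than by inverting X'X explicitly.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred route: least squares via an orthogonal decomposition.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(b_lstsq)      # the two agree (up to rounding) when X has full rank
```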

7.3 The general linear model for a randomized design

Consider, for simplicity, the case of a completely randomized design with two treatments and two observations per treatment. The corresponding model is

\[ y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2; \; j = 1, 2. \]

This can be written in matrix form as
\[
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \end{pmatrix}.
\]
This is the form of the general linear model with design matrix X given by
\[
X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}.
\]

Note that rank(X) = 2 < 3 = p, so that the design matrix is of less than full rank, which is a characteristic of the design matrices arising from designed experiments with the standard form of the model. For this design matrix,
\[
X'X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}.
\]

Here $X'X$ is singular with $\mathrm{rank}(X'X) = 2$. Hence the normal equations do not have a unique solution. A way of dealing with this, as we saw earlier, is to introduce a constraint on the treatment effects.
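The rank deficiency is easy to check numerically. The following is an added numpy illustration (not S-PLUS output) for the 4 × 3 design matrix above.

```python
import numpy as np

# Design matrix for the completely randomized design with two treatments
# and two observations per treatment (columns: mu, tau_1, tau_2).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

XtX = X.T @ X
print(XtX)
# [[4. 2. 2.]
#  [2. 2. 0.]
#  [2. 0. 2.]]

print(np.linalg.matrix_rank(X))     # 2 < p = 3
print(np.linalg.matrix_rank(XtX))   # 2, so X'X is singular
# np.linalg.solve(XtX, X.T @ y) would therefore fail with a singular-matrix error.
```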

7.4 The hat matrix

Return now to the case of multiple linear regression and assume that X is of full rank. The vector $\hat{y}$ of fitted values is given by

\[ \hat{y} = Xb = Hy, \tag{7} \]

where, using Equation (6), the hat matrix H is defined by
\[ H = X(X'X)^{-1}X'. \tag{8} \]
Note that H is a symmetric n × n matrix. The vector e of residuals is given by
\[ e = y - \hat{y} = (I - H)y, \tag{9} \]
where I is the n × n identity matrix.

Before going any further it is helpful to note some results about the matrices H and I − H. A matrix P is said to be idempotent if $P^2 = P$.

From Equation (8),
\[ H^2 = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H. \tag{10} \]
Thus H is idempotent. Furthermore, using Equation (10),
\[ (I - H)^2 = I^2 - 2H + H^2 = I - H. \tag{11} \]
Thus I − H is also an n × n symmetric idempotent matrix. Again using Equation (10),
\[ H(I - H) = O, \tag{12} \]
where O represents a matrix of zeros, in this case an n × n matrix.

Post-multiplying Equation (8) by X, we find that
\[ HX = X. \tag{13} \]
Hence
\[ (I - H)X = O, \tag{14} \]
where O again represents a matrix of zeros, but now an n × p matrix.

Recall that the trace of a matrix is the sum of its diagonal elements. Using the fact that tr(AB) = tr(BA),
\[ \mathrm{tr}(H) = \mathrm{tr}\bigl(X(X'X)^{-1}X'\bigr) = \mathrm{tr}\bigl((X'X)^{-1}X'X\bigr) = \mathrm{tr}(I_p) = p, \]
where $I_p$ is the p × p identity matrix. It follows that
\[ \mathrm{tr}(I - H) = \mathrm{tr}(I) - \mathrm{tr}(H) = n - p. \]
It turns out that the rank of a symmetric idempotent matrix is equal to its trace. Hence rank(H) = p and rank(I − H) = n − p.
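These properties of H are easy to verify numerically. The sketch below is an added illustration with invented data; it checks symmetry, idempotency, the traces and the ranks for a full-rank design matrix.

```python
import numpy as np

# Invented data: n = 6 observations, k = 2 regressors, p = 3.
rng = np.random.default_rng(0)
n, k = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = k + 1

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix, Equation (8)
I = np.eye(n)

print(np.allclose(H, H.T))               # True: H is symmetric
print(np.allclose(H @ H, H))             # True: H is idempotent
print(np.trace(H))                       # approximately p = 3
print(np.trace(I - H))                   # approximately n - p = 3
print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(I - H))   # 3 and 3
```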

7.5 A geometrical interpretation

There is a geometrical interpretation of the method of least squares. Recalling Equations (7) and (9), in the decomposition

\[ y = \hat{y} + e = Hy + (I - H)y, \tag{15} \]

the vector of fitted values $\hat{y}$ is the orthogonal projection of y onto the p-dimensional space spanned by the columns of X. The vector of residuals e is orthogonal to the columns of X, since, using Equation (9) and the transpose of Equation (14),

\[ X'e = X'(I - H)y = 0, \]

where 0 represents a p × 1 vector of zeros. Squaring the decomposition of Equation (15) and using Equation (12) to eliminate the cross-product term on the right hand side,

\[ y'y = \hat{y}'\hat{y} + e'e, \]

or, to use a different notation,

\[ \|y\|^2 = \|\hat{y}\|^2 + \|e\|^2. \]

This is a version of Pythagoras’ theorem and essentially the basis for the decomposition of the total sum of squares in the ANOVA for multiple regression. The second term on the right hand side is just the residual sum of squares, $\sum_{i=1}^{n} e_i^2$.
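As a numerical illustration of this decomposition (added here, using simulated data rather than any example from the notes), the two sides of the identity agree to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)    # simulated response

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                                   # fitted values, Equation (7)
e = y - y_hat                                   # residuals, Equation (9)

# Pythagoras: ||y||^2 = ||y_hat||^2 + ||e||^2
print(y @ y)
print(y_hat @ y_hat + e @ e)                    # equal up to rounding
print(np.isclose(y @ y, y_hat @ y_hat + e @ e)) # True
```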

7.6 Properties of the least squares estimator

In the linear model as specified in Equation (2), E(ε) = 0 and cov(ε) = σ²I. It follows that E(y) = Xβ and cov(y) = σ²I. We now consider the properties of the least squares estimator b as specified in Equation (6). (For the present, we do not need to use the normality assumption for the error distribution.)

\[ E(b) = E\bigl((X'X)^{-1}X'y\bigr) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta. \tag{16} \]

Thus b is an unbiased estimator of β. The covariance matrix of b is found as follows.

\[ \mathrm{cov}(b) = \mathrm{cov}\bigl((X'X)^{-1}X'y\bigr) = (X'X)^{-1}X'\,\mathrm{cov}(y)\,X(X'X)^{-1} = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}. \tag{17} \]

The Gauss-Markov Theorem

For any p × 1 vector a, $a'b$ is the unique minimum variance linear unbiased estimator of $a'\beta$.

Proof. Let $c'y$ be any linear unbiased estimator of $a'\beta$. It follows that, for all β, $E(c'y) = c'X\beta = a'\beta$. Hence it must be the case that $c'X = a'$. The variance of the estimator $c'y$ is given by
\[ \mathrm{var}(c'y) = c'\sigma^2 I c = \sigma^2 c'c. \]
Now consider the properties of the estimator $a'b$. Firstly, it is unbiased, since $E(a'b) = a'E(b) = a'\beta$, using Equation (16). Using Equation (17), the variance of $a'b$ is given by
\[ \mathrm{var}(a'b) = a'\,\mathrm{cov}(b)\,a = a'\sigma^2(X'X)^{-1}a = \sigma^2 c'X(X'X)^{-1}X'c = \sigma^2 c'Hc, \]
where H is the hat matrix as defined in Equation (8). Using the fact that I − H is symmetric and idempotent, it follows that
\[ \mathrm{var}(c'y) - \mathrm{var}(a'b) = \sigma^2 c'(I - H)c = \sigma^2 \bigl((I - H)c\bigr)'\bigl((I - H)c\bigr) = \sigma^2 \|(I - H)c\|^2 \ge 0, \]
with equality if and only if c = Hc. This is true if and only if, for all y,
\[ c'y = c'Hy = c'X(X'X)^{-1}X'y = a'(X'X)^{-1}X'y = a'b. \]

7.7 The estimation of the error variance

Lemma. Let y be an n × 1 vector of random variables and A an n × n matrix of constants. If E(y) = θ and cov(y) = Σ then
\[ E(y'Ay) = \mathrm{tr}(A\Sigma) + \theta'A\theta. \]

Proof.
\begin{align*}
E(y'Ay) &= E\bigl((y - \theta)'A(y - \theta) + 2\theta'Ay - \theta'A\theta\bigr) \\
&= E\bigl((y - \theta)'A(y - \theta)\bigr) + 2\theta'A E(y) - \theta'A\theta \\
&= \sum_i \sum_j a_{ij} E\bigl((y_i - \theta_i)(y_j - \theta_j)\bigr) + \theta'A\theta \\
&= \sum_i \sum_j a_{ij} \sigma_{ij} + \theta'A\theta \\
&= \mathrm{tr}(A\Sigma) + \theta'A\theta.
\end{align*}

Recalling Equations (9) and (11), we apply the result of the lemma to the residual sum of squares, $e'e = y'(I - H)y$, with A = I − H, θ = Xβ and Σ = σ²I. Thus

\[ E(e'e) = \sigma^2 \mathrm{tr}(I - H) + \beta'X'(I - H)X\beta. \]

From Section 7.4, tr(I − H) = n − p. Using Equation (14), the second term on the right hand side of the above equation is zero. Hence

\[ E(e'e) = (n - p)\sigma^2. \]

Thus an unbiased estimator of the error variance σ² is given by the residual mean square,
\[ s^2 \equiv \frac{e'e}{n - p}. \tag{18} \]
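A short added check of the estimator (18), using data simulated with a known σ (the values below are invented for the sketch): averaging the residual mean square over repeated samples gives a value close to σ².

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
p = k + 1
sigma = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])

# Average s^2 over repeated simulated samples; should be close to sigma^2 = 0.25.
s2_values = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2_values.append(e @ e / (n - p))           # residual mean square, Equation (18)

print(np.mean(s2_values))                        # approximately 0.25
```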

7.8 Fitted values, residuals and leverages

From the model of Equation (2) it follows that

\[ y \sim N_n(X\beta, \sigma^2 I). \]

From Equation (7),
\[ E(\hat{y}) = E(Hy) = H E(y) = HX\beta = X\beta, \]
using Equation (13). Furthermore,

\[ \mathrm{cov}(\hat{y}) = H\,\mathrm{cov}(y)\,H' = H\,\sigma^2 I\,H = \sigma^2 H, \]

using the property that H is idempotent. In particular, the variance of the i-th fitted value is given by
\[ \mathrm{var}(\hat{y}_i) = h_{ii}\sigma^2, \qquad i = 1, \dots, n, \]

where $h_{ii}$ is the i-th diagonal element of the hat matrix H and is known as the leverage of the i-th observation.

The vector of residuals $e = (e_1, e_2, \dots, e_n)'$, which may be calculated from the observed data using the formula of Equation (9), is used to investigate the validity of the assumptions about the vector of errors $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)'$, which is not itself observable. In terms of the multivariate Normal distribution, it was assumed that

\[ \epsilon \sim N_n(0, \sigma^2 I). \]

From Equation (9),

\[ E(e) = E\bigl((I - H)y\bigr) = (I - H)E(y) = (I - H)X\beta = 0, \]

using Equation (14). Furthermore,

\[ \mathrm{cov}(e) = \mathrm{cov}\bigl((I - H)y\bigr) = (I - H)\,\mathrm{cov}(y)\,(I - H)' = (I - H)\,\sigma^2 I\,(I - H) = \sigma^2(I - H), \]

using the property that I − H is idempotent. Hence, using the property of the normal distribution that a linear combination of normal variates is also normally distributed, under our assumptions about the error distribution,

\[ e_i \sim N\bigl(0, (1 - h_{ii})\sigma^2\bigr), \qquad i = 1, \dots, n. \]

It should be noted, however, that the residuals ei are not independently distributed.

\[ X'e = X'(I - H)y = 0, \]

using the transpose of Equation (14). Thus there are p linear combinations of the ei that are necessarily equal to zero.

The standardized residuals $d_i$ are defined by
\[ d_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \qquad i = 1, \dots, n. \]
If the assumptions of the regression model are correct, the standardized residuals are approximately NID(0, 1). Through various descriptive statistics and plots of the residuals, standardized residuals and fitted values, the data can be examined for outliers and for indications of departures from the assumptions about the error distribution.
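These quantities are straightforward to compute directly. The sketch below (an added illustration with invented data; it is not the S-PLUS functionality referred to later) takes the leverages from the diagonal of H and forms the standardized residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                          # leverages h_ii
y_hat = H @ y
e = y - y_hat
s2 = e @ e / (n - p)                    # residual mean square, Equation (18)

d = e / np.sqrt(s2 * (1.0 - h))         # standardized residuals d_i

print(h.sum())                          # equals p (here 3), since tr(H) = p
print(d)                                # roughly NID(0, 1) if the model is correct
```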

The leverages $h_{ii}$ satisfy
\[ \sum_{i=1}^{n} h_{ii} = \mathrm{tr}(H) = p, \]

as seen in Section 7.4. Thus the average value of the $h_{ii}$, i = 1, ..., n, is equal to p/n. More detailed analysis shows that
\[ \frac{1}{n} \le h_{ii} \le 1, \qquad i = 1, \dots, n. \]

The leverage hii, which depends only on the values of the regressor variables and not on the values of the response variable, may be viewed as a measure of the distance of the i-th observation from the sample mean vector of all n observations in the space of the regressor variables. If an individual leverage hii is large then it may be that the corresponding observation is influential in determining the estimated regression coefficients, that is, the removal of the observation from the data set results in drastic changes in the estimates of the regression coefficients. So observations with large leverage should be treated with caution.

There are functions in S-PLUS that allow the examination of residuals, leverages and other diagnostics from a linear regression. In general, an observation may be influential either when it has a large standardized residual (the response variable is an outlier) or when it has a large leverage (the vector of regressor variables is an outlier). When such observations occur, they should be investigated to see if there is any obvious explanation as to why they have occurred. The analysis may be repeated after the removal of some or all of these observations to check what the effect will be.

Another measure of influence is Cook’s distance, $D_i$, which is a measure of the squared distance between the least squares estimate b based on all n points in the sample and the least squares estimate $b_{(i)}$ obtained when the i-th point is deleted. It turns out that $D_i$ may be calculated as

\[ D_i = \frac{d_i^2}{p}\,\frac{\mathrm{var}(\hat{y}_i)}{\mathrm{var}(e_i)} = \frac{d_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}, \qquad i = 1, \dots, n. \]

The above formula shows that Cook’s distance $D_i$ may take a large value if either the standardized residual $d_i$ or the leverage $h_{ii}$ is large.
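Continuing the same kind of added illustration (invented data, not an example from the notes), the sketch below computes Cook's distances from the formula above and confirms that they agree with the distance between b and the leave-one-out estimate $b_{(i)}$, measured in the X'X metric and scaled by p s².

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
d = e / np.sqrt(s2 * (1.0 - h))

# Cook's distance from the formula above.
D = (d**2 / p) * (h / (1.0 - h))

# Check against the definition: squared distance between b and b_(i) in the
# X'X metric, scaled by p * s^2, refitting with the i-th point deleted.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
D_check = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    diff = b - b_i
    D_check[i] = diff @ (X.T @ X) @ diff / (p * s2)

print(np.allclose(D, D_check))          # True
```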
