
MAS/MASOR/MASSM - Statistical Analysis - Autumn Term

7 Multiple Linear Regression

7.1 The multiple linear regression model

In multiple linear regression, a response variable y is expressed as a linear function of k regressor (or predictor or explanatory) variables, $x_1, x_2, \dots, x_k$, with corresponding model of the form
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon . \]

Suppose that n observations have been made of the variables (where n ≥ k + 1), with $x_{ij}$ the i-th observed value of the j-th regressor variable, corresponding to the i-th observed value $y_i$ of the response variable. Thus the data could be tabulated in the following form of a data matrix.

Observation                    Variable
  number        y      x1      x2     ...     xk
     1          y1     x11     x12    ...     x1k
     2          y2     x21     x22    ...     x2k
     :          :       :       :              :
     n          yn     xn1     xn2    ...     xnk

The multiple linear regression model is

\[ y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, n, \tag{1} \]

where the $x_{ij}$, i = 1, ..., n, j = 1, ..., k, are regarded as fixed, $\beta_0, \beta_1, \dots, \beta_k$ are unknown parameters and the errors $\epsilon_i$, i = 1, ..., n, are assumed to be NID(0, σ²), with σ² unknown.

The model (1) may be written in matrix notation as
\[ y = X\beta + \epsilon, \tag{2} \]
where y is an n × 1 vector of observations,

\[ y = (y_1, y_2, \dots, y_n)', \]
β is a p × 1 vector of parameters, where p = k + 1,

\[ \beta = (\beta_0, \beta_1, \dots, \beta_k)', \]
ε is an n × 1 vector of errors,
\[ \epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)', \]
and X is an n × p matrix, the design matrix,
\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1k} \\
1 & x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix}.
\]
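As a concrete illustration of this notation (an added sketch, not part of the original notes, using invented regressor values), the following numpy snippet builds a design matrix of the above form by prepending a column of ones to the matrix of regressor values.

```python
import numpy as np

# Invented data: n = 5 observations on k = 2 regressor variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.0],
              [5.0, 4.0]])   # n x k matrix of regressor values x_ij
n, k = x.shape

# Design matrix X: a leading column of ones (for beta_0) followed by the regressors.
X = np.column_stack([np.ones(n), x])   # n x p, with p = k + 1
p = k + 1

print(X.shape)                      # (5, 3)
print(np.linalg.matrix_rank(X))     # 3, i.e. full rank p
```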

Equation (2) expresses the regression model in the form of what is known as the general linear model.

• The general linear model encompasses a wide variety of cases, including models that arise from experimental design, of which the model for the completely randomized design is one of the simplest examples. For these models the design matrix X has a rather different form than for the regression model in that each element of the matrix is either 0 or 1. The term “design matrix” is used because X expresses the structure of the underlying experimental design.

According to the method of least squares, we choose as our estimates of the vector of parameters β the vector $b = (b_0, b_1, \dots, b_k)'$ whose elements jointly minimize the error (or residual) sum of squares,

\[ L = (y - Xb)'(y - Xb), \]

i.e.,
\[ L = \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=0}^{k} x_{ij} b_j \Bigr)^2, \tag{3} \]

where $x_{i0} = 1$, i = 1, ..., n. The expression (3) is minimized by setting the partial derivatives with respect to each of the $b_r$, r = 0, ..., k, equal to zero. This yields the normal equations, a set of p = k + 1 simultaneous linear equations for the p unknowns $b_0, b_1, \dots, b_k$,
\[ \sum_{i=1}^{n} \sum_{j=0}^{k} x_{ir} x_{ij} b_j = \sum_{i=1}^{n} x_{ir} y_i, \qquad r = 0, \dots, k, \]
which may be written in matrix form as

\[ X'Xb = X'y. \tag{4} \]

Note that $X'X$ is a symmetric p × p matrix.

7.2 Rank and invertibility

The rank, rank(A), of a matrix A is the number of linearly independent columns of A.

Recall that our design matrix X is an n×p matrix with n ≥ p. It follows that rank(X) ≤ p. If rank(X) = p then X is said to be of full rank. It may be shown that

\[ \mathrm{rank}(X'X) = \mathrm{rank}(X). \tag{5} \]

A square matrix is said to be non-singular if it has an inverse. A p × p square matrix is non-singular if and only if it is of full rank p.

If $X'X$ is non-singular, which by the result (5) occurs if and only if X is of full rank p, then the normal equations (4) have a unique solution,

\[ b = (X'X)^{-1}X'y. \tag{6} \]

It will generally be the case for sensible regression models that the design matrix X is of full rank, but this is not necessarily always the case. To take an extreme example, if one of the regressor variables is a scaled version of another then the two corresponding columns of the matrix X are scalar multiples of each other and hence rank(X) < p. The normal equations do not then have a unique solution – the estimates of the parameters are not well-determined.

For a given set of data, assuming that X is of full rank p, the formal mathematical solution (6) of the normal equations (4) is translated in a statistical package such as S-PLUS into a numerical procedure for solving the normal equations.
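The notes use S-PLUS; purely as an illustration (invented data, numpy rather than the S-PLUS routines), the sketch below solves the normal equations (4) as a linear system and also via a general least squares routine, which avoids forming the inverse in (6) explicitly.

```python
import numpy as np

# Invented illustrative data (not from the notes).
x = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 2.0],
              [5.0, 4.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])
n, k = x.shape
X = np.column_stack([np.ones(n), x])   # n x p design matrix

# Textbook form (6): b = (X'X)^{-1} X'y, computed by solving X'Xb = X'y
# rather than by inverting X'X explicitly.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred route: least squares via an orthogonal decomposition.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal)
print(b_lstsq)      # the two agree (up to rounding) when X has full rank
```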

7.3 The general linear model for a randomized design

Consider, for simplicity, the case of a completely randomized design with two treatments and two observations per treatment. The corresponding model is

\[ y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad i = 1, 2; \; j = 1, 2. \]

This can be written in matrix form as
\[
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \mu \\ \tau_1 \\ \tau_2 \end{pmatrix}
+
\begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{21} \\ \epsilon_{22} \end{pmatrix}.
\]
This is the form of the general linear model with design matrix X given by
\[
X = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}.
\]

Note that rank(X) = 2 < 3 = p, so that the design matrix is of less than full rank, which is a characteristic of the design matrices arising from designed experiments with the standard form of the model. For this design matrix,
\[
X'X = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 2 & 0 \\ 2 & 0 & 2 \end{pmatrix}.
\]

Here $X'X$ is singular with $\mathrm{rank}(X'X) = 2$. Hence the normal equations do not have a unique solution. A way of dealing with this, as we saw earlier, is to introduce a constraint on the treatment effects.
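The rank deficiency is easy to check numerically. The following is an added numpy illustration (not S-PLUS output) for the 4 × 3 design matrix above.

```python
import numpy as np

# Design matrix for the completely randomized design with two treatments
# and two observations per treatment (columns: mu, tau_1, tau_2).
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

XtX = X.T @ X
print(XtX)
# [[4. 2. 2.]
#  [2. 2. 0.]
#  [2. 0. 2.]]

print(np.linalg.matrix_rank(X))     # 2 < p = 3
print(np.linalg.matrix_rank(XtX))   # 2, so X'X is singular
# np.linalg.solve(XtX, X.T @ y) would therefore fail with a singular-matrix error.
```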

7.4 The hat matrix

Return now to the case of multiple linear regression and assume that X is of full rank. The vector $\hat{y}$ of fitted values is given by

\[ \hat{y} = Xb = Hy, \tag{7} \]

where, using Equation (6), the hat matrix H is defined by
\[ H = X(X'X)^{-1}X'. \tag{8} \]
Note that H is a symmetric n × n matrix. The vector e of residuals is given by
\[ e = y - \hat{y} = (I - H)y, \tag{9} \]
where I is the n × n identity matrix.

Before going any further it is helpful to note some results about the matrices H and I − H. A matrix P is said to be idempotent if $P^2 = P$.

From Equation (8),
\[ H^2 = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H. \tag{10} \]
Thus H is idempotent. Furthermore, using Equation (10),
\[ (I - H)^2 = I^2 - 2H + H^2 = I - H. \tag{11} \]
Thus I − H is also an n × n symmetric idempotent matrix. Again using Equation (10),
\[ H(I - H) = O, \tag{12} \]
where O represents a matrix of zeros, in this case an n × n matrix.

Post-multiplying Equation (8) by X, we find that
\[ HX = X. \tag{13} \]
Hence
\[ (I - H)X = O, \tag{14} \]
where O again represents a matrix of zeros, but now an n × p matrix.

Recall that the trace of a matrix is the sum of its diagonal elements. Using the fact that tr(AB) = tr(BA),
\[ \mathrm{tr}(H) = \mathrm{tr}\bigl(X(X'X)^{-1}X'\bigr) = \mathrm{tr}\bigl((X'X)^{-1}X'X\bigr) = \mathrm{tr}(I_p) = p, \]
where $I_p$ is the p × p identity matrix. It follows that
\[ \mathrm{tr}(I - H) = \mathrm{tr}(I) - \mathrm{tr}(H) = n - p. \]
It turns out that the rank of a symmetric idempotent matrix is equal to its trace. Hence rank(H) = p and rank(I − H) = n − p.
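These properties of H are easy to verify numerically. The sketch below is an added illustration with invented data; it checks symmetry, idempotency, the traces and the ranks for a full-rank design matrix.

```python
import numpy as np

# Invented data: n = 6 observations, k = 2 regressors, p = 3.
rng = np.random.default_rng(0)
n, k = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
p = k + 1

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix, Equation (8)
I = np.eye(n)

print(np.allclose(H, H.T))               # True: H is symmetric
print(np.allclose(H @ H, H))             # True: H is idempotent
print(np.trace(H))                       # approximately p = 3
print(np.trace(I - H))                   # approximately n - p = 3
print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(I - H))   # 3 and 3
```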

7.5 A geometrical interpretation

There is a geometrical interpretation of the method of least squares. Recalling Equations (7) and (9), in the decomposition

\[ y = \hat{y} + e = Hy + (I - H)y, \tag{15} \]

the vector of fitted values $\hat{y}$ is the orthogonal projection of y onto the p-dimensional space spanned by the columns of X. The vector of residuals e is orthogonal to the columns of X, since, using Equation (9) and the transpose of Equation (14),

\[ X'e = X'(I - H)y = 0, \]

where 0 represents a p × 1 vector of zeros. Squaring the decomposition of Equation (15) and using Equation (12) to eliminate the cross-product term on the right hand side,

\[ y'y = \hat{y}'\hat{y} + e'e, \]

or, to use a different notation,

\[ \|y\|^2 = \|\hat{y}\|^2 + \|e\|^2. \]

This is a version of Pythagoras’ theorem and essentially the basis for the decomposition of the total sum of squares in the ANOVA for multiple regression. The second term on the right hand side is just the residual sum of squares, $\sum_{i=1}^{n} e_i^2$.
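As a numerical illustration of this decomposition (added here, using simulated data rather than any example from the notes), the two sides of the identity agree to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)    # simulated response

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y                                   # fitted values, Equation (7)
e = y - y_hat                                   # residuals, Equation (9)

# Pythagoras: ||y||^2 = ||y_hat||^2 + ||e||^2
print(y @ y)
print(y_hat @ y_hat + e @ e)                    # equal up to rounding
print(np.isclose(y @ y, y_hat @ y_hat + e @ e)) # True
```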

7.6 Properties of the least squares estimator

In the linear model as specified in Equation (2), E(ε) = 0 and cov(ε) = σ²I. It follows that E(y) = Xβ and cov(y) = σ²I. We now consider the properties of the least squares estimator b as specified in Equation (6). (For the present, we do not need to use the normality assumption for the error distribution.)

\[ E(b) = E\bigl((X'X)^{-1}X'y\bigr) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta. \tag{16} \]

Thus b is an unbiased estimator of β. The covariance matrix of b is found as follows.

\[ \mathrm{cov}(b) = \mathrm{cov}\bigl((X'X)^{-1}X'y\bigr) = (X'X)^{-1}X'\,\mathrm{cov}(y)\,X(X'X)^{-1} = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}. \tag{17} \]

The Gauss-Markov Theorem

For any p × 1 vector a, $a'b$ is the unique minimum variance linear unbiased estimator of $a'\beta$.

Proof. Let $c'y$ be any linear unbiased estimator of $a'\beta$. It follows that, for all β, $E(c'y) = c'X\beta = a'\beta$. Hence it must be the case that $c'X = a'$. The variance of the estimator $c'y$ is given by
\[ \mathrm{var}(c'y) = c'\sigma^2 I c = \sigma^2 c'c. \]
Now consider the properties of the estimator $a'b$. Firstly, it is unbiased, since $E(a'b) = a'E(b) = a'\beta$, using Equation (16). Using Equation (17), the variance of $a'b$ is given by
\[ \mathrm{var}(a'b) = a'\,\mathrm{cov}(b)\,a = a'\sigma^2(X'X)^{-1}a = \sigma^2 c'X(X'X)^{-1}X'c = \sigma^2 c'Hc, \]
where H is the hat matrix as defined in Equation (8). Using the fact that I − H is symmetric and idempotent, it follows that
\[ \mathrm{var}(c'y) - \mathrm{var}(a'b) = \sigma^2 c'(I - H)c = \sigma^2 \bigl((I - H)c\bigr)'\bigl((I - H)c\bigr) = \sigma^2 \|(I - H)c\|^2 \ge 0, \]
with equality if and only if c = Hc. This is true if and only if, for all y,
\[ c'y = c'Hy = c'X(X'X)^{-1}X'y = a'(X'X)^{-1}X'y = a'b. \]

7.7 The estimation of the error variance

Lemma. Let y be an n × 1 vector of random variables and A an n × n matrix of constants. If E(y) = θ and cov(y) = Σ then
\[ E(y'Ay) = \mathrm{tr}(A\Sigma) + \theta'A\theta. \]

Proof.
\begin{align*}
E(y'Ay) &= E\bigl((y - \theta)'A(y - \theta) + 2\theta'Ay - \theta'A\theta\bigr) \\
&= E\bigl((y - \theta)'A(y - \theta)\bigr) + 2\theta'A E(y) - \theta'A\theta \\
&= \sum_i \sum_j a_{ij} E\bigl((y_i - \theta_i)(y_j - \theta_j)\bigr) + \theta'A\theta \\
&= \sum_i \sum_j a_{ij} \sigma_{ij} + \theta'A\theta \\
&= \mathrm{tr}(A\Sigma) + \theta'A\theta.
\end{align*}

Recalling Equations (9) and (11), we apply the result of the lemma to the residual sum of squares, $e'e = y'(I - H)y$, with A = I − H, θ = Xβ and Σ = σ²I. Thus

\[ E(e'e) = \sigma^2 \mathrm{tr}(I - H) + \beta'X'(I - H)X\beta. \]

From Section 7.4, tr(I − H) = n − p. Using Equation (14), the second term on the right hand side of the above equation is zero. Hence

\[ E(e'e) = (n - p)\sigma^2. \]

Thus an unbiased estimator of the error variance σ² is given by the residual mean square,
\[ s^2 \equiv \frac{e'e}{n - p}. \tag{18} \]
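A short added check of the estimator (18), using data simulated with a known σ (the values below are invented for the sketch): averaging the residual mean square over repeated samples gives a value close to σ².

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 2
p = k + 1
sigma = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])

# Average s^2 over repeated simulated samples; should be close to sigma^2 = 0.25.
s2_values = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2_values.append(e @ e / (n - p))           # residual mean square, Equation (18)

print(np.mean(s2_values))                        # approximately 0.25
```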

7.8 Fitted values, residuals and leverages

From the model of Equation (2) it follows that

\[ y \sim N_n(X\beta, \sigma^2 I). \]

From Equation (7),
\[ E(\hat{y}) = E(Hy) = H E(y) = HX\beta = X\beta, \]
using Equation (13). Furthermore,

\[ \mathrm{cov}(\hat{y}) = H\,\mathrm{cov}(y)\,H' = H\,\sigma^2 I\,H = \sigma^2 H, \]

using the property that H is idempotent. In particular, the variance of the i-th fitted value is given by
\[ \mathrm{var}(\hat{y}_i) = h_{ii}\sigma^2, \qquad i = 1, \dots, n, \]

where $h_{ii}$ is the i-th diagonal element of the hat matrix H and is known as the leverage of the i-th observation.

The vector of residuals $e = (e_1, e_2, \dots, e_n)'$, which may be calculated from the observed data using the formula of Equation (9), is used to investigate the validity of the assumptions about the vector of errors $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_n)'$, which is not itself observable. In terms of the multivariate Normal distribution, it was assumed that

\[ \epsilon \sim N_n(0, \sigma^2 I). \]

From Equation (9),

\[ E(e) = E\bigl((I - H)y\bigr) = (I - H)E(y) = (I - H)X\beta = 0, \]

using Equation (14). Furthermore,

\[ \mathrm{cov}(e) = \mathrm{cov}\bigl((I - H)y\bigr) = (I - H)\,\mathrm{cov}(y)\,(I - H)' = (I - H)\,\sigma^2 I\,(I - H) = \sigma^2(I - H), \]

using the property that I − H is idempotent. Hence, using the property of the normal distribution that a linear combination of normal variates is also normally distributed, under our assumptions about the error distribution,

\[ e_i \sim N\bigl(0, (1 - h_{ii})\sigma^2\bigr), \qquad i = 1, \dots, n. \]

It should be noted, however, that the residuals ei are not independently distributed.

\[ X'e = X'(I - H)y = 0, \]

using the transpose of Equation (14). Thus there are p linear combinations of the ei that are necessarily equal to zero.

The standardized residuals $d_i$ are defined by
\[ d_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \qquad i = 1, \dots, n. \]
If the assumptions of the regression model are correct, the standardized residuals are approximately NID(0, 1). Through various descriptive statistics and plots of the residuals, standardized residuals and fitted values, the data can be examined for outliers and for indications of departures from the assumptions about the error distribution.
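These quantities are straightforward to compute directly. The sketch below (an added illustration with invented data; it is not the S-PLUS functionality referred to later) takes the leverages from the diagonal of H and forms the standardized residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                          # leverages h_ii
y_hat = H @ y
e = y - y_hat
s2 = e @ e / (n - p)                    # residual mean square, Equation (18)

d = e / np.sqrt(s2 * (1.0 - h))         # standardized residuals d_i

print(h.sum())                          # equals p (here 3), since tr(H) = p
print(d)                                # roughly NID(0, 1) if the model is correct
```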

The leverages $h_{ii}$ satisfy
\[ \sum_{i=1}^{n} h_{ii} = \mathrm{tr}(H) = p, \]

as seen in Section 7.4. Thus the average value of the $h_{ii}$, i = 1, ..., n, is equal to p/n. More detailed analysis shows that
\[ \frac{1}{n} \le h_{ii} \le 1, \qquad i = 1, \dots, n. \]

The leverage hii, which depends only on the values of the regressor variables and not on the values of the response variable, may be viewed as a measure of the distance of the i-th observation from the sample mean vector of all n observations in the space of the regressor variables. If an individual leverage hii is large then it may be that the corresponding observation is influential in determining the estimated regression coefficients, that is, the removal of the observation from the data set results in drastic changes in the estimates of the regression coefficients. So observations with large leverage should be treated with caution.

There are functions in S-PLUS that allow the examination of residuals, leverages and other diagnostics from a linear regression. In general, an observation may be influential either when it has a large standardized residual (the response variable is an outlier) or when it has a large leverage (the vector of regressor variables is an outlier). When such observations occur, they should be investigated to see if there is any obvious explanation as to why they have occurred. The analysis may be repeated after the removal of some or all of these observations to check what the effect will be.

Another measure of influence is Cook’s distance, $D_i$, which is a measure of the squared distance between the least squares estimate b based on all n points in the sample and the least squares estimate $b_{(i)}$ obtained when the i-th point is deleted. It turns out that $D_i$ may be calculated as

\[ D_i = \frac{d_i^2}{p}\,\frac{\mathrm{var}(\hat{y}_i)}{\mathrm{var}(e_i)} = \frac{d_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}, \qquad i = 1, \dots, n. \]

The above formula shows that Cook’s distance $D_i$ may take a large value if either the standardized residual $d_i$ or the leverage $h_{ii}$ is large.
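Continuing the same kind of added illustration (invented data, not an example from the notes), the sketch below computes Cook's distances from the formula above and confirms that they agree with the distance between b and the leave-one-out estimate $b_{(i)}$, measured in the X'X metric and scaled by p s².

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p)
d = e / np.sqrt(s2 * (1.0 - h))

# Cook's distance from the formula above.
D = (d**2 / p) * (h / (1.0 - h))

# Check against the definition: squared distance between b and b_(i) in the
# X'X metric, scaled by p * s^2, refitting with the i-th point deleted.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
D_check = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    diff = b - b_i
    D_check[i] = diff @ (X.T @ X) @ diff / (p * s2)

print(np.allclose(D, D_check))          # True
```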
