Chapter 3 Multiple Linear Regression Model

We consider the problem of regression when the study variable depends on more than one explanatory (or independent) variable, called a multiple linear regression model. This model generalizes the simple linear regression in two ways: it allows the mean function $E(y)$ to depend on more than one explanatory variable, and it allows the mean function to have shapes other than straight lines, although it does not allow for arbitrary shapes.
The linear model: Let $y$ denote the dependent (or study) variable that is linearly related to $k$ independent (or explanatory) variables $X_1, X_2, \ldots, X_k$ through the parameters $\beta_1, \beta_2, \ldots, \beta_k$, and we write

$y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon.$

This is called the multiple linear regression model. The parameters $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients associated with $X_1, X_2, \ldots, X_k$ respectively, and $\varepsilon$ is the random error component reflecting the difference between the observed and fitted linear relationship. There can be various reasons for such a difference, e.g., the joint effect of variables not included in the model, random factors which cannot be accounted for in the model, etc.
Note that the $j$th regression coefficient $\beta_j$ represents the expected change in $y$ per unit change in the $j$th independent variable $X_j$. Assuming $E(\varepsilon) = 0$,

$\beta_j = \dfrac{\partial E(y)}{\partial X_j}.$
Linear model: A model is said to be linear when it is linear in the parameters. In such a case $\dfrac{\partial y}{\partial \beta_j}$ (or equivalently $\dfrac{\partial E(y)}{\partial \beta_j}$) should not depend on any $\beta$'s. For example,

i) $y = \beta_0 + \beta_1 X$ is a linear model as it is linear in the parameters.
ii) $y = \beta_0 X^{\beta_1}$ can be written as

$\log y = \log \beta_0 + \beta_1 \log X$
$y^* = \beta_0^* + \beta_1 x^*$

which is linear in the parameters $\beta_0^*$ and $\beta_1$, but nonlinear in the variables $y^* = \log y$, $x^* = \log x$. So it is a linear model.

Regression Analysis | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur

iii) $y = \beta_0 + \beta_1 X + \beta_2 X^2$ is linear in the parameters $\beta_0, \beta_1$ and $\beta_2$ but nonlinear in the variable $X$. So it is a linear model.
iv) $y = \beta_0 + \beta_1 X^{\beta_2}$ is nonlinear in the parameters and variables both. So it is a nonlinear model.
v) $y = \beta_0 + \beta_1^2 X^2$ is nonlinear in the parameters and variables both. So it is a nonlinear model.
vi) $y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$ is a cubic polynomial model which can be written as

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

with $X_1 = X$, $X_2 = X^2$, $X_3 = X^3$, which is linear in the parameters $\beta_0, \beta_1, \beta_2, \beta_3$ and linear in the variables $X_1, X_2, X_3$. So it is a linear model.
Example: The income and education of a person are related. It is expected that, on average, a higher level of education provides a higher income. So a simple linear regression model can be expressed as

$\text{income} = \beta_0 + \beta_1\,\text{education} + \varepsilon.$

Note that $\beta_1$ reflects the change in income with respect to a per unit change in education, and $\beta_0$ reflects the income when education is zero, as it is expected that even an illiterate person can also have some income.
Further, this model neglects that most people have higher income when they are older than when they are young, regardless of education. So $\beta_1$ will over-state the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is

$\text{income} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{age} + \varepsilon.$

Often it is observed that income tends to rise less rapidly in the later earning years than in the early years. To accommodate such a possibility, we might extend the model to

$\text{income} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \varepsilon.$

This is how we proceed with regression modelling in real-life situations. One needs to consider the experimental conditions and the phenomenon before deciding how many, why and how to choose the dependent and independent variables.
Model set up: Let an experiment be conducted $n$ times, and the data be obtained as follows:

Observation number | Response $y$ | Explanatory variables $X_1, X_2, \ldots, X_k$
1 | $y_1$ | $x_{11}, x_{12}, \ldots, x_{1k}$
2 | $y_2$ | $x_{21}, x_{22}, \ldots, x_{2k}$
$\vdots$ | $\vdots$ | $\vdots$
$n$ | $y_n$ | $x_{n1}, x_{n2}, \ldots, x_{nk}$

Assuming that the model is

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon,$

the $n$-tuples of observations are also assumed to follow the same model. Thus they satisfy

$y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1$
$y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k} + \varepsilon_2$
$\vdots$
$y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n.$

These $n$ equations can be written as

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

or $y = X\beta + \varepsilon$. In general, the model with $k$ explanatory variables can be expressed as

$y = X\beta + \varepsilon$

where $y = (y_1, y_2, \ldots, y_n)'$ is an $n \times 1$ vector of $n$ observations on the study variable,

$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$

is an $n \times k$ matrix of $n$ observations on each of the $k$ explanatory variables, $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$ is a $k \times 1$ vector of regression coefficients and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$ is an $n \times 1$ vector of random error components or disturbance terms.
If the intercept term is present, take the first column of $X$ to be $(1, 1, \ldots, 1)'$.
Assumptions in multiple linear regression model

Some assumptions are needed in the model $y = X\beta + \varepsilon$ for drawing statistical inferences. The following assumptions are made:
(i) $E(\varepsilon) = 0$
(ii) $E(\varepsilon\varepsilon') = \sigma^2 I_n$
(iii) $\mathrm{Rank}(X) = k$
(iv) $X$ is a non-stochastic matrix
(v) $\varepsilon \sim N(0, \sigma^2 I_n)$.
These assumptions are used to study the statistical properties of the estimators of the regression coefficients. The following assumption is required to study, in particular, the large sample properties of the estimators.
(vi) $\lim_{n \to \infty} \dfrac{X'X}{n}$ exists and is a non-stochastic and nonsingular matrix (with finite elements).
The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless stated separately.
We consider the problems of estimation and testing of hypotheses on the regression coefficient vector under the stated assumptions.
Estimation of parameters: A general procedure for the estimation of the regression coefficient vector is to minimize

$M(\varepsilon) = \sum_{i=1}^{n} M(\varepsilon_i) = \sum_{i=1}^{n} M(y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik})$

for a suitably chosen function $M$. Some examples of the choice of $M$ are:

$M(x) = |x|$
$M(x) = x^2$
$M(x) = |x|^p$, in general.

We consider the principle of least squares, which corresponds to $M(x) = x^2$, and the method of maximum likelihood estimation for the estimation of the parameters.
Principle of ordinary least squares (OLS)

Let $B$ be the set of all possible vectors $\beta$. If there is no further information, then $B$ is the $k$-dimensional real Euclidean space. The objective is to find a vector $b = (b_1, b_2, \ldots, b_k)'$ from $B$ that minimizes the sum of squared deviations of the $\varepsilon_i$'s, i.e.,

$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)$

for given $y$ and $X$. A minimum will always exist, as $S(\beta)$ is a real-valued, convex and differentiable function. Write

$S(\beta) = y'y + \beta'X'X\beta - 2\beta'X'y.$

Differentiating $S(\beta)$ with respect to $\beta$,

$\dfrac{\partial S(\beta)}{\partial \beta} = 2X'X\beta - 2X'y$
$\dfrac{\partial^2 S(\beta)}{\partial \beta\, \partial \beta'} = 2X'X$ (at least non-negative definite).

The normal equation is

$\dfrac{\partial S(\beta)}{\partial \beta} = 0 \;\Rightarrow\; X'Xb = X'y,$

where the following result is used:

Result: If $f(z) = Z'AZ$ is a quadratic form, $Z$ is an $m \times 1$ vector and $A$ is any $m \times m$ symmetric matrix, then $\dfrac{\partial f(z)}{\partial z} = 2Az$.

Since it is assumed that $\mathrm{rank}(X) = k$ (full rank), $X'X$ is positive definite and the unique solution of the normal equation is

$b = (X'X)^{-1}X'y,$

which is termed the ordinary least squares estimator (OLSE) of $\beta$. Since $\dfrac{\partial^2 S(\beta)}{\partial \beta\, \partial \beta'}$ is at least non-negative definite, $b$ minimizes $S(\beta)$.
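As a numerical illustration of the normal equations (not part of the original notes), here is a minimal NumPy sketch on a small hypothetical data set; `np.linalg.lstsq` is used only as a cross-check of the closed-form solution.

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 3 columns
# (first column of ones for the intercept, per the model set-up above).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

# Normal equations X'Xb = X'y; since rank(X) = k, X'X is positive
# definite and the unique solution is b = (X'X)^{-1} X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate via a numerically preferred least-squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice one prefers `lstsq` (or a QR decomposition) over explicitly forming $X'X$, since the normal equations square the condition number of $X$.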
In case $X$ is not of full rank, then

$b = (X'X)^{-}X'y + \left[I - (X'X)^{-}X'X\right]\omega,$

where $(X'X)^{-}$ is the generalized inverse of $X'X$ and $\omega$ is an arbitrary vector. The generalized inverse $(X'X)^{-}$ of $X'X$ satisfies

$X'X(X'X)^{-}X'X = X'X$
$X(X'X)^{-}X'X = X$
$X'X(X'X)^{-}X' = X'.$

Theorem:
(i) Let $\hat{y} = Xb$ be the empirical predictor of $y$. Then $\hat{y}$ has the same value for all solutions $b$ of $X'Xb = X'y$.
(ii) $S(\beta)$ attains the minimum for any solution of $X'Xb = X'y$.

Proof:
(i) Let $b$ be any member in

$b = (X'X)^{-}X'y + \left[I - (X'X)^{-}X'X\right]\omega.$

Since $X(X'X)^{-}X'X = X$, we have

$Xb = X(X'X)^{-}X'y + X\left[I - (X'X)^{-}X'X\right]\omega = X(X'X)^{-}X'y,$

which is independent of $\omega$. This implies that $\hat{y}$ has the same value for all solutions $b$ of $X'Xb = X'y$.

(ii) Note that for any $\beta$,

$S(\beta) = \left[(y - Xb) + (Xb - X\beta)\right]'\left[(y - Xb) + (Xb - X\beta)\right]$
$\phantom{S(\beta)} = (y - Xb)'(y - Xb) + (b - \beta)'X'X(b - \beta) + 2(b - \beta)'X'(y - Xb)$
$\phantom{S(\beta)} = (y - Xb)'(y - Xb) + (b - \beta)'X'X(b - \beta)$ (using $X'Xb = X'y$)
$\phantom{S(\beta)} \ge (y - Xb)'(y - Xb) = S(b)$

and

$S(b) = y'y - 2y'Xb + b'X'Xb = y'y - b'X'Xb = y'y - \hat{y}'\hat{y}.$
Fitted values:

If $\hat{\beta}$ is any estimator of $\beta$ for the model $y = X\beta + \varepsilon$, then the fitted values are defined as $\hat{y} = X\hat{\beta}$.

In the case of $\hat{\beta} = b$,

$\hat{y} = Xb = X(X'X)^{-1}X'y = Hy,$

where $H = X(X'X)^{-1}X'$ is termed the hat matrix, which is
(i) symmetric,
(ii) idempotent (i.e., $HH = H$), and
(iii) $\mathrm{tr}\,H = \mathrm{tr}\,X(X'X)^{-1}X' = \mathrm{tr}\,X'X(X'X)^{-1} = \mathrm{tr}\,I_k = k.$
Residuals

The difference between the observed and fitted values of the study variable is called a residual. It is denoted as

$e = y - \hat{y} = y - Xb = y - Hy = (I - H)y = \bar{H}y,$

where $\bar{H} = I - H$. Note that
(i) $\bar{H}$ is a symmetric matrix,
(ii) $\bar{H}$ is an idempotent matrix, i.e., $\bar{H}\bar{H} = (I - H)(I - H) = (I - H) = \bar{H}$, and
(iii) $\mathrm{tr}\,\bar{H} = \mathrm{tr}\,I_n - \mathrm{tr}\,H = n - k.$
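The properties of $H$ and $\bar{H} = I - H$ above can be checked numerically. The sketch below (illustrative, with arbitrary simulated data) verifies symmetry, idempotency and the two trace identities.

```python
import numpy as np

# Illustrative data (hypothetical); any full-column-rank X works.
rng = np.random.default_rng(0)
n, k = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X(X'X)^{-1}X'
H_bar = np.eye(n) - H                     # residual-maker I - H

e = H_bar @ y                             # residuals e = (I - H)y
```

Note also that $\bar{H}X = 0$, which is the algebraic reason the residuals are orthogonal to every column of $X$.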
Properties of OLSE

(i) Estimation error: The estimation error of $b$ is

$b - \beta = (X'X)^{-1}X'y - \beta = (X'X)^{-1}X'(X\beta + \varepsilon) - \beta = (X'X)^{-1}X'\varepsilon.$

(ii) Bias: Since $X$ is assumed to be non-stochastic and $E(\varepsilon) = 0$,

$E(b - \beta) = (X'X)^{-1}X'E(\varepsilon) = 0.$

Thus the OLSE is an unbiased estimator of $\beta$.

(iii) Covariance matrix: The covariance matrix of $b$ is

$V(b) = E(b - \beta)(b - \beta)'$
$\phantom{V(b)} = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right]$
$\phantom{V(b)} = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1}$
$\phantom{V(b)} = \sigma^2 (X'X)^{-1}X'IX(X'X)^{-1}$
$\phantom{V(b)} = \sigma^2 (X'X)^{-1}.$

(iv) Variance: The variance of $b$ can be obtained as the sum of the variances of $b_1, b_2, \ldots, b_k$, which is the trace of the covariance matrix of $b$. Thus

$\mathrm{Var}(b) = \mathrm{tr}\,V(b) = \sum_{i=1}^{k} E(b_i - \beta_i)^2 = \sum_{i=1}^{k} \mathrm{Var}(b_i).$
Estimation of $\sigma^2$

The least-squares criterion cannot be used to estimate $\sigma^2$ because $\sigma^2$ does not appear in $S(\beta)$. Since $E(\varepsilon_i^2) = \sigma^2$, we attempt to use the residuals $e_i$ to estimate $\sigma^2$ as follows:

$e = y - \hat{y} = y - X(X'X)^{-1}X'y = \left[I - X(X'X)^{-1}X'\right]y = \bar{H}y.$

Consider the residual sum of squares

$SS_{res} = \sum_{i=1}^{n} e_i^2 = e'e = (y - Xb)'(y - Xb) = y'(I - H)(I - H)y = y'(I - H)y = y'\bar{H}y.$

Also

$SS_{res} = (y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb = y'y - b'X'y$ (using $X'Xb = X'y$).

Further,

$SS_{res} = y'\bar{H}y = (X\beta + \varepsilon)'\bar{H}(X\beta + \varepsilon) = \varepsilon'\bar{H}\varepsilon$ (using $\bar{H}X = 0$).

Since $\varepsilon \sim N(0, \sigma^2 I)$, so $y \sim N(X\beta, \sigma^2 I)$. Hence $y'\bar{H}y \sim \sigma^2 \chi^2(n - k)$. Thus

$E[y'\bar{H}y] = (n - k)\sigma^2,$

or

$E\left[\dfrac{y'\bar{H}y}{n - k}\right] = \sigma^2,$

or

$E[MS_{res}] = \sigma^2,$

where $MS_{res} = \dfrac{SS_{res}}{n - k}$ is the mean sum of squares due to residuals. Thus an unbiased estimator of $\sigma^2$ is

$\hat{\sigma}^2 = MS_{res} = s^2$ (say),

which is a model-dependent estimator.
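A short numerical sketch (with simulated data, not from the notes) of $s^2 = SS_{res}/(n-k)$, checking that the sum-of-squared-residuals form $e'e$ agrees with the quadratic form $y'\bar{H}y$.

```python
import numpy as np

# Simulated data; the true error standard deviation is 2 (sigma^2 = 4).
rng = np.random.default_rng(1)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=2.0, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - k)          # unbiased estimator sigma_hat^2 = SS_res/(n-k)

# Equivalent quadratic-form computation SS_res = y'(I - H)y
H_bar = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
ss_res_qf = y @ H_bar @ y
```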
Variance of $\hat{y}$

The variance of $\hat{y}$ is

$V(\hat{y}) = V(Xb) = X\,V(b)\,X' = \sigma^2 X(X'X)^{-1}X' = \sigma^2 H.$
Gauss–Markov theorem:

The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of $\beta$.

Proof: The OLSE of $\beta$ is

$b = (X'X)^{-1}X'y,$

which is a linear function of $y$. Consider an arbitrary linear estimator $b^* = a'y$ of the linear parametric function $\ell'\beta$, where the elements of $a$ are arbitrary constants. Then for $b^*$,

$E(b^*) = E(a'y) = a'X\beta,$

and so $b^*$ is an unbiased estimator of $\ell'\beta$ when

$E(b^*) = a'X\beta = \ell'\beta,$

i.e., $a'X = \ell'$. Since we wish to consider only those estimators that are linear and unbiased, we restrict ourselves to those estimators for which $a'X = \ell'$.

Further,

$\mathrm{Var}(a'y) = a'\,\mathrm{Var}(y)\,a = \sigma^2 a'a$
$\mathrm{Var}(\ell'b) = \ell'\,\mathrm{Var}(b)\,\ell = \sigma^2 a'X(X'X)^{-1}X'a.$

Consider

$\mathrm{Var}(a'y) - \mathrm{Var}(\ell'b) = \sigma^2\left[a'a - a'X(X'X)^{-1}X'a\right] = \sigma^2 a'\left[I - X(X'X)^{-1}X'\right]a = \sigma^2 a'(I - H)a.$

Since $(I - H)$ is a positive semi-definite matrix,

$\mathrm{Var}(a'y) - \mathrm{Var}(\ell'b) \ge 0.$

This reveals that if $b^*$ is any linear unbiased estimator, then its variance must be no smaller than that of $b$. Consequently, $b$ is the best linear unbiased estimator, where 'best' refers to the fact that $b$ is efficient within the class of linear and unbiased estimators.
Maximum likelihood estimation:

In the model $y = X\beta + \varepsilon$, it is assumed that the errors are normally and independently distributed with constant variance $\sigma^2$, or $\varepsilon \sim N(0, \sigma^2 I)$. The normal density function for the errors is

$f(\varepsilon_i) = \dfrac{1}{\sigma\sqrt{2\pi}}\exp\left(-\dfrac{\varepsilon_i^2}{2\sigma^2}\right), \quad i = 1, 2, \ldots, n.$

The likelihood function is the joint density of $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ given as

$L(\beta, \sigma^2) = \prod_{i=1}^{n} f(\varepsilon_i) = \dfrac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\dfrac{1}{2\sigma^2}\sum_{i=1}^{n}\varepsilon_i^2\right) = \dfrac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\dfrac{1}{2\sigma^2}\varepsilon'\varepsilon\right)$
$\phantom{L(\beta, \sigma^2)} = \dfrac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\dfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right).$

Since the log transformation is monotonic, we maximize $\ln L(\beta, \sigma^2)$ instead of $L(\beta, \sigma^2)$:

$\ln L(\beta, \sigma^2) = -\dfrac{n}{2}\ln(2\pi\sigma^2) - \dfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$

The maximum likelihood estimators (m.l.e.) of $\beta$ and $\sigma^2$ are obtained by equating the first-order derivatives of $\ln L(\beta, \sigma^2)$ with respect to $\beta$ and $\sigma^2$ to zero as follows:

$\dfrac{\partial \ln L(\beta, \sigma^2)}{\partial \beta} = \dfrac{1}{2\sigma^2}\,2X'(y - X\beta) = 0$
$\dfrac{\partial \ln L(\beta, \sigma^2)}{\partial \sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) = 0.$

The likelihood equations are given by

$X'X\tilde{\beta} = X'y$
$\tilde{\sigma}^2 = \dfrac{1}{n}(y - X\tilde{\beta})'(y - X\tilde{\beta}).$

Since $\mathrm{rank}(X) = k$, the unique m.l.e.'s of $\beta$ and $\sigma^2$ are obtained as

$\tilde{\beta} = (X'X)^{-1}X'y$
$\tilde{\sigma}^2 = \dfrac{1}{n}(y - X\tilde{\beta})'(y - X\tilde{\beta}).$

Further, to verify that these values maximize the likelihood function, we find

$\dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial \beta\, \partial \beta'} = -\dfrac{1}{\sigma^2}X'X$
$\dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial (\sigma^2)^2} = \dfrac{n}{2\sigma^4} - \dfrac{1}{\sigma^6}(y - X\beta)'(y - X\beta)$
$\dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial \beta\, \partial \sigma^2} = -\dfrac{1}{\sigma^4}X'(y - X\beta).$

Thus the Hessian matrix of second-order partial derivatives of $\ln L(\beta, \sigma^2)$ with respect to $\beta$ and $\sigma^2$ is

$\begin{pmatrix} \dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial \beta\, \partial \beta'} & \dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial \beta\, \partial \sigma^2} \\[1ex] \dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial \sigma^2\, \partial \beta'} & \dfrac{\partial^2 \ln L(\beta, \sigma^2)}{\partial (\sigma^2)^2} \end{pmatrix},$

which is negative definite at $\beta = \tilde{\beta}$ and $\sigma^2 = \tilde{\sigma}^2$. This ensures that the likelihood function is maximized at these values.

Comparing with the OLSEs, we find that
(i) the OLSE and m.l.e. of $\beta$ are the same, so the m.l.e. of $\beta$ is also an unbiased estimator of $\beta$;
(ii) the OLSE of $\sigma^2$ is $s^2$, which is related to the m.l.e. of $\sigma^2$ as $\tilde{\sigma}^2 = \dfrac{n - k}{n}\,s^2$. So the m.l.e. of $\sigma^2$ is a biased estimator of $\sigma^2$.
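The relation $\tilde{\sigma}^2 = \dfrac{n-k}{n}\,s^2$ between the two variance estimators can be verified directly. A minimal sketch with simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLSE of beta = m.l.e. of beta
e = y - X @ b
s2 = e @ e / (n - k)                    # unbiased estimator s^2
sigma2_mle = e @ e / n                  # m.l.e., biased downward
```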
Consistency of estimators

(i) Consistency of $b$:

Under the assumption that $\lim_{n \to \infty} \dfrac{X'X}{n} = \Delta$ exists as a non-stochastic and nonsingular matrix (with finite elements), we have

$\lim_{n \to \infty} V(b) = \lim_{n \to \infty} \dfrac{\sigma^2}{n}\left(\dfrac{X'X}{n}\right)^{-1} = \sigma^2 \lim_{n \to \infty} \dfrac{1}{n}\,\Delta^{-1} = 0.$

This implies that the OLSE converges to $\beta$ in quadratic mean. Thus the OLSE is a consistent estimator of $\beta$. This holds true for the maximum likelihood estimator also.

The same conclusion can also be proved using the concept of convergence in probability. An estimator $\hat{\theta}_n$ converges to $\theta$ in probability if

$\lim_{n \to \infty} P\left[\,|\hat{\theta}_n - \theta| \ge \delta\,\right] = 0$ for any $\delta > 0$,

and this is denoted as $\mathrm{plim}(\hat{\theta}_n) = \theta$. The consistency of the OLSE can be obtained under the weaker assumption that

$\mathrm{plim}\left(\dfrac{X'X}{n}\right) = \Delta^*$

exists and is a nonsingular and non-stochastic matrix, and that

$\mathrm{plim}\left(\dfrac{X'\varepsilon}{n}\right) = 0.$

Since

$b - \beta = (X'X)^{-1}X'\varepsilon = \left(\dfrac{X'X}{n}\right)^{-1}\dfrac{X'\varepsilon}{n},$

we have

$\mathrm{plim}(b - \beta) = \mathrm{plim}\left(\dfrac{X'X}{n}\right)^{-1}\mathrm{plim}\left(\dfrac{X'\varepsilon}{n}\right) = \Delta^{*-1}\cdot 0 = 0.$

Thus $b$ is a consistent estimator of $\beta$. The same is true for the m.l.e. also.

(ii) Consistency of $s^2$:

Now we look at the consistency of $s^2$ as an estimator of $\sigma^2$. We have

$s^2 = \dfrac{1}{n - k}\,e'e = \dfrac{1}{n - k}\,\varepsilon'\bar{H}\varepsilon = \left(1 - \dfrac{k}{n}\right)^{-1}\dfrac{1}{n}\left[\varepsilon'\varepsilon - \varepsilon'X(X'X)^{-1}X'\varepsilon\right]$
$\phantom{s^2} = \left(1 - \dfrac{k}{n}\right)^{-1}\left[\dfrac{\varepsilon'\varepsilon}{n} - \dfrac{\varepsilon'X}{n}\left(\dfrac{X'X}{n}\right)^{-1}\dfrac{X'\varepsilon}{n}\right].$
Note that $\dfrac{\varepsilon'\varepsilon}{n} = \dfrac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2$ and $\{\varepsilon_i^2,\ i = 1, 2, \ldots, n\}$ is a sequence of independently and identically distributed random variables with mean $\sigma^2$. Using the law of large numbers,

$\mathrm{plim}\left(\dfrac{\varepsilon'\varepsilon}{n}\right) = \sigma^2$
$\mathrm{plim}\left[\dfrac{\varepsilon'X}{n}\left(\dfrac{X'X}{n}\right)^{-1}\dfrac{X'\varepsilon}{n}\right] = \mathrm{plim}\left(\dfrac{\varepsilon'X}{n}\right)\,\mathrm{plim}\left(\dfrac{X'X}{n}\right)^{-1}\,\mathrm{plim}\left(\dfrac{X'\varepsilon}{n}\right) = 0\cdot\Delta^{*-1}\cdot 0 = 0.$

Thus

$\mathrm{plim}(s^2) = (1 - 0)^{-1}(\sigma^2 - 0) = \sigma^2,$

and so $s^2$ is a consistent estimator of $\sigma^2$. The same holds true for the m.l.e. also.
Cramér–Rao lower bound

Let $\theta = (\beta, \sigma^2)'$. Assume that both $\beta$ and $\sigma^2$ are unknown. If $E(\hat{\theta}) = \theta$, then the covariance matrix of $\hat{\theta}$ is greater than or equal to (in the matrix sense) the inverse of the information matrix

$I(\theta) = -E\left[\dfrac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right] = \begin{pmatrix} -E\left[\dfrac{\partial^2 \ln L}{\partial \beta\, \partial \beta'}\right] & -E\left[\dfrac{\partial^2 \ln L}{\partial \beta\, \partial \sigma^2}\right] \\[1ex] -E\left[\dfrac{\partial^2 \ln L}{\partial \sigma^2\, \partial \beta'}\right] & -E\left[\dfrac{\partial^2 \ln L}{\partial (\sigma^2)^2}\right] \end{pmatrix}$
$\phantom{I(\theta)} = \begin{pmatrix} E\left[\dfrac{X'X}{\sigma^2}\right] & E\left[\dfrac{X'(y - X\beta)}{\sigma^4}\right] \\[1ex] E\left[\dfrac{(y - X\beta)'X}{\sigma^4}\right] & E\left[-\dfrac{n}{2\sigma^4} + \dfrac{(y - X\beta)'(y - X\beta)}{\sigma^6}\right] \end{pmatrix} = \begin{pmatrix} \dfrac{X'X}{\sigma^2} & 0 \\[1ex] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$

Then

$I^{-1}(\theta) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\[1ex] 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}$

is the Cramér–Rao lower bound matrix for $\beta$ and $\sigma^2$.

The covariance matrix of the OLSEs of $\beta$ and $\sigma^2$ is

$\Sigma_{OLS} = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\[1ex] 0 & \dfrac{2\sigma^4}{n - k} \end{pmatrix},$

which means that the Cramér–Rao lower bound is attained for the covariance of $b$ but not for that of $s^2$.
Standardized regression coefficients:

Usually, it is difficult to compare the regression coefficients because the magnitude of $\hat{\beta}_j$ reflects the units of measurement of the $j$th explanatory variable $X_j$. For example, in the fitted regression model

$\hat{y} = 5X_1 + 1000X_2,$

$y$ is measured in litres, $X_1$ in litres and $X_2$ in millilitres. Although $\hat{\beta}_2 \gg \hat{\beta}_1$, the effect of both explanatory variables is identical: a one litre change in either $X_1$ or $X_2$, when the other variable is held fixed, produces the same change in $\hat{y}$.

Sometimes it is helpful to work with scaled explanatory variables and study variable that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients. There are two popular approaches for scaling which give standardized regression coefficients. We discuss them as follows:

1. Unit normal scaling:

Apply unit normal scaling to each explanatory variable and the study variable. Define

$z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, k$
$y_i^* = \dfrac{y_i - \bar{y}}{s_y},$

where

$s_j^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ and $s_y^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n}(y_i - \bar{y})^2$

are the sample variances of the $j$th explanatory variable and the study variable, respectively. All scaled explanatory variables and the scaled study variable have mean zero and sample variance unity. Using these new variables, the regression model becomes

$y_i^* = \gamma_1 z_{i1} + \gamma_2 z_{i2} + \cdots + \gamma_k z_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n.$

Such centering removes the intercept term from the model. The least-squares estimate of $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_k)'$ is

$\hat{\gamma} = (Z'Z)^{-1}Z'y^*.$

This scaling has a similarity to standardizing a normal random variable, i.e., subtracting its mean and dividing by its standard deviation. So it is called unit normal scaling.
2. Unit length scaling:

In unit length scaling, define

$\omega_{ij} = \dfrac{x_{ij} - \bar{x}_j}{S_{jj}^{1/2}}, \quad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k$
$y_i^0 = \dfrac{y_i - \bar{y}}{SS_T^{1/2}},$

where $S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ is the corrected sum of squares for the $j$th explanatory variable $X_j$ and $SS_T = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares. In this scaling, each new explanatory variable $W_j$ has mean $\bar{\omega}_j = \dfrac{1}{n}\sum_{i=1}^{n}\omega_{ij} = 0$ and unit length $\left[\sum_{i=1}^{n}(\omega_{ij} - \bar{\omega}_j)^2\right]^{1/2} = 1$.

In terms of these variables, the regression model is

$y_i^0 = \delta_1\omega_{i1} + \delta_2\omega_{i2} + \cdots + \delta_k\omega_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n.$

The least-squares estimate of the regression coefficient $\delta = (\delta_1, \delta_2, \ldots, \delta_k)'$ is

$\hat{\delta} = (W'W)^{-1}W'y^0.$

In such a case, the matrix $W'W$ is in the form of a correlation matrix, i.e.,

$W'W = \begin{pmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1k} \\ r_{12} & 1 & r_{23} & \cdots & r_{2k} \\ r_{13} & r_{23} & 1 & \cdots & r_{3k} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1k} & r_{2k} & r_{3k} & \cdots & 1 \end{pmatrix},$

where

$r_{ij} = \dfrac{\sum_{u=1}^{n}(x_{ui} - \bar{x}_i)(x_{uj} - \bar{x}_j)}{(S_{ii}S_{jj})^{1/2}} = \dfrac{S_{ij}}{(S_{ii}S_{jj})^{1/2}}$

is the simple correlation coefficient between the explanatory variables $X_i$ and $X_j$. Similarly

$W'y^0 = (r_{1y}, r_{2y}, \ldots, r_{ky})',$

where

$r_{jy} = \dfrac{\sum_{u=1}^{n}(x_{uj} - \bar{x}_j)(y_u - \bar{y})}{(S_{jj}SS_T)^{1/2}} = \dfrac{S_{jy}}{(S_{jj}SS_T)^{1/2}}$

is the simple correlation coefficient between the $j$th explanatory variable $X_j$ and the study variable $y$.

Note that it is customary to refer to $r_{ij}$ and $r_{jy}$ as correlation coefficients even though the $X_i$'s are not random variables. If unit normal scaling is used, then

$Z'Z = (n - 1)W'W.$

So the estimates of the regression coefficients under unit normal scaling (i.e., $\hat{\gamma}$) and unit length scaling (i.e., $\hat{\delta}$) are identical, i.e., $\hat{\gamma} = \hat{\delta}$; it does not matter which scaling is used. The regression coefficients obtained after such scaling, viz., $\hat{\gamma}$ or $\hat{\delta}$, are usually called standardized regression coefficients.
The relationship between the original and standardized regression coefficients is

$b_j = \hat{\delta}_j\left(\dfrac{SS_T}{S_{jj}}\right)^{1/2}, \quad j = 1, 2, \ldots, k,$

and

$b_0 = \bar{y} - \sum_{j=1}^{k} b_j\bar{x}_j,$

where $b_0$ is the OLSE of the intercept term and the $b_j$ are the OLSEs of the slope parameters.
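The unit length scaling and the back-transformation $b_j = \hat{\delta}_j (SS_T/S_{jj})^{1/2}$ can be sketched numerically as follows (illustrative simulated data, two explanatory variables):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X2 = rng.normal(size=(n, 2))                    # two explanatory variables
y = 3.0 + 2.0 * X2[:, 0] - 1.5 * X2[:, 1] + rng.normal(size=n)

# Unit length scaling: each column of W has mean 0 and length 1.
Xc = X2 - X2.mean(axis=0)
Sjj = (Xc ** 2).sum(axis=0)                     # corrected sums of squares
W = Xc / np.sqrt(Sjj)
SST = ((y - y.mean()) ** 2).sum()
y0 = (y - y.mean()) / np.sqrt(SST)

delta = np.linalg.solve(W.T @ W, W.T @ y0)      # standardized coefficients

# Recover the original slopes: b_j = delta_j * sqrt(SST / S_jj)
b_slopes = delta * np.sqrt(SST / Sjj)

# Direct OLS on the original data for comparison.
Xfull = np.column_stack([np.ones(n), X2])
b_direct = np.linalg.solve(Xfull.T @ Xfull, Xfull.T @ y)
```

Here `W.T @ W` is exactly the sample correlation matrix of the explanatory variables, as stated above.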
The model in deviation form

The multiple linear regression model can also be expressed in deviation form, in which all the data are expressed in terms of deviations from the sample mean. The estimation of the regression parameters is then performed in two steps:

First step: estimate the slope parameters.
Second step: estimate the intercept term.

The multiple linear regression model in deviation form is expressed as follows. Let

$A = I - \dfrac{1}{n}\ell\ell',$

where $\ell = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector with each element unity. So

$A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} - \dfrac{1}{n}\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$

Then

$\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i = \dfrac{1}{n}(1, 1, \ldots, 1)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \dfrac{1}{n}\ell'y$

and

$Ay = (y_1 - \bar{y},\ y_2 - \bar{y},\ \ldots,\ y_n - \bar{y})'.$

Thus pre-multiplication of any column vector by $A$ produces that vector of observations in deviation form. Note that

$A\ell = \ell - \dfrac{1}{n}\ell(\ell'\ell) = \ell - \dfrac{1}{n}\ell\,n = \ell - \ell = 0,$

and $A$ is a symmetric and idempotent matrix.
In the model $y = X\beta + \varepsilon$, the OLSE of $\beta$ is

$b = (X'X)^{-1}X'y$

and the residual vector is $e = y - Xb$. Note that $Ae = e$.

If the $n \times k$ matrix $X$ is partitioned as

$X = [X_1^* \; X_2^*],$

where $X_1^* = \ell = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector with all elements unity and $X_2^*$ is an $n \times (k - 1)$ matrix of observations on the $k - 1$ explanatory variables $X_2, X_3, \ldots, X_k$, and the OLSE $b = (b_1, b_2^{*\prime})'$ is suitably partitioned with $b_1$ as the OLSE of the intercept term $\beta_1$ and $b_2^*$ as the $(k - 1) \times 1$ vector of OLSEs associated with $\beta_2, \beta_3, \ldots, \beta_k$, then

$y = X_1^* b_1 + X_2^* b_2^* + e.$

Premultiplying by $A$,

$Ay = AX_1^* b_1 + AX_2^* b_2^* + Ae = AX_2^* b_2^* + e,$

since $AX_1^* = A\ell = 0$ and $Ae = e$.
Premultiplying by $X_2^{*\prime}$ gives

$X_2^{*\prime}Ay = X_2^{*\prime}AX_2^* b_2^* + X_2^{*\prime}e = X_2^{*\prime}AX_2^* b_2^*,$

since $X_2^{*\prime}e = 0$. Since $A$ is symmetric and idempotent,

$(AX_2^*)'(Ay) = (AX_2^*)'(AX_2^*)\,b_2^*.$

This equation can be compared with the normal equations $X'y = X'Xb$ in the model $y = X\beta + \varepsilon$. Such a comparison yields the following conclusions:
$b_2^*$ is the sub-vector of the OLSE;
$Ay$ is the study-variable vector in deviation form;
$AX_2^*$ is the explanatory-variable matrix in deviation form.

This is the normal equation in terms of deviations. Its solution gives the OLSEs of the slope coefficients as

$b_2^* = \left[(AX_2^*)'(AX_2^*)\right]^{-1}(AX_2^*)'Ay.$

The estimate of the intercept term is obtained in the second step as follows. Premultiplying $y = Xb + e$ by $\dfrac{1}{n}\ell'$ gives

$\dfrac{1}{n}\ell'y = \dfrac{1}{n}\ell'Xb + \dfrac{1}{n}\ell'e$
$\bar{y} = (1,\ \bar{x}_2,\ \bar{x}_3,\ \ldots,\ \bar{x}_k)\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix} + 0$
$b_1 = \bar{y} - b_2\bar{x}_2 - b_3\bar{x}_3 - \cdots - b_k\bar{x}_k.$

Now we explain the various sums of squares in terms of this model.
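The two-step deviation-form procedure above can be sketched numerically; the slopes from the centered normal equations and the intercept recovered in the second step agree with one-step OLS on $[\ell \; X_2^*]$. Illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 15
X2 = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X2[:, 0] + 2.0 * X2[:, 1] + rng.normal(size=n)

# Centering matrix A = I - (1/n) ll'
A = np.eye(n) - np.ones((n, n)) / n

# Step 1: slopes from the deviation-form normal equations
AX2 = A @ X2
b2 = np.linalg.solve(AX2.T @ AX2, AX2.T @ (A @ y))

# Step 2: intercept b1 = ybar - sum_j b_j xbar_j
b1 = y.mean() - X2.mean(axis=0) @ b2

# Comparison with one-step OLS on [1, X2]
Xfull = np.column_stack([np.ones(n), X2])
b_full = np.linalg.solve(Xfull.T @ Xfull, Xfull.T @ y)
```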
The expression for the total sum of squares (TSS) remains the same as earlier and is given by

$TSS = y'Ay.$

Since

$Ay = AX_2^* b_2^* + e,$

we have

$y'Ay = y'AX_2^* b_2^* + y'e$
$\phantom{y'Ay} = (X_1^* b_1 + X_2^* b_2^* + e)'AX_2^* b_2^* + (X_1^* b_1 + X_2^* b_2^* + e)'e$
$\phantom{y'Ay} = b_2^{*\prime}X_2^{*\prime}AX_2^* b_2^* + e'e,$

i.e.,

$TSS = SS_{reg} + SS_{res},$

where the sum of squares due to regression is

$SS_{reg} = b_2^{*\prime}X_2^{*\prime}AX_2^* b_2^*$

and the sum of squares due to residuals is

$SS_{res} = e'e.$
Testing of hypothesis:

There are several important questions which can be answered through tests of hypotheses concerning the regression coefficients. For example:
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?

In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis. Several tests of hypotheses can then be derived as its special cases. So first we discuss the test of a general linear hypothesis.

Test of hypothesis for $H_0: R\beta = r$

We consider a general linear hypothesis that the parameters in $\beta$ are contained in a subspace of the parameter space for which $R\beta = r$, where $R$ is a $J \times k$ matrix of known elements and $r$ is a $J \times 1$ vector of known elements.
In general, the null hypothesis

$H_0: R\beta = r$

is termed the general linear hypothesis, and

$H_1: R\beta \ne r$

is the alternative hypothesis. We assume that $\mathrm{rank}(R) = J$, i.e., $R$ has full row rank, so that there is no linear dependence among the restrictions in the hypothesis.

Some special cases and interesting examples of $H_0: R\beta = r$ are as follows:

(i) $H_0: \beta_i = 0.$ Choose $J = 1$, $r = 0$, $R = [0, 0, \ldots, 0, 1, 0, \ldots, 0]$, where 1 occurs at the $i$th position in $R$. This particular hypothesis explains whether $X_i$ has any effect on the linear model or not.

(ii) $H_0: \beta_3 = \beta_4$ or $H_0: \beta_3 - \beta_4 = 0.$ Choose $J = 1$, $r = 0$, $R = [0, 0, 1, -1, 0, \ldots, 0].$

(iii) $H_0: \beta_3 = \beta_4 = \beta_5$, or $H_0: \beta_3 - \beta_4 = 0,\ \beta_3 - \beta_5 = 0.$ Choose

$J = 2, \quad r = (0, 0)', \quad R = \begin{pmatrix} 0 & 0 & 1 & -1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & -1 & 0 & \cdots & 0 \end{pmatrix}.$

(iv) $H_0: \beta_3 + 5\beta_4 = 2.$ Choose $J = 1$, $r = 2$, $R = [0, 0, 1, 5, 0, \ldots, 0].$

(v) $H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0.$ Choose

$J = k - 1, \quad r = (0, 0, \ldots, 0)', \quad R = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} = [0 \; I_{k-1}]_{(k-1)\times k}.$

This particular hypothesis explains the goodness of fit. It tells whether the $\beta_i$ have a linear effect or not and whether they are of any importance. It also tests whether $X_2, X_3, \ldots, X_k$ have any influence in the determination of $y$. Here $\beta_1 = 0$ is excluded because this would involve the additional implication that the mean level of $y$ is zero. Our main concern is to know whether the explanatory variables help to explain the variation in $y$ around its mean value or not.

We develop the likelihood ratio test for $H_0: R\beta = r$.
Likelihood ratio test:

The likelihood ratio test statistic is

$\lambda = \dfrac{\max L(\beta, \sigma^2 \mid y, X, R\beta = r)}{\max L(\beta, \sigma^2 \mid y, X)} = \dfrac{\hat{L}(\omega)}{\hat{L}(\Omega)},$

where $\Omega$ is the whole parametric space and $\omega$ is the subspace of $\Omega$ restricted by $H_0$. If both likelihoods are maximized, one constrained and the other unconstrained, then the value of the unconstrained maximum will not be smaller than the value of the constrained maximum. Hence $\lambda \le 1$.

First, we discuss the likelihood ratio test for the more straightforward case when $R = I_k$ and $r = \beta_0$, i.e., $H_0: \beta = \beta_0$. This gives a detailed understanding of the finer points, and then we generalize it to $R\beta = r$.
Likelihood ratio test for $H_0: \beta = \beta_0$

Let the null hypothesis related to the $k \times 1$ vector $\beta$ be

$H_0: \beta = \beta_0,$

where $\beta_0$ is specified by the investigator. The elements of $\beta_0$ can take any value, including zero. The concerned alternative hypothesis is

$H_1: \beta \ne \beta_0.$

Since $\varepsilon \sim N(0, \sigma^2 I)$ in $y = X\beta + \varepsilon$, we have $y \sim N(X\beta, \sigma^2 I)$. Thus the whole parametric space and the restricted subspace are, respectively,

$\Omega = \{(\beta, \sigma^2): -\infty < \beta_i < \infty,\ \sigma^2 > 0,\ i = 1, 2, \ldots, k\}$
$\omega = \{(\beta, \sigma^2): \beta = \beta_0,\ \sigma^2 > 0\}.$

The unconstrained likelihood under $\Omega$ is

$L(\beta, \sigma^2 \mid y, X) = \dfrac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\dfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right].$

This is maximized over $\Omega$ when

$\tilde{\beta} = (X'X)^{-1}X'y$
$\tilde{\sigma}^2 = \dfrac{1}{n}(y - X\tilde{\beta})'(y - X\tilde{\beta}),$

where $\tilde{\beta}$ and $\tilde{\sigma}^2$ are the maximum likelihood estimates of $\beta$ and $\sigma^2$. Hence

$\hat{L}(\Omega) = \max L(\beta, \sigma^2 \mid y, X) = \left[\dfrac{n}{2\pi(y - X\tilde{\beta})'(y - X\tilde{\beta})}\right]^{n/2}\exp\left(-\dfrac{n}{2}\right).$

The constrained likelihood under $\omega$ is

$\hat{L}(\omega) = \max L(\beta, \sigma^2 \mid y, X, \beta = \beta_0).$

Since $\beta_0$ is known, the constrained likelihood function has the optimum variance estimator

$\hat{\sigma}_\omega^2 = \dfrac{1}{n}(y - X\beta_0)'(y - X\beta_0),$

so

$\hat{L}(\omega) = \left[\dfrac{n}{2\pi(y - X\beta_0)'(y - X\beta_0)}\right]^{n/2}\exp\left(-\dfrac{n}{2}\right).$

The likelihood ratio is

$\lambda = \dfrac{\hat{L}(\omega)}{\hat{L}(\Omega)} = \left[\dfrac{(y - X\tilde{\beta})'(y - X\tilde{\beta})}{(y - X\beta_0)'(y - X\beta_0)}\right]^{n/2},$

so that

$\lambda^{-2/n} = \dfrac{(y - X\beta_0)'(y - X\beta_0)}{(y - X\tilde{\beta})'(y - X\tilde{\beta})} = 1 + \lambda_0,$

where

$\lambda_0 = \dfrac{(y - X\beta_0)'(y - X\beta_0) - (y - X\tilde{\beta})'(y - X\tilde{\beta})}{(y - X\tilde{\beta})'(y - X\tilde{\beta})} \ge 0$

is a ratio of quadratic forms. Now we simplify the numerator in $\lambda_0$ as follows:

$(y - X\beta_0)'(y - X\beta_0) = \left[(y - X\tilde{\beta}) + X(\tilde{\beta} - \beta_0)\right]'\left[(y - X\tilde{\beta}) + X(\tilde{\beta} - \beta_0)\right]$
$\phantom{(y - X\beta_0)'(y - X\beta_0)} = (y - X\tilde{\beta})'(y - X\tilde{\beta}) + (\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0),$

since the cross-product term $2(\tilde{\beta} - \beta_0)'X'(y - X\tilde{\beta}) = 0$ (as $X'(y - X\tilde{\beta}) = 0$). Thus

$\lambda_0 = \dfrac{(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0)}{(y - X\tilde{\beta})'(y - X\tilde{\beta})}.$
Distribution of the ratio of quadratic forms

Now we find the distribution of the quadratic forms involved in $\lambda_0$. For the denominator,

$(y - X\tilde{\beta})'(y - X\tilde{\beta}) = e'e = y'\left[I - X(X'X)^{-1}X'\right]y = y'\bar{H}y = \varepsilon'\bar{H}\varepsilon$ (using $\bar{H}X = 0$)
$= (n - k)\hat{\sigma}^2.$

Result: If $Z$ is an $n \times 1$ random vector that is distributed as $N(0, \sigma^2 I_n)$ and $A$ is any $n \times n$ symmetric idempotent matrix of rank $p$, then $\dfrac{Z'AZ}{\sigma^2} \sim \chi^2(p)$. If $B$ is another $n \times n$ symmetric idempotent matrix of rank $q$, then $\dfrac{Z'BZ}{\sigma^2} \sim \chi^2(q)$. If $AB = 0$, then $Z'AZ$ is distributed independently of $Z'BZ$.

So using this result, we have

$\dfrac{y'\bar{H}y}{\sigma^2} = \dfrac{(n - k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - k).$
Further, if $H_0$ is true, then $\beta = \beta_0$, and the numerator in $\lambda_0$ becomes

$(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0) = \varepsilon'X(X'X)^{-1}X'X(X'X)^{-1}X'\varepsilon = \varepsilon'X(X'X)^{-1}X'\varepsilon = \varepsilon'H\varepsilon,$

where $H = X(X'X)^{-1}X'$ is an idempotent matrix of rank $k$. Thus, using the above result,

$\dfrac{\varepsilon'H\varepsilon}{\sigma^2} = \dfrac{\varepsilon'X(X'X)^{-1}X'\varepsilon}{\sigma^2} \sim \chi^2(k).$

Furthermore, the product of the quadratic-form matrices in the numerator ($H$) and denominator ($\bar{H}$) of $\lambda_0$ is

$\left[I - X(X'X)^{-1}X'\right]X(X'X)^{-1}X' = X(X'X)^{-1}X' - X(X'X)^{-1}X'X(X'X)^{-1}X' = 0,$

and hence the random variables in the numerator and denominator of $\lambda_0$ are independent. Dividing each of the $\chi^2$ random variables by its respective degrees of freedom, we obtain

$\lambda_1 = \dfrac{(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0)/k}{(n - k)\hat{\sigma}^2/(n - k)} = \dfrac{(\tilde{\beta} - \beta_0)'X'X(\tilde{\beta} - \beta_0)}{k\hat{\sigma}^2} = \dfrac{(y - X\beta_0)'(y - X\beta_0) - (y - X\tilde{\beta})'(y - X\tilde{\beta})}{k\hat{\sigma}^2}$
$\sim F(k, n - k)$ under $H_0$.

Note that
$(y - X\beta_0)'(y - X\beta_0)$: restricted error sum of squares,
$(y - X\tilde{\beta})'(y - X\tilde{\beta})$: unrestricted error sum of squares,

so the numerator in $\lambda_1$ is the difference between the restricted and unrestricted error sums of squares.

The decision rule is to reject $H_0: \beta = \beta_0$ at level of significance $\alpha$ whenever

$\lambda_1 \ge F_\alpha(k, n - k),$

where $F_\alpha(k, n - k)$ is the upper $\alpha$ critical point of the central $F$-distribution with $k$ and $n - k$ degrees of freedom.
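The test statistic $\lambda_1$ for $H_0: \beta = \beta_0$ can be computed directly from the restricted and unrestricted error sums of squares. A minimal sketch with simulated data generated under $H_0$ (illustrative only; `scipy.stats.f.sf` gives the upper-tail $F$ probability):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta0 = np.array([1.0, 0.5, -0.5])              # hypothesized value
y = X @ beta0 + rng.normal(size=n)              # data generated under H0

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sigma2_hat = e @ e / (n - k)

# Restricted and unrestricted error sums of squares
sse_r = (y - X @ beta0) @ (y - X @ beta0)
sse_u = e @ e

F = (sse_r - sse_u) / (k * sigma2_hat)          # ~ F(k, n-k) under H0
p_value = stats.f.sf(F, k, n - k)
```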
Likelihood ratio test for $H_0: R\beta = r$

The same logic and reasoning used in the development of the likelihood ratio test for $H_0: \beta = \beta_0$ can be extended to develop the likelihood ratio test for $H_0: R\beta = r$ as follows. Now

$\Omega = \{(\beta, \sigma^2): -\infty < \beta_i < \infty,\ \sigma^2 > 0,\ i = 1, 2, \ldots, k\}$
$\omega = \{(\beta, \sigma^2): -\infty < \beta_i < \infty,\ R\beta = r,\ \sigma^2 > 0\}.$

Let $\tilde{\beta} = (X'X)^{-1}X'y$. Then

$E(R\tilde{\beta}) = R\beta$
$V(R\tilde{\beta}) = E\left[R(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)'R'\right] = R\,V(\tilde{\beta})\,R' = \sigma^2 R(X'X)^{-1}R'.$

Since $\tilde{\beta} \sim N\left(\beta, \sigma^2(X'X)^{-1}\right)$, it follows that

$R\tilde{\beta} \sim N\left(R\beta, \sigma^2 R(X'X)^{-1}R'\right),$

and under $H_0: R\beta = r$,

$R\tilde{\beta} - r = R(\tilde{\beta} - \beta) \sim N\left(0, \sigma^2 R(X'X)^{-1}R'\right).$

There exists a matrix $Q$ such that $\left[R(X'X)^{-1}R'\right]^{-1} = QQ'$, and then

$\dfrac{Q'R(\tilde{\beta} - \beta)}{\sigma} \sim N(0, I_J).$

Therefore, under $H_0: R\beta - r = 0$,

$\dfrac{(R\tilde{\beta} - r)'QQ'(R\tilde{\beta} - r)}{\sigma^2} = \dfrac{(R\tilde{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\tilde{\beta} - r)}{\sigma^2}$
$= \dfrac{(\tilde{\beta} - \beta)'R'\left[R(X'X)^{-1}R'\right]^{-1}R(\tilde{\beta} - \beta)}{\sigma^2}$
$= \dfrac{\varepsilon'X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon}{\sigma^2}$
$\sim \chi^2(J),$

which follows because $X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'$ is an idempotent matrix and its trace is $J$, the associated degrees of freedom.
Also, irrespective of whether $H_0$ is true or not,

$\dfrac{e'e}{\sigma^2} = \dfrac{(y - X\tilde{\beta})'(y - X\tilde{\beta})}{\sigma^2} = \dfrac{y'\bar{H}y}{\sigma^2} = \dfrac{(n - k)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - k).$

Moreover, the product of the quadratic-form matrices of $e'e$ and $(\tilde{\beta} - \beta)'R'\left[R(X'X)^{-1}R'\right]^{-1}R(\tilde{\beta} - \beta)$ is zero, implying that the two quadratic forms are independent. So, in terms of the likelihood ratio test, the statistic

$\lambda_1 = \dfrac{(R\tilde{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\tilde{\beta} - r)/J}{(n - k)\hat{\sigma}^2/(n - k)} = \dfrac{(R\tilde{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\tilde{\beta} - r)}{J\hat{\sigma}^2}$
$\sim F(J, n - k)$ under $H_0$.

So the decision rule is to reject $H_0$ whenever

$\lambda_1 \ge F_\alpha(J, n - k),$

where $F_\alpha(J, n - k)$ is the upper $\alpha$ critical point of the central $F$-distribution with $J$ and $n - k$ degrees of freedom.
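The general statistic can be coded directly from the formula above. A sketch (simulated data; the hypothesis and its $R$, $r$ are illustrative, testing $\beta_2 = \beta_3$ and $\beta_4 = 0$ jointly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 2.0, 0.0]) + rng.normal(size=n)

# H0: beta_2 = beta_3 and beta_4 = 0  (J = 2 restrictions)
R = np.array([[0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(2)
J = R.shape[0]

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sigma2_hat = e @ e / (n - k)

# lambda_1 = (Rb - r)' [R (X'X)^{-1} R']^{-1} (Rb - r) / (J * sigma2_hat)
XtX_inv = np.linalg.inv(X.T @ X)
diff = R @ b - r
F = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / (J * sigma2_hat)
p_value = stats.f.sf(F, J, n - k)
```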
Test of significance of regression (analysis of variance)

If we set $R = [0 \; I_{k-1}]$ and $r = 0$, then the hypothesis $H_0: R\beta = r$ reduces to the following null hypothesis:

$H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0,$

tested against the alternative hypothesis

$H_1: \beta_j \ne 0$ for at least one $j = 2, 3, \ldots, k.$

This hypothesis determines whether there is a linear relationship between $y$ and any of the explanatory variables $X_2, X_3, \ldots, X_k$. Notice that $X_1$ corresponds to the intercept term in the model, and hence $x_{i1} = 1$ for all $i = 1, 2, \ldots, n$.

This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables among $X_2, X_3, \ldots, X_k$ contributes significantly to the model. This is called analysis of variance.

Since $\varepsilon \sim N(0, \sigma^2 I)$,

$y \sim N(X\beta, \sigma^2 I)$
$b = (X'X)^{-1}X'y \sim N\left(\beta, \sigma^2(X'X)^{-1}\right).$

Also,

$\hat{\sigma}^2 = \dfrac{SS_{res}}{n - k} = \dfrac{(y - \hat{y})'(y - \hat{y})}{n - k} = \dfrac{y'\left[I - X(X'X)^{-1}X'\right]y}{n - k} = \dfrac{y'\bar{H}y}{n - k} = \dfrac{y'y - b'X'y}{n - k}.$

Since $(X'X)^{-1}X'\bar{H} = 0$, $b$ and $\hat{\sigma}^2$ are independently distributed.
Since $y'\bar{H}y = \varepsilon'\bar{H}\varepsilon$ and $\bar{H}$ is an idempotent matrix,

$\dfrac{SS_{res}}{\sigma^2} \sim \chi^2(n - k),$

i.e., a central $\chi^2$ distribution with $n - k$ degrees of freedom.

Partition $X = [X_1^* \; X_2^*]$, where the submatrix $X_2^*$ contains the explanatory variables $X_2, X_3, \ldots, X_k$, and partition $\beta = (\beta_1, \beta_2^{*\prime})'$, where the subvector $\beta_2^*$ contains the regression coefficients $\beta_2, \beta_3, \ldots, \beta_k$.

Now partition the total sum of squares due to the $y_i$'s as

$SS_T = y'Ay = SS_{reg} + SS_{res},$

where $SS_{reg} = b_2^{*\prime}X_2^{*\prime}AX_2^* b_2^*$ is the sum of squares due to regression and the sum of squares due to residuals is

$SS_{res} = (y - Xb)'(y - Xb) = y'\bar{H}y = SS_T - SS_{reg}.$

Further,

$\dfrac{SS_{reg}}{\sigma^2} \sim \chi^2\left(k - 1,\ \dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right),$

i.e., a non-central $\chi^2$ distribution with non-centrality parameter $\dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}$, and

$\dfrac{SS_T}{\sigma^2} \sim \chi^2\left(n - 1,\ \dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right),$

i.e., a non-central $\chi^2$ distribution with the same non-centrality parameter.

Since $X_2^{*\prime}e = 0$, $SS_{reg}$ and $SS_{res}$ are independently distributed. The mean square due to regression is

$MS_{reg} = \dfrac{SS_{reg}}{k - 1}$

and the mean square due to error is

$MS_{res} = \dfrac{SS_{res}}{n - k}.$

Then

$\dfrac{MS_{reg}}{MS_{res}} \sim F\left(k - 1,\ n - k;\ \dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}\right),$

which is a non-central $F$-distribution with $(k - 1,\ n - k)$ degrees of freedom and non-centrality parameter $\dfrac{\beta_2^{*\prime}X_2^{*\prime}AX_2^*\beta_2^*}{2\sigma^2}$.
Under H02:..., 3 k
MSreg FF ~.knk1, MSres
The decision rule is to reject at level of significance whenever
FFk (1,). nk
The calculation of F -statistic can be summarized in the form of an analysis of variance (ANOVA) table given as follows:
Source of variation   Sum of squares   Degrees of freedom   Mean squares                      F
Regression            $SS_{reg}$       $k-1$                $MS_{reg} = SS_{reg}/(k-1)$       $F = MS_{reg}/MS_{res}$
Error                 $SS_{res}$       $n-k$                $MS_{res} = SS_{res}/(n-k)$
Total                 $SS_T$           $n-1$
Rejection of $H_0$ indicates that it is likely that at least one $\beta_i \neq 0$ $(i = 2, 3, \ldots, k)$.
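As an illustrative sketch of the decomposition and the global $F$ test above (the data, seed, and variable names here are synthetic and hypothetical, not from the text), the computation can be carried out with NumPy:

```python
import numpy as np

# Synthetic data: n observations, k parameters (including the intercept).
rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# OLS estimate b = (X'X)^{-1} X'y and fitted values.
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

# Sums of squares: SS_T = SS_reg + SS_res (holds because of the intercept).
SS_T = np.sum((y - y.mean()) ** 2)
SS_res = np.sum((y - y_hat) ** 2)
SS_reg = np.sum((y_hat - y.mean()) ** 2)

# Mean squares and the global F statistic with (k-1, n-k) degrees of freedom.
MS_reg = SS_reg / (k - 1)
MS_res = SS_res / (n - k)
F = MS_reg / MS_res
```

The decision rule then compares `F` against a tabulated $F_{\alpha}(k-1, n-k)$ critical value.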
Test of hypothesis on individual regression coefficients

If the null hypothesis in the analysis of variance test is rejected, a further question arises: which of the regression coefficients are responsible for the rejection? The explanatory variables corresponding to such regression coefficients are important for the model.
Adding such explanatory variables also increases the variance of the fitted values $\hat{y}$, so one needs to be careful to add only those regressors that are of real value in explaining the response. Adding unimportant explanatory variables may increase the residual mean square, which may decrease the usefulness of the model.
The test of the null hypothesis
$$H_0: \beta_j = 0$$
versus the alternative hypothesis
$$H_1: \beta_j \neq 0$$
has already been discussed in the case of the simple linear regression model. In the present case, if $H_0$ is accepted, it implies that the explanatory variable $X_j$ can be deleted from the model. The corresponding test statistic is
$$t = \frac{b_j}{se(b_j)} \sim t_{n-k} \text{ under } H_0,$$
where the standard error of the OLSE $b_j$ of $\beta_j$ is
$$se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}},$$
where $C_{jj}$ denotes the $j$th diagonal element of $(X'X)^{-1}$ corresponding to $b_j$.
The decision rule is to reject $H_0$ at level of significance $\alpha$ if
$$|t| > t_{\frac{\alpha}{2},\, n-k}.$$
Note that this is only a partial or marginal test because $b_j$ depends on all the other explanatory variables $X_i$ $(i \neq j)$ that are in the model. This is a test of the contribution of $X_j$ given the other explanatory variables in the model.
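A minimal sketch of this marginal $t$-test on synthetic data (all names and values are illustrative assumptions, not from the text):

```python
import numpy as np

# Synthetic data with k = 3 parameters; the third coefficient is truly zero.
rng = np.random.default_rng(1)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)          # C = (X'X)^{-1}
b = XtX_inv @ X.T @ y
resid = y - X @ b
sigma2_hat = resid @ resid / (n - k)      # unbiased estimate SS_res/(n-k)

# Test H0: beta_j = 0 for j = 1 using t = b_j / se(b_j),
# where se(b_j) = sigma_hat * sqrt(C_jj).
j = 1
se_bj = np.sqrt(sigma2_hat * XtX_inv[j, j])
t_stat = b[j] / se_bj
```

The statistic `t_stat` would then be compared against the tabulated $t_{\alpha/2,\, n-k}$ value.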
Confidence interval estimation

The confidence intervals in a multiple regression model can be constructed for individual regression coefficients as well as jointly. We consider both of them as follows:
Confidence interval on the individual regression coefficient:
Assuming that the $\varepsilon_i$'s are identically and independently distributed following $N(0, \sigma^2)$ in $y = X\beta + \varepsilon$, we have
$$y \sim N(X\beta, \sigma^2 I), \qquad b \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right).$$
Thus the marginal distribution of any regression coefficient estimate is
$$b_j \sim N(\beta_j, \sigma^2 C_{jj}),$$
where $C_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. Thus
$$t_j = \frac{b_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \sim t_{n-k}, \quad j = 1, 2, \ldots, k,$$
where
$$\hat{\sigma}^2 = \frac{SS_{res}}{n-k} = \frac{y'y - b'X'y}{n-k}.$$
So the $100(1-\alpha)\%$ confidence interval for $\beta_j$ $(j = 1, 2, \ldots, k)$ is obtained as follows:
$$P\left(-t_{\frac{\alpha}{2},\, n-k} \leq \frac{b_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} \leq t_{\frac{\alpha}{2},\, n-k}\right) = 1 - \alpha$$
$$P\left(b_j - t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2 C_{jj}} \leq \beta_j \leq b_j + t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2 C_{jj}}\right) = 1 - \alpha.$$
Thus the confidence interval is
$$\left(b_j - t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2 C_{jj}},\;\; b_j + t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2 C_{jj}}\right).$$
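A sketch of this interval computation on synthetic data (names are illustrative; the $t$ critical value is read from a table rather than computed, to keep the example dependency-free):

```python
import numpy as np

# Synthetic data; 95% confidence interval for a single coefficient beta_j.
rng = np.random.default_rng(2)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -0.7]) + rng.normal(size=n)

C = np.linalg.inv(X.T @ X)
b = C @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)

# Interval: b_j +/- t_{alpha/2, n-k} * sqrt(sigma2_hat * C_jj).
j = 2
t_crit = 2.074                            # t_{0.025, 22} from a t table
half_width = t_crit * np.sqrt(sigma2_hat * C[j, j])
ci = (b[j] - half_width, b[j] + half_width)
```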
Simultaneous confidence intervals on regression coefficients:

A set of confidence intervals that are true simultaneously with probability $(1-\alpha)$ are called simultaneous or joint confidence intervals. It is relatively easy to define a joint confidence region for $\beta$ in the multiple regression model. Since
$$\frac{(b-\beta)'\, X'X\, (b-\beta)}{k\, MS_{res}} \sim F_{k,\, n-k},$$
$$P\left[\frac{(b-\beta)'\, X'X\, (b-\beta)}{k\, MS_{res}} \leq F_{\alpha}(k, n-k)\right] = 1 - \alpha.$$
So a $100(1-\alpha)\%$ joint confidence region for all of the parameters in $\beta$ is
$$\frac{(b-\beta)'\, X'X\, (b-\beta)}{k\, MS_{res}} \leq F_{\alpha}(k, n-k),$$
which describes an elliptically shaped region.
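The ellipsoidal region can be used directly as a membership test. A hedged sketch on synthetic data (the $F$ critical value is taken from a table, and all names are hypothetical):

```python
import numpy as np

# Synthetic data; membership test for the joint confidence ellipsoid
# (b - beta)' X'X (b - beta) / (k * MS_res) <= F_alpha(k, n-k).
rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
MS_res = np.sum((y - X @ b) ** 2) / (n - k)
F_crit = 2.96                             # F_{0.05}(3, 27) from an F table

def in_region(beta):
    """True if beta lies inside the 95% joint confidence region."""
    d = b - beta
    return (d @ X.T @ X @ d) / (k * MS_res) <= F_crit

# The centre of the ellipsoid, b itself, is always inside.
inside = in_region(b)
```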
Coefficient of determination $(R^2)$ and adjusted $R^2$
Let $R$ be the multiple correlation coefficient between $y$ and $X_2, X_3, \ldots, X_k$. Then the square of the multiple correlation coefficient $(R^2)$ is called the coefficient of determination. The value of $R^2$ commonly describes how well the sample regression line fits the observed data. This is also treated as a measure of goodness of fit of the model.
Assuming that the intercept term is present in the model as
$$y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + u_i, \quad i = 1, 2, \ldots, n,$$
then
$$R^2 = 1 - \frac{e'e}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SS_{res}}{SS_T} = \frac{SS_{reg}}{SS_T},$$
where
$SS_{res}$: sum of squares due to residuals,
$SS_T$: total sum of squares,
$SS_{reg}$: sum of squares due to regression.

$R^2$ measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It reflects the model adequacy in the sense of the explanatory power of the explanatory variables. Since
$$e'e = y'\left[I - X(X'X)^{-1}X'\right]y = y'Hy$$
and
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2,$$
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n} l'y$ with $l = (1, 1, \ldots, 1)'$ and $y = (y_1, y_2, \ldots, y_n)'$, thus
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = y'y - n\bar{y}^2 = y'y - \frac{1}{n}\, y'll'y = y'\left(I - \frac{1}{n} ll'\right)y = y'Ay,$$
where $A = I - \frac{1}{n} ll'$. So
$$R^2 = 1 - \frac{y'Hy}{y'Ay}.$$
The limits of $R^2$ are 0 and 1, i.e., $0 \leq R^2 \leq 1$. $R^2 = 0$ indicates the poorest fit of the model, and $R^2 = 1$ indicates the best fit. For example, $R^2 = 0.95$ indicates that 95% of the variation in $y$ is explained by the explanatory variables; in simple words, the model is 95% good. Similarly, any other value of $R^2$ between 0 and 1 indicates the adequacy of the fitted model.
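As a sketch, the matrix form $R^2 = 1 - y'Hy/y'Ay$ can be cross-checked numerically against the sum-of-squares form on synthetic data (all names and values are hypothetical):

```python
import numpy as np

# Synthetic data; R^2 via the matrix form 1 - y'Hy / y'Ay, cross-checked
# against the sum-of-squares form 1 - SS_res / SS_T.
rng = np.random.default_rng(4)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

H = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # I - X(X'X)^{-1}X'
A = np.eye(n) - np.ones((n, n)) / n                # I - ll'/n (centring matrix)
R2_matrix = 1 - (y @ H @ y) / (y @ A @ y)

e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)      # residuals
R2_ss = 1 - (e @ e) / np.sum((y - y.mean()) ** 2)
```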
Adjusted $R^2$

If more explanatory variables are added to the model, then $R^2$ increases. If the added variables are irrelevant, $R^2$ will still increase and give an overly optimistic picture.
To correct this overly optimistic picture, the adjusted $R^2$, denoted as $\bar{R}^2$ or $R^2_{adj}$, is used. It is defined as
$$\bar{R}^2 = 1 - \frac{SS_{res}/(n-k)}{SS_T/(n-1)} = 1 - \frac{n-1}{n-k}\left(1 - R^2\right).$$
We will see later that $(n-k)$ and $(n-1)$ are the degrees of freedom associated with the distributions of $SS_{res}$ and $SS_T$. Moreover, the quantities $\frac{SS_{res}}{n-k}$ and $\frac{SS_T}{n-1}$ are based on the unbiased estimators of the respective variances of $e$ and $y$ in the context of analysis of variance.
The adjusted $R^2$ will decline if the addition of an extra variable produces too small a reduction in $(1 - R^2)$ to compensate for the increase in $\frac{n-1}{n-k}$.
Another limitation of the adjusted $R^2$ is that it can also be negative. For example, if $k = 3$, $n = 10$, $R^2 = 0.16$, then
$$\bar{R}^2 = 1 - \frac{9}{7}\left(1 - 0.16\right) = 1 - \frac{9}{7}(0.84) = -0.08,$$
which has no interpretation.
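The arithmetic of the negative-value example can be checked with a one-line helper (a sketch; the function name is illustrative):

```python
# Adjusted R^2 from the formula above, reproducing the negative-value
# example with k = 3, n = 10, R^2 = 0.16.
def adjusted_r2(r2, n, k):
    """R_bar^2 = 1 - (n-1)/(n-k) * (1 - R^2)."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

r2_adj = adjusted_r2(0.16, n=10, k=3)   # 1 - (9/7)(0.84) = -0.08
```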
Limitations

1. If the constant term is absent in the model, then $R^2$ cannot be defined. In such cases, $R^2$ can be negative. Some ad hoc measures based on $R^2$ for a regression line through the origin have been proposed in the literature.
Reason why $R^2$ is valid only in linear models with an intercept term: In the model $y = X\beta + \varepsilon$, the ordinary least squares estimator of $\beta$ is $b = (X'X)^{-1}X'y$. Consider the fitted model
$$y = Xb + (y - Xb) = Xb + e,$$
where $e$ is the residual. Note that
$$y - l\bar{y} = Xb + e - l\bar{y} = \hat{y} - l\bar{y} + e,$$
where $\hat{y} = Xb$ is the fitted value and $l = (1, 1, \ldots, 1)'$ is an $n \times 1$ vector of elements unity. The total sum of squares $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is then obtained as
$$TSS = (y - l\bar{y})'(y - l\bar{y}) = \left[(\hat{y} - l\bar{y}) + e\right]'\left[(\hat{y} - l\bar{y}) + e\right]$$
$$= (\hat{y} - l\bar{y})'(\hat{y} - l\bar{y}) + e'e + 2(\hat{y} - l\bar{y})'e$$
$$= SS_{reg} + SS_{res} + 2(Xb - l\bar{y})'e \quad (\text{because } \hat{y} = Xb)$$
$$= SS_{reg} + SS_{res} - 2\bar{y}\, l'e \quad (\text{because } X'e = 0).$$
The Fisher–Cochran theorem requires $TSS = SS_{reg} + SS_{res}$ to hold true in the context of analysis of variance and further to define $R^2$. In order that $TSS = SS_{reg} + SS_{res}$ holds true, we need $l'e$ to be zero, i.e.,
$$l'e = l'(y - \hat{y}) = 0,$$
which is possible only when there is an intercept term in the model. We show this claim as follows:
First, we consider a no-intercept simple linear regression model $y_i = \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$, where the parameter $\beta_1$ is estimated as
$$b_1^* = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
Then
$$l'e = \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} (y_i - b_1^* x_i) \neq 0 \text{ in general}.$$
Similarly, in a no-intercept multiple linear regression model $y = X\beta + \varepsilon$, we find that
$$l'e = l'(y - \hat{y}) = l'(X\beta + \varepsilon - Xb) = l'X(\beta - b) + l'\varepsilon \neq 0 \text{ in general}.$$

Next, we consider a simple linear regression model with intercept term $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ $(i = 1, 2, \ldots, n)$,
where the parameters $\beta_0$ and $\beta_1$ are estimated as $b_0 = \bar{y} - b_1\bar{x}$ and $b_1 = \frac{s_{xy}}{s_{xx}}$ respectively, where
$$s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad s_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
We find that
$$l'e = \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = \sum_{i=1}^{n} (y_i - \bar{y} + b_1\bar{x} - b_1 x_i)$$
$$= \sum_{i=1}^{n} \left[(y_i - \bar{y}) - b_1(x_i - \bar{x})\right] = \sum_{i=1}^{n} (y_i - \bar{y}) - b_1 \sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
In a multiple linear regression model with an intercept term, $y = \beta_0 l + X\beta + \varepsilon$, the parameters $\beta_0$ and $\beta$ are estimated as $\hat{\beta}_0 = \bar{y} - b'\bar{x}$ and $b = (X'X)^{-1}X'y$, respectively, where $\bar{x}$ is the vector of means of the columns of $X$. We find that
$$l'e = l'(y - \hat{y}) = l'(y - \hat{\beta}_0 l - Xb) = l'(y - \bar{y}l) - \left(l'Xb - n\, b'\bar{x}\right) = 0,$$
since $l'(y - \bar{y}l) = 0$ and $l'X = n\bar{x}'$. Thus we conclude that for the Fisher–Cochran theorem to hold true, in the sense that the total sum of squares can be divided into two orthogonal components, viz., the sum of squares due to regression and the sum of squares due to errors, it is necessary that $l'e = l'(y - \hat{y}) = 0$ holds, which is possible only when the intercept term is present in the model.
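The claim above can be checked numerically. A sketch on synthetic data (seed and names are hypothetical): the residuals sum to zero when the model has an intercept, but generally not otherwise.

```python
import numpy as np

# Synthetic check: l'e = 0 with an intercept; l'e != 0 without one.
rng = np.random.default_rng(5)
n = 15
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# With intercept: regress y on [1, x].
Xi = np.column_stack([np.ones(n), x])
bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ y)
sum_with = np.sum(y - Xi @ bi)            # l'e, numerically ~ 0

# Without intercept: b* = sum(x_i y_i) / sum(x_i^2).
b_star = (x @ y) / (x @ x)
sum_without = np.sum(y - b_star * x)      # l'e, nonzero in general
```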
2. $R^2$ is sensitive to extreme values, so $R^2$ lacks robustness.
3. $R^2$ always increases with an increase in the number of explanatory variables in the model. The main drawback of this property is that even when irrelevant explanatory variables are added to the model, $R^2$ still increases. This indicates that the model is getting better, which is not really correct.

4. Consider a situation where we have the following two models:
$$y_i = \beta_1 + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + u_i, \quad i = 1, 2, \ldots, n,$$
$$\log y_i = \beta_1 + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + v_i.$$
The question now is which model is better. For the first model,
$$R_1^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
and for the second model, an option is to define $R^2$ as
$$R_2^2 = 1 - \frac{\sum_{i=1}^{n} \left(\log y_i - \widehat{\log y_i}\right)^2}{\sum_{i=1}^{n} \left(\log y_i - \overline{\log y}\right)^2}.$$
As such, $R_1^2$ and $R_2^2$ are not comparable. If the two models still need to be compared, a better proposition to define $R^2$ can be as follows:
$$R_3^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \operatorname{antilog}\, \hat{y}_i^{*}\right)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
where $\hat{y}_i^{*} = \widehat{\log y_i}$. Now $R_1^2$ and $R_3^2$ on comparison may give an idea about the adequacy of the two models.
Relationship of analysis of variance test and coefficient of determination
Assuming $\beta_1$ to be an intercept term, then for $H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0$, the $F$ statistic in the analysis of variance test is
$$F = \frac{MS_{reg}}{MS_{res}} = \frac{(n-k)\, SS_{reg}}{(k-1)\, SS_{res}} = \frac{n-k}{k-1} \cdot \frac{SS_{reg}}{SS_T - SS_{reg}} = \frac{n-k}{k-1} \cdot \frac{SS_{reg}/SS_T}{1 - SS_{reg}/SS_T} = \frac{n-k}{k-1} \cdot \frac{R^2}{1 - R^2},$$
where $R^2$ is the coefficient of determination. So $F$ and $R^2$ are closely related. When $R^2 = 0$, then $F = 0$. In the limit, when $R^2 \to 1$, $F \to \infty$. So both $F$ and $R^2$ vary directly. A larger $R^2$ implies a greater $F$ value. That is why the $F$ test under the analysis of variance is termed as the measure of the overall significance of the estimated regression. It is also a test of significance of $R^2$. If $F$ is highly significant, it implies that we can reject $H_0$, i.e., $y$ is linearly related to the $X$'s.
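The identity $F = \frac{n-k}{k-1}\cdot\frac{R^2}{1-R^2}$ can be verified numerically. A sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

# Synthetic data; verify F = (n-k)/(k-1) * R^2/(1 - R^2) against the
# direct mean-square ratio.
rng = np.random.default_rng(6)
n, k = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, -0.6, 0.3]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
SS_res = np.sum((y - X @ b) ** 2)
SS_T = np.sum((y - y.mean()) ** 2)
SS_reg = SS_T - SS_res
R2 = SS_reg / SS_T

F_direct = (SS_reg / (k - 1)) / (SS_res / (n - k))
F_from_R2 = (n - k) / (k - 1) * R2 / (1 - R2)
```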
Prediction of values of study variable

The prediction in the multiple regression model has two aspects:
1. Prediction of the average value of the study variable or mean response.
2. Prediction of the actual value of the study variable.

1. Prediction of average value of $y$

We need to predict $E(y)$ at a given $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The predictor as a point estimate is
$$p = x_0' b = x_0' (X'X)^{-1} X' y,$$
$$E(p) = x_0' \beta.$$
So $p$ is an unbiased predictor for $E(y)$. Its variance is
$$Var(p) = E\left[\left(p - E(y)\right)'\left(p - E(y)\right)\right] = \sigma^2\, x_0' (X'X)^{-1} x_0.$$
Then
$$E(\hat{y}_0) = x_0' \beta = E(y \mid x_0),$$
$$Var(\hat{y}_0) = \sigma^2\, x_0' (X'X)^{-1} x_0.$$
The confidence interval on the mean response at a particular point, such as $x_{01}, x_{02}, \ldots, x_{0k}$, can be found as follows. Define $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The fitted value at $x_0$ is $\hat{y}_0 = x_0' b$. Then
$$P\left(-t_{\frac{\alpha}{2},\, n-k} \leq \frac{\hat{y}_0 - E(y \mid x_0)}{\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}} \leq t_{\frac{\alpha}{2},\, n-k}\right) = 1 - \alpha$$
$$P\left(\hat{y}_0 - t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0} \leq E(y \mid x_0) \leq \hat{y}_0 + t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}\right) = 1 - \alpha.$$
The $100(1-\alpha)\%$ confidence interval on the mean response at the point $x_{01}, x_{02}, \ldots, x_{0k}$, i.e., $E(y \mid x_0)$, is
$$\left(\hat{y}_0 - t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0},\;\; \hat{y}_0 + t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}\right).$$
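A sketch of the mean-response interval on synthetic data (the prediction point and the tabulated $t$ value are illustrative assumptions):

```python
import numpy as np

# Synthetic data; confidence interval for the mean response E(y|x0).
# The t critical value t_{0.025, 27} is read from a table (approx. 2.052).
rng = np.random.default_rng(7)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)

x0 = np.array([1.0, 0.5, -0.2])           # prediction point (leading 1 = intercept)
y0_hat = x0 @ b
t_crit = 2.052
hw = t_crit * np.sqrt(sigma2_hat * (x0 @ XtX_inv @ x0))
ci_mean = (y0_hat - hw, y0_hat + hw)
```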
2. Prediction of actual value of $y$

We need to predict $y$ at a given $x_0 = (x_{01}, x_{02}, \ldots, x_{0k})'$. The predictor as a point estimate is
$$p_f = x_0' b,$$
$$E(p_f) = x_0' \beta.$$
So $p_f$ is an unbiased predictor for $y$. Its prediction variance is
$$Var(p_f) = E\left[(p_f - y)(p_f - y)'\right] = \sigma^2\left[1 + x_0'(X'X)^{-1}x_0\right].$$
The $100(1-\alpha)\%$ confidence interval for this future observation is
$$\left(p_f - t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\left[1 + x_0'(X'X)^{-1}x_0\right]},\;\; p_f + t_{\frac{\alpha}{2},\, n-k}\sqrt{\hat{\sigma}^2\left[1 + x_0'(X'X)^{-1}x_0\right]}\right).$$
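A sketch of the interval for a future observation (synthetic data, tabulated $t$ value assumed); the extra "$1 +$" term makes this interval wider than the corresponding mean-response interval:

```python
import numpy as np

# Synthetic data; prediction interval for a future observation at x0.
rng = np.random.default_rng(8)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - k)

x0 = np.array([1.0, 0.5, -0.2])
p_f = x0 @ b
t_crit = 2.052                            # t_{0.025, 27} from a t table
hw_mean = t_crit * np.sqrt(sigma2_hat * (x0 @ XtX_inv @ x0))
hw_pred = t_crit * np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))
pred_interval = (p_f - hw_pred, p_f + hw_pred)
```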