STATISTICAL ESTIMATION IN MULTIVARIATE NORMAL DISTRIBUTION

WITH A BLOCK OF MISSING OBSERVATIONS

by

YI LIU

Presented to the Faculty of the Graduate School of

The University of Texas at Arlington in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT ARLINGTON

DECEMBER 2017

Copyright © by YI LIU 2017

All Rights Reserved


To Alex and My Parents

Acknowledgements

I would like to thank my supervisor, Dr. Chien-Pai Han, for instructing, guiding and supporting me over the years. You have set an example of excellence as a researcher, mentor, instructor and role model.

I would like to thank Dr. Shan Sun-Mitchell for her continuous encouragement and instruction. You are both a good teacher and a helpful friend. I would like to thank my thesis committee members Dr. Suvra Pal and Dr. Jonghyun Yun for their invaluable discussion, ideas and feedback.

I would like to thank the graduate advisor, Dr. Hristo Kojouharov, for his instruction, help and patience. I would like to thank the chairman Dr. Jianzhong Su, Dr. Minerva Cordero-Epperson, Lona Donnelly, Libby Carroll and the other staff for their help.

I would like to thank my manager, Robert Schermerhorn, for his understanding, encouragement and support, which made this happen. I would like to thank my employer Sabre and my coworkers for their support over the past two years.

I would like to thank my husband Alex for his encouragement and support over all these years. In particular, I would like to thank my parents -- without the inspiration, drive and support that you have given me, I might not be the person I am today.

October 11, 2017

Abstract

STATISTICAL ESTIMATION IN MULTIVARIATE NORMAL DISTRIBUTION

WITH A BLOCK OF MISSING OBSERVATIONS

YI LIU, Ph.D.

The University of Texas at Arlington, 2017

Supervising Professor: Chien-Pai Han

Missing observations occur quite often in data analysis. We study a random sample from a multivariate normal distribution with a block of missing observations, where the observations are missing not at random. We use the maximum likelihood method to obtain the estimators from such a sample. The properties of the estimators are derived. The prediction problem is considered when the response variable has missing values. The variances of the mean estimators of the response variable with and without the extra information are compared. We prove that the variance of the estimated mean of the response variable using all the data is smaller than the variance obtained when the extra information is ignored, provided the correlation between the response variable and the predictors meets some conditions.

We derive three kinds of prediction intervals for a future observation. An example of college admission data is used to obtain the estimators for the bivariate and multivariate situations.

Table of Contents

Acknowledgements
Abstract
List of Figures
List of Tables

Chapter 1 INTRODUCTION

Chapter 2 STATISTICAL ESTIMATION IN BIVARIATE NORMAL DISTRIBUTION WITH A BLOCK OF MISSING OBSERVATIONS
  2.1 Maximum Likelihood Estimators
  2.2 Properties of the Maximum Likelihood Estimators
    2.2.1 Estimator of the Mean of X
    2.2.2 Estimator of the Variance of X
    2.2.3 Estimator of the Regression Coefficient β
      2.2.3.1 The Density Function of β̂
      2.2.3.2 Plot of the Density Function of β̂
    2.2.4 Estimator of the Mean of Y
      2.2.4.1 Comparison of the Variances
      2.2.4.2 Simulation on Variances
    2.2.5 Estimator of the Conditional Variance of Y given x
  2.3 Fisher Information
  2.4 Prediction
    2.4.1 Usual prediction interval
    2.4.2 Prediction interval
    2.4.3 Unconditional prediction interval
  2.5 An Example for the Bivariate Situation

Chapter 3 STATISTICAL ESTIMATION IN MULTIPLE REGRESSION MODEL WITH A BLOCK OF MISSING OBSERVATIONS
  3.1 Maximum Likelihood Estimators
  3.2 Properties of the Maximum Likelihood Estimators
    3.2.1 Estimator of the Mean Vector of X
    3.2.2 Estimator of the Covariance Matrix of X
    3.2.3 Estimator of the Regression Coefficient Vector β
    3.2.4 Estimator of the Mean of Y
    3.2.5 Estimator of the Conditional Variance of Y given x
  3.3 Prediction
    3.3.1 Usual prediction interval
    3.3.2 Prediction interval
    3.3.3 Unconditional prediction interval
  3.4 An example for multiple regression model

Chapter 4 STATISTICAL ESTIMATION IN MULTIVARIATE REGRESSION MODEL WITH A BLOCK OF MISSING OBSERVATIONS
  4.1 Maximum Likelihood Estimators
  4.2 Properties of the Maximum Likelihood Estimators
    4.2.1 Estimator of the Mean Vector of X
    4.2.2 Estimator of the Covariance Matrix of X
    4.2.3 Estimator of the Regression Coefficient Matrix
    4.2.4 Estimator of the Mean Vector of Y
    4.2.5 Estimator of the Conditional Covariance Matrix of Y given x
  4.3 Prediction
    4.3.1 Usual prediction interval
    4.3.2 Prediction interval
    4.3.3 Unconditional prediction interval

Appendix A Statistical Estimation in Bivariate Normal Distribution without Missing Observations
  A.1 Maximum Likelihood Estimators
  A.2 The Prediction Interval
    A.2.1 Usual prediction interval
    A.2.2 Prediction interval
    A.2.3 Unconditional prediction interval

Appendix B Fisher Information Matrix for Bivariate Normal Distribution

Appendix C Some derivations used in the dissertation

Appendix D R codes
  D.1 R Code for the Estimators
  D.2 R Code of Simulation for Variance Comparison

References

Biographical Information

List of Figures

Figure 2-1 Density plots for β̂ when m = 50, 70 and 90, respectively
Figure 2-2 Comparison of Variance with and without Extra Information

List of Tables

Table 1-1 College Admission Data
Table 2-1 Variance Ratio for Sample Size = 100
Table 2-2 Variance Ratio for Sample Size = 200
Table 2-3 Simulation Results on Variance
Table 2-4 Estimators with All Information
Table 2-5 Estimators without Extra Information

Chapter 1

INTRODUCTION

Missing observations occur quite often in data analysis. The data may be missing on some variables for some observations. Generally, there are three kinds of missingness: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). There is a great deal of research on how to minimize bias and obtain good estimates in the presence of missing data. Allison (2002) discussed the strengths and weaknesses of the conventional and novel methods for dealing with missing data. The conventional methods include listwise deletion, pairwise deletion, dummy variable adjustment, and imputation such as replacement with the mean, regression imputation and hot deck imputation. The novel methods include maximum likelihood and multiple imputation. Other relevant papers include Anderson (1957), Rubin (1976), Chung and Han (2000), Howell (2008), Little and Zhang (2011), Han and Li (2011), Sinsomboonthong (2011), and the book by Little and Rubin (2002). Though imputation is the most popular technique for handling missing data, it is not appropriate in some cases. For example, in the college admission data from Han and Li (2011), the TOEFL score is naturally missing for US students, and imputation does not make sense in this case.

For a random sample from a multivariate normal distribution with a block of observations missing not at random, Chung and Han (2000) and Han and Li (2011) considered this situation in discriminant analysis and the regression model, respectively. Anderson (1957) considered a sample from a bivariate normal distribution with missing observations. The maximum likelihood method is used to give the estimators, but the paper did not study the properties of the estimators. Allison (2002) discussed the maximum likelihood method when the missingness is ignorable.

We use the maximum likelihood method to obtain estimators from a multivariate normal sample with a block of observations missing not at random. We consider all the available information: we do not delete any data and we do not impute.

We have the following random sample with a block of missing $\mathbf{Y}$ values:

$$
\begin{array}{cccc|cccc}
X_{1,1} & X_{1,2} & \cdots & X_{1,p} & Y_{1,1} & Y_{1,2} & \cdots & Y_{1,q} \\
X_{2,1} & X_{2,2} & \cdots & X_{2,p} & Y_{2,1} & Y_{2,2} & \cdots & Y_{2,q} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
X_{m,1} & X_{m,2} & \cdots & X_{m,p} & Y_{m,1} & Y_{m,2} & \cdots & Y_{m,q} \\
X_{m+1,1} & X_{m+1,2} & \cdots & X_{m+1,p} & & & & \\
\vdots & \vdots & & \vdots & & & & \\
X_{n,1} & X_{n,2} & \cdots & X_{n,p} & & & &
\end{array}
$$

The multivariate normal probability density function (pdf) can be written as

$$ f(\mathbf{x}, \mathbf{y}) = g(\mathbf{y}|\mathbf{x})\, h(\mathbf{x}) \tag{1-1} $$

where $\begin{bmatrix}\mathbf{X}\\ \mathbf{Y}\end{bmatrix}$ has a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$,

$$ \boldsymbol{\mu} = \begin{bmatrix}\boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y\end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix}\boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy}\\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy}\end{bmatrix} \tag{1-2} $$

$\mathbf{X}$ is a $p \times 1$ vector and $\mathbf{Y}$ is a $q \times 1$ vector; $g(\mathbf{y}|\mathbf{x})$ is the conditional pdf of $\mathbf{Y}$ given $\mathbf{X} = \mathbf{x}$, and $h(\mathbf{x})$ is the marginal pdf of $\mathbf{X}$.

The joint likelihood function is

$$ L(\boldsymbol{\mu}_y, \boldsymbol{\mu}_x, \boldsymbol{\beta}, \boldsymbol{\Sigma}_{xx}, \boldsymbol{\Sigma}_e) = \prod_{j=1}^{m} g_{Y|X}(\mathbf{y}_j|\mathbf{x}_j;\, \boldsymbol{\mu}_y, \boldsymbol{\mu}_x, \boldsymbol{\beta}, \boldsymbol{\Sigma}_e) \prod_{i=1}^{n} h_X(\mathbf{x}_i;\, \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx}) \tag{1-3} $$

The missing data occur in $\mathbf{Y}$ in our model. We first consider a sample from a bivariate normal distribution (Chapter 2), i.e., $p = q = 1$. The maximum likelihood estimators are derived. We prove that they are consistent and asymptotically efficient. Their distributions are obtained. We show that, under a condition on the correlation, the unconditional variance of $\hat{\mu}_y$ is smaller when all the data are considered than when the extra information is ignored. We also consider the prediction problem with missing observations. Three kinds of prediction intervals for a future observation $(X_0, Y_0)$ are derived. They are

1) Usual prediction interval for $Y_0$ -- conditioning on $X = x$ and $X_0 = x_0$

2) Prediction interval for $Y_0$ -- unconditional on $X$, but conditioning on $X_0 = x_0$

3) Unconditional prediction interval for $Y_0$

Then we extend to the multiple regression model (Chapter 3), i.e., $\mathbf{X}$ is a $p \times 1$ vector. Again, we study the properties of the maximum likelihood estimators and derive the three kinds of prediction intervals. Finally, we extend to the multivariate regression model (Chapter 4), i.e., $\mathbf{X}$ is a $p \times 1$ vector and $\mathbf{Y}$ is a $q \times 1$ vector. We obtain the maximum likelihood estimators, study their properties and derive the three kinds of prediction intervals for each response variable, which follows the multiple regression model.

The comparison of the unconditional variance of $\hat{\mu}_y$ with and without the extra information, and the density plots for $\hat{\beta}$, are simulated in R Studio. An example of the college admission data from Han and Li (2011) (Table 1-1) is used to obtain the estimators for the bivariate and multivariate situations. In this data set, TOEFL scores are required for students whose native language is not English, but missing for students whose native language is English, such as US students. The missing values should not be imputed, as these values do not exist.


Table 1-1 College Admission Data

Obs  GRE Verbal  GRE Quant  GRE Analytic  TOEFL    Obs  GRE Verbal  GRE Quant  GRE Analytic  TOEFL
  1     420        800         600         497      21     250        730         460         513
  2     330        710         380         563      22     320        760         610         560
  3     270        700         340         510      23     360        720         525         540
  4     400        710         600         563      24     370        780         500         500
  5     280        800         450         543      25     300        630         380         507
  6     310        660         425         507      26     390        580         370         587
  7     360        620         590         537      27     380        770         500         520
  8     220        530         340         543      28     370        640         200         520
  9     350        770         560         580      29     340        800         540         517
 10     360        750         440         577      30     460        750         560         597
 11     440        700         630          NA      31     630        540         600          NA
 12     640        520         610          NA      32     350        690         620          NA
 13     480        550         560          NA      33     480        610         480          NA
 14     550        630         630          NA      34     630        410         530          NA
 15     450        660         630          NA      35     550        450         500          NA
 16     410        410         340          NA      36     510        690         730          NA
 17     460        610         560          NA      37     640        720         520          NA
 18     580        580         610          NA      38     440        580         620          NA
 19     450        540         570          NA      39     350        430         480          NA
 20     420        630         660          NA      40     480        700         670          NA


Chapter 2

STATISTICAL ESTIMATION IN BIVARIATE NORMAL DISTRIBUTION

WITH A BLOCK OF MISSING OBSERVATIONS

Let $\begin{bmatrix} X \\ Y \end{bmatrix}$ have a bivariate normal distribution with mean vector $\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}$ and covariance matrix $\begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{yx} & \sigma_y^2 \end{bmatrix}$.

Suppose the following random sample with a block of missing $Y$ values is obtained:

$$
\begin{array}{cc}
X_1 & Y_1 \\
X_2 & Y_2 \\
\vdots & \vdots \\
X_m & Y_m \\
X_{m+1} & \\
\vdots & \\
X_n &
\end{array}
$$

Based on the data, we would like to estimate the parameters. We can write the bivariate normal probability density function (pdf) as

$$ f(x, y) = g(y|x)\, h(x) \tag{2-1} $$

where $g(y|x)$ is the conditional pdf of $Y$ given $X = x$, and $h(x)$ is the marginal pdf of $X$.

$$ g_{Y|X}(y_j|x_j;\, \mu_y, \mu_x, \beta, \sigma_e^2) = \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left\{-\frac{1}{2\sigma_e^2}\left[y_j - E(y_j|x_j)\right]^2\right\} = \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left[-\frac{1}{2\sigma_e^2}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)^2\right] \tag{2-2} $$

$$ h_X(x_i;\, \mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left[-\frac{1}{2\sigma_x^2}(x_i - \mu_x)^2\right] \tag{2-3} $$

where $j = 1, 2, \ldots, m$; $i = 1, 2, \ldots, n$;

$$ E(y_j|x_j) = \mu_y + \frac{\sigma_{yx}}{\sigma_x^2}(x_j - \mu_x) = \mu_y + \beta(x_j - \mu_x) \tag{2-4} $$

$$ \beta = \frac{\sigma_{yx}}{\sigma_x^2} \tag{2-5} $$

$$ \sigma_e^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2} = \sigma_y^2 - \beta^2\sigma_x^2 \tag{2-6} $$

The joint likelihood function is

$$ L(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = \prod_{j=1}^{m} g_{Y|X}(y_j|x_j;\, \mu_y, \mu_x, \beta, \sigma_e^2) \prod_{i=1}^{n} h_X(x_i;\, \mu_x, \sigma_x^2) \tag{2-7} $$

$$ L(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = (2\pi)^{-\frac{n+m}{2}} \sigma_e^{-m} \sigma_x^{-n} \exp\left\{-\frac{1}{2\sigma_e^2}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)^2 - \frac{1}{2\sigma_x^2}\sum_{i=1}^{n}(x_i - \mu_x)^2\right\} $$

The log of the joint likelihood function is more convenient to use and there is no loss of information in using it; maximizing the likelihood is equivalent to maximizing the log likelihood because the log is a monotone increasing function. Therefore, we take the log of the joint likelihood function and denote it as

$$ l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = \ln L(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) $$

$$ l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = -\frac{n+m}{2}\ln(2\pi) - m\ln(\sigma_e) - n\ln(\sigma_x) - \frac{1}{2\sigma_e^2}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)^2 - \frac{1}{2\sigma_x^2}\sum_{i=1}^{n}(x_i - \mu_x)^2 \tag{2-8} $$


2.1 Maximum Likelihood Estimators

We derive the maximum likelihood estimators (MLE) by taking the derivative of the log likelihood function (2-8) with respect to each parameter and setting it to zero. Solving the estimating equations simultaneously, we obtain the MLE.

Taking the derivative of (2-8) with respect to $\mu_y$ results in

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \mu_y} = \frac{1}{\sigma_e^2}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right) $$

Setting

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \mu_y} = 0 $$

we have the estimating equation

$$ \sum_{j=1}^{m}\left[y_j - \mu_y - \beta(x_j - \mu_x)\right] = 0 \tag{2-9} $$

Taking the derivative of (2-8) with respect to $\mu_x$ results in

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \mu_x} = -\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}\left[y_j - \mu_y - \beta(x_j - \mu_x)\right] + \frac{1}{\sigma_x^2}\sum_{i=1}^{n}(x_i - \mu_x) $$

Setting

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \mu_x} = 0 $$

then

$$ -\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}\left[y_j - \mu_y - \beta(x_j - \mu_x)\right] + \frac{1}{\sigma_x^2}\sum_{i=1}^{n}(x_i - \mu_x) = 0 \tag{2-10} $$

Taking the derivative of (2-8) with respect to $\beta$ results in

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \beta} = \frac{1}{\sigma_e^2}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)(x_j - \mu_x) $$

Setting

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \beta} = 0 $$

then

$$ \frac{1}{\sigma_e^2}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)(x_j - \mu_x) = 0 \tag{2-11} $$

Taking the derivative of (2-8) with respect to $\sigma_x^2$ results in

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \sigma_x^2} = -\frac{n}{2\sigma_x^2} + \frac{1}{2\sigma_x^4}\sum_{i=1}^{n}(x_i - \mu_x)^2 $$

Setting

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \sigma_x^2} = 0 $$

then

$$ -\frac{n}{2\sigma_x^2} + \frac{1}{2\sigma_x^4}\sum_{i=1}^{n}(x_i - \mu_x)^2 = 0 \tag{2-12} $$

Taking the derivative of (2-8) with respect to $\sigma_e^2$ results in

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \sigma_e^2} = -\frac{m}{2\sigma_e^2} + \frac{1}{2\sigma_e^4}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)^2 $$

Setting

$$ \frac{\partial l(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2)}{\partial \sigma_e^2} = 0 $$

then

$$ -\frac{m}{2\sigma_e^2} + \frac{1}{2\sigma_e^4}\sum_{j=1}^{m}\left(y_j - \mu_y - \beta(x_j - \mu_x)\right)^2 = 0 \tag{2-13} $$

Solving the estimating equations (2-9) to (2-13) simultaneously, we obtain the following maximum likelihood estimators:

$$ \hat{\mu}_x = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}_n \tag{2-14} $$

$$ \hat{\mu}_y = \bar{Y}_m - \hat{\beta}(\bar{X}_m - \bar{X}_n) \tag{2-15} $$

$$ \hat{\beta} = \frac{\sum_{j=1}^{m}(Y_j - \bar{Y}_m)(X_j - \bar{X}_m)}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2} \tag{2-16} $$

$$ \hat{\sigma}_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \tag{2-17} $$

$$ \hat{\sigma}_e^2 = \frac{1}{m}\sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(X_j - \bar{X}_m)\right]^2 \tag{2-18} $$

where

$$ \bar{Y}_m = \frac{1}{m}\sum_{j=1}^{m} Y_j, \qquad \bar{X}_m = \frac{1}{m}\sum_{j=1}^{m} X_j, \qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i $$
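The estimators (2-14)-(2-18) are closed-form, so they can be computed directly from data. The following is a minimal Python/NumPy sketch (the dissertation's own code, in Appendix D, is written in R); the function name and interface are illustrative assumptions, not from the source.

```python
import numpy as np

def mle_with_missing_block(x, y_observed):
    """MLEs (2-14)-(2-18) for a bivariate normal sample in which
    Y is observed only for the first m of n observations.

    x          : array of length n (fully observed)
    y_observed : array of length m (m <= n), paired with x[:m]
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y_observed, dtype=float)
    n, m = len(x), len(y)

    xbar_m, xbar_n, ybar_m = x[:m].mean(), x.mean(), y.mean()

    # beta-hat (2-16): least-squares slope from the m complete pairs
    sxx_m = np.sum((x[:m] - xbar_m) ** 2)
    beta_hat = np.sum((y - ybar_m) * (x[:m] - xbar_m)) / sxx_m

    mu_x_hat = xbar_n                                  # (2-14)
    mu_y_hat = ybar_m - beta_hat * (xbar_m - xbar_n)   # (2-15)
    var_x_hat = np.mean((x - xbar_n) ** 2)             # (2-17)
    resid = (y - ybar_m) - beta_hat * (x[:m] - xbar_m)
    var_e_hat = np.mean(resid ** 2)                    # (2-18)

    return mu_x_hat, mu_y_hat, beta_hat, var_x_hat, var_e_hat
```

Note that $\hat{\mu}_x$ and $\hat{\sigma}_x^2$ use all $n$ values of $X$, while $\hat{\beta}$ and $\hat{\sigma}_e^2$ use only the $m$ complete pairs.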

Similarly, if we do not consider the extra information $X_{m+1}, X_{m+2}, \ldots, X_n$ and only use the first $m$ observations, we have

$$ \hat{\mu}_{x\_no} = \frac{1}{m}\sum_{j=1}^{m} X_j = \bar{X}_m \tag{2-19} $$

$$ \hat{\mu}_{y\_no} = \bar{Y}_m \tag{2-20} $$

$$ \hat{\beta}_{no} = \frac{\sum_{j=1}^{m}(Y_j - \bar{Y}_m)(X_j - \bar{X}_m)}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2} \tag{2-21} $$

$$ \hat{\sigma}_{x\_no}^2 = \frac{1}{m}\sum_{j=1}^{m}(X_j - \bar{X}_m)^2 \tag{2-22} $$

$$ \hat{\sigma}_{e\_no}^2 = \frac{1}{m}\sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(X_j - \bar{X}_m)\right]^2 \tag{2-23} $$

2.2 Properties of the Maximum Likelihood Estimators

2.2.1 Estimator of the Mean of 푋

The expectation of $\hat{\mu}_x$ is

$$ E(\hat{\mu}_x) = E(\bar{X}_n) = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\cdot n\mu_x = \mu_x \tag{2-24} $$

So $\hat{\mu}_x$ is an unbiased estimator.

The variance of $\hat{\mu}_x$ is

$$ \operatorname{Var}(\hat{\mu}_x) = \operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma_x^2 = \frac{1}{n}\sigma_x^2 \tag{2-25} $$

Since each $X_i$ is normal, so is $\bar{X}_n$; hence

$$ \hat{\mu}_x \sim N\left(\mu_x, \frac{1}{n}\sigma_x^2\right) $$

2.2.2 Estimator of the Variance of $X$

The expectation of $\hat{\sigma}_x^2$ is

$$ E(\hat{\sigma}_x^2) = E\left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2\right\} = \frac{\sigma_x^2}{n}\, E\left\{\sum_{i=1}^{n}\frac{(X_i - \bar{X}_n)^2}{\sigma_x^2}\right\} = \frac{\sigma_x^2}{n}\cdot(n-1) = \frac{n-1}{n}\sigma_x^2 \tag{2-26} $$

where

$$ \sum_{i=1}^{n}\frac{(X_i - \bar{X}_n)^2}{\sigma_x^2} \sim \chi^2(n-1) \tag{2-27} $$

$\hat{\sigma}_x^2$ is a biased estimator. The bias of $\hat{\sigma}_x^2$ is

$$ \operatorname{Bias}(\hat{\sigma}_x^2, \sigma_x^2) = E(\hat{\sigma}_x^2) - \sigma_x^2 = -\frac{1}{n}\sigma_x^2 $$

The bias vanishes as $n \to \infty$.

If we define

$$ S_{xn}^2 = \frac{n}{n-1}\hat{\sigma}_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \tag{2-28} $$

then

$$ E(S_{xn}^2) = \frac{n}{n-1}\, E(\hat{\sigma}_x^2) = \sigma_x^2 \tag{2-29} $$

$S_{xn}^2$ is an unbiased estimator for $\sigma_x^2$.

The variance of $\hat{\sigma}_x^2$ is

$$ \operatorname{Var}(\hat{\sigma}_x^2) = \operatorname{Var}\left\{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2\right\} = \frac{\sigma_x^4}{n^2}\operatorname{Var}\left\{\sum_{i=1}^{n}\frac{(X_i - \bar{X}_n)^2}{\sigma_x^2}\right\} = \frac{\sigma_x^4}{n^2}\cdot 2(n-1) = \frac{2(n-1)}{n^2}\sigma_x^4 \tag{2-30} $$

The mean squared error of $\hat{\sigma}_x^2$ is

$$ \operatorname{MSE}(\hat{\sigma}_x^2) = E(\hat{\sigma}_x^2 - \sigma_x^2)^2 = \operatorname{Var}(\hat{\sigma}_x^2) + \left[\operatorname{Bias}(\hat{\sigma}_x^2, \sigma_x^2)\right]^2 = \frac{2(n-1)}{n^2}\sigma_x^4 + \frac{1}{n^2}\sigma_x^4 = \frac{2n-1}{n^2}\sigma_x^4 \tag{2-31} $$

The distribution of $\hat{\sigma}_x^2$ is related to the chi-square distribution:

$$ \frac{n\hat{\sigma}_x^2}{\sigma_x^2} \sim \chi^2(n-1) $$

2.2.3 Estimator of the Regression Coefficient $\beta$

Since the formula for $\hat{\beta}$ involves $X$ and $Y$, we first derive the conditional expectation and variance of $\hat{\beta}$ given $X = x$, and then derive its unconditional expectation and variance.

The conditional expectation of $\hat{\beta}$ given $X = x$ is

$$ E(\hat{\beta}|x) = E\left\{\frac{\sum_{j=1}^{m}(Y_j - \bar{Y}_m)(x_j - \bar{x}_m)}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2}\Big|x\right\} = \frac{E\left\{\sum_{j=1}^{m} Y_j(x_j - \bar{x}_m)\big|x\right\}}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2} = \frac{\beta\sum_{j=1}^{m}(x_j - \bar{x}_m)^2}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2} = \beta \tag{2-32} $$

where

$$ E(Y_j|x_j) = \mu_y + \beta(x_j - \mu_x) \tag{2-33} $$

$$ E\left\{\sum_{j=1}^{m} Y_j(x_j - \bar{x}_m)\Big|x\right\} = \sum_{j=1}^{m} E(Y_j|x)(x_j - \bar{x}_m) = \sum_{j=1}^{m}\left[\mu_y + \beta(x_j - \mu_x)\right](x_j - \bar{x}_m) $$
$$ = \mu_y\sum_{j=1}^{m}(x_j - \bar{x}_m) + \beta\sum_{j=1}^{m} x_j(x_j - \bar{x}_m) - \beta\mu_x\sum_{j=1}^{m}(x_j - \bar{x}_m) = \beta\sum_{j=1}^{m} x_j(x_j - \bar{x}_m) = \beta\sum_{j=1}^{m}(x_j - \bar{x}_m)^2 $$

The unconditional expectation of $\hat{\beta}$ is

$$ E(\hat{\beta}) = E[E(\hat{\beta}|X)] = \beta \tag{2-34} $$

$\hat{\beta}$ is an unbiased estimator.

Similarly, the conditional variance of $\hat{\beta}$ given $X = x$ is

$$ \operatorname{Var}(\hat{\beta}|x) = \operatorname{Var}\left\{\frac{\sum_{j=1}^{m}(Y_j - \bar{Y}_m)(x_j - \bar{x}_m)}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2}\Big|x\right\} = \frac{\operatorname{Var}\left\{\sum_{j=1}^{m} Y_j(x_j - \bar{x}_m)\big|x\right\}}{\left[\sum_{j=1}^{m}(x_j - \bar{x}_m)^2\right]^2} = \frac{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2\,\operatorname{Var}(Y_j|x)}{\left[\sum_{j=1}^{m}(x_j - \bar{x}_m)^2\right]^2} = \frac{\sigma_e^2}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2} \tag{2-35} $$

To obtain the unconditional variance of $\hat{\beta}$, we use the Law of Total Variance:

$$ \operatorname{Var}(\hat{\beta}) = E\left(\operatorname{Var}(\hat{\beta}|X)\right) + \operatorname{Var}\left(E(\hat{\beta}|X)\right) $$

Now

$$ \operatorname{Var}\left(E(\hat{\beta}|X)\right) = \operatorname{Var}(\beta) = 0 $$

$$ E\left(\operatorname{Var}(\hat{\beta}|X)\right) = E\left(\frac{\sigma_e^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) = \frac{\sigma_e^2}{\sigma_x^2}\, E\left(\frac{\sigma_x^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) $$

To evaluate this expectation, let

$$ U = \sum_{j=1}^{m}\frac{(X_j - \bar{X}_m)^2}{\sigma_x^2} $$

Then $U \sim \chi^2(m-1)$, so

$$ E\left(\frac{1}{U}\right) = \int_0^\infty \frac{1}{u}\cdot\frac{1}{\Gamma\left(\frac{m-1}{2}\right)2^{\frac{m-1}{2}}}\, u^{\frac{m-1}{2}-1} e^{-\frac{u}{2}}\,du = \frac{1}{m-3}\int_0^\infty \frac{1}{\Gamma\left(\frac{m-3}{2}\right)2^{\frac{m-3}{2}}}\, u^{\frac{m-3}{2}-1} e^{-\frac{u}{2}}\,du = \frac{1}{m-3} \quad \text{for } m > 3 $$

Hence we have

$$ E\left(\operatorname{Var}(\hat{\beta}|X)\right) = \frac{\sigma_e^2}{\sigma_x^2}\, E\left(\frac{1}{U}\right) = \frac{\sigma_e^2}{\sigma_x^2}\cdot\frac{1}{m-3} = \frac{\sigma_e^2}{(m-3)\sigma_x^2} $$

The unconditional variance of $\hat{\beta}$ is

$$ \operatorname{Var}(\hat{\beta}) = \frac{\sigma_e^2}{(m-3)\sigma_x^2} + 0 = \frac{\sigma_e^2}{(m-3)\sigma_x^2} \quad \text{for } m > 3 \tag{2-36} $$
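Formula (2-36) is easy to check by Monte Carlo. The sketch below is a Python/NumPy analogue (the dissertation's own simulations, in Appendix D, are in R); the population values are borrowed from the later numerical example (2-43) and are otherwise arbitrary.

```python
import numpy as np

# Population parameters taken from the numerical example (2-43)
mu_x, mu_y = 5.0, 2.0
var_x, var_y, cov_xy = 16.0, 4.0, 6.0
beta = cov_xy / var_x                    # 0.375, by (2-5)
var_e = var_y - beta ** 2 * var_x        # 1.75,  by (2-6)
m, reps = 20, 40000

rng = np.random.default_rng(1)
# Draw all replicates at once: X ~ N(mu_x, var_x), Y|x per (2-4)
x = rng.normal(mu_x, np.sqrt(var_x), (reps, m))
y = mu_y + beta * (x - mu_x) + rng.normal(0, np.sqrt(var_e), (reps, m))

# beta-hat (2-16) computed for every replicate simultaneously
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
beta_hat = (yc * xc).sum(axis=1) / (xc ** 2).sum(axis=1)

theory = var_e / ((m - 3) * var_x)       # unconditional variance (2-36)
print(beta_hat.var(), theory)            # the two values should be close
```

The empirical variance of $\hat{\beta}$ over the replicates agrees with $\sigma_e^2/((m-3)\sigma_x^2)$ up to Monte Carlo error.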

2.2.3.1 The Density Function of $\hat{\beta}$

According to Kendall and Stuart (1945), the density function of $\hat{\beta}$ is

$$ f(\hat{\beta}) = \frac{\Gamma\left(\frac{m}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{m-1}{2}\right)}\,\frac{\sigma_y^{m-1}(1-\rho^2)^{\frac{m-1}{2}}}{\sigma_x^{m-1}}\left[\frac{\sigma_y^2}{\sigma_x^2}(1-\rho^2) + \left(\hat{\beta} - \rho\frac{\sigma_y}{\sigma_x}\right)^2\right]^{-\frac{m}{2}} $$
$$ = \frac{\Gamma\left(\frac{m}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{m-1}{2}\right) a}\left[1 + \left(\frac{\hat{\beta}-\beta}{a}\right)^2\right]^{-\frac{m}{2}} = \frac{1}{a\,B\left(\frac{m-1}{2},\frac{1}{2}\right)}\left[1 + \left(\frac{\hat{\beta}-\beta}{a}\right)^2\right]^{-\frac{m}{2}} \tag{2-37} $$

where

$$ \beta = \rho\frac{\sigma_y}{\sigma_x}, \qquad a = \sqrt{\frac{\sigma_y^2}{\sigma_x^2} - \beta^2}, \qquad B\left(\frac{m-1}{2},\frac{1}{2}\right) = \frac{\Gamma\left(\frac{m-1}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{m}{2}\right)} \tag{2-38} $$

It is a Pearson Type VII distribution, symmetric about the point $\beta = \rho\frac{\sigma_y}{\sigma_x}$.

By (2-6), (2-36), and (2-38), we have

$$ \operatorname{Var}(\hat{\beta}) = \frac{\sigma_e^2}{(m-3)\sigma_x^2} = \frac{\sigma_y^2 - \beta^2\sigma_x^2}{(m-3)\sigma_x^2} = \frac{1}{m-3}\left(\frac{\sigma_y^2}{\sigma_x^2} - \beta^2\right) = \frac{a^2}{m-3} \tag{2-39} $$

Let

$$ k = \frac{m}{2}, \qquad \sigma^2 = \operatorname{Var}(\hat{\beta}) = \frac{a^2}{m-3} = \frac{a^2}{2k-3} \tag{2-40} $$

If $\beta$ and $\sigma$ are held constant as $k \to \infty$, we have

$$ \lim_{k\to\infty} f(\hat{\beta}) = \lim_{k\to\infty}\frac{1}{\sigma\sqrt{2k-3}\,B\left(k-\frac{1}{2},\frac{1}{2}\right)}\left[1+\frac{1}{2k-3}\left(\frac{\hat{\beta}-\beta}{\sigma}\right)^2\right]^{-k} $$
$$ = \left[\lim_{k\to\infty}\frac{\Gamma(k)}{\sigma\sqrt{2}\,\Gamma\left(\frac{1}{2}\right)\sqrt{k-\frac{3}{2}}\,\Gamma\left(k-\frac{1}{2}\right)}\right]\left[\lim_{k\to\infty}\left(1+\frac{\left(\frac{\hat{\beta}-\beta}{\sigma}\right)^2}{2k-3}\right)^{-k}\right] $$
$$ = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{\hat{\beta}-\beta}{\sigma}\right)^2\right] \tag{2-41} $$

This is the density of a normal distribution with mean $\beta$ and variance $\sigma^2$. So $\hat{\beta}$ has an asymptotically normal distribution when the sample is large:

$$ \hat{\beta} \,\dot{\sim}\, N\left(\beta, \frac{\sigma_e^2}{(m-3)\sigma_x^2}\right) \tag{2-42} $$

2.2.3.2 Plot of the Density Function of $\hat{\beta}$

Suppose we have the following bivariate normal distribution:

$$ \begin{pmatrix}\mu_y \\ \mu_x\end{pmatrix} = \begin{pmatrix}2 \\ 5\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\sigma_y^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_x^2\end{pmatrix} = \begin{pmatrix}4 & 6 \\ 6 & 16\end{pmatrix} \tag{2-43} $$

Then

$$ \beta = \rho\frac{\sigma_y}{\sigma_x} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{6}{16} = 0.375 $$

$$ a = \sqrt{\frac{\sigma_y^2}{\sigma_x^2} - \beta^2} = \sqrt{\frac{4}{16} - \left(\frac{6}{16}\right)^2} = \frac{\sqrt{7}}{8} = 0.3307 $$

For given $m$, we have

$$ B\left(\frac{m-1}{2},\frac{1}{2}\right) = B\left(\frac{49}{2},\frac{1}{2}\right) = 0.3599 \quad \text{for } m = 50 $$
$$ B\left(\frac{m-1}{2},\frac{1}{2}\right) = B\left(\frac{69}{2},\frac{1}{2}\right) = 0.3029 \quad \text{for } m = 70 $$
$$ B\left(\frac{m-1}{2},\frac{1}{2}\right) = B\left(\frac{89}{2},\frac{1}{2}\right) = 0.2664 \quad \text{for } m = 90 $$

So

$$ f(\hat{\beta}) = \frac{1}{0.1190}\left[1 + \left(\frac{\hat{\beta}-0.375}{0.3307}\right)^2\right]^{-25} \quad \text{for } m = 50 $$
$$ f(\hat{\beta}) = \frac{1}{0.1002}\left[1 + \left(\frac{\hat{\beta}-0.375}{0.3307}\right)^2\right]^{-35} \quad \text{for } m = 70 $$
$$ f(\hat{\beta}) = \frac{1}{0.0881}\left[1 + \left(\frac{\hat{\beta}-0.375}{0.3307}\right)^2\right]^{-45} \quad \text{for } m = 90 $$

Figure 2-1 Density plots for β̂ when m = 50, 70 and 90, respectively

2.2.4 Estimator of the Mean of Y

As we did for $\hat{\beta}$, we first derive the conditional expectation and variance of $\hat{\mu}_y$ given $X = x$, and then derive its unconditional expectation and variance.

The conditional expectation of $\hat{\mu}_y$ given $X = x$ is

$$ E(\hat{\mu}_y|x) = E\left\{\left[\bar{Y}_m - \hat{\beta}(\bar{x}_m - \bar{x}_n)\right]\big|x\right\} = E(\bar{Y}_m|x) - (\bar{x}_m - \bar{x}_n)\, E(\hat{\beta}|x) = \mu_y + \beta(\bar{x}_m - \mu_x) - \beta(\bar{x}_m - \bar{x}_n) = \mu_y + \beta(\bar{x}_n - \mu_x) \tag{2-44} $$

where

$$ E(\bar{Y}_m|x) = E\left(\frac{1}{m}\sum_{j=1}^{m} Y_j\Big|x\right) = \frac{1}{m}\sum_{j=1}^{m}\left[\mu_y + \beta(x_j - \mu_x)\right] = \mu_y + \beta(\bar{x}_m - \mu_x) \tag{2-45} $$

The unconditional expectation of $\hat{\mu}_y$ is

$$ E(\hat{\mu}_y) = E\left(E(\hat{\mu}_y|X)\right) = E\left[\mu_y + \beta(\bar{X}_n - \mu_x)\right] = \mu_y + \beta\left[E(\bar{X}_n) - \mu_x\right] = \mu_y \tag{2-46} $$

$\hat{\mu}_y$ is an unbiased estimator.

The conditional variance of $\hat{\mu}_y$ given $X = x$ is

$$ \operatorname{Var}(\hat{\mu}_y|x) = \operatorname{Var}\left\{\left[\bar{Y}_m - \hat{\beta}(\bar{x}_m - \bar{x}_n)\right]\big|x\right\} = \operatorname{Var}(\bar{Y}_m|x) + (\bar{x}_m - \bar{x}_n)^2\operatorname{Var}(\hat{\beta}|x) - 2(\bar{x}_m - \bar{x}_n)\operatorname{Cov}\left[(\bar{Y}_m, \hat{\beta})\big|x\right] $$
$$ = \frac{\sigma_e^2}{m} + \frac{\sigma_e^2(\bar{x}_m - \bar{x}_n)^2}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2} \tag{2-47} $$

where

$$ \operatorname{Var}(\bar{Y}_m|x) = \frac{1}{m^2}\sum_{j=1}^{m}\operatorname{Var}(Y_j|x_j) = \frac{1}{m^2}\cdot m\sigma_e^2 = \frac{\sigma_e^2}{m} \tag{2-48} $$

$$ \operatorname{Var}\left[(\bar{x}_m - \bar{x}_n)(\hat{\beta}|x)\right] = (\bar{x}_m - \bar{x}_n)^2\operatorname{Var}(\hat{\beta}|x) = \frac{\sigma_e^2(\bar{x}_m - \bar{x}_n)^2}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2} \tag{2-49} $$

$$ \operatorname{Cov}\left[(\bar{Y}_m, \hat{\beta})\big|x\right] = 0 \quad (\text{see Appendix C}) $$

To obtain the unconditional variance of $\hat{\mu}_y$, we use the Law of Total Variance:

$$ \operatorname{Var}(\hat{\mu}_y) = E\left[\operatorname{Var}(\hat{\mu}_y|X)\right] + \operatorname{Var}\left[E(\hat{\mu}_y|X)\right] $$

Now

$$ \operatorname{Var}\left[E(\hat{\mu}_y|X)\right] = \operatorname{Var}\left[\mu_y + \beta(\bar{X}_n - \mu_x)\right] = \beta^2\operatorname{Var}(\bar{X}_n) = \frac{\beta^2\sigma_x^2}{n} \tag{2-50} $$

$$ E\left[\operatorname{Var}(\hat{\mu}_y|X)\right] = E\left(\frac{\sigma_e^2}{m} + \frac{\sigma_e^2(\bar{X}_m - \bar{X}_n)^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) = \frac{\sigma_e^2}{m} + \sigma_e^2\, E\left(\frac{(\bar{X}_m - \bar{X}_n)^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) $$

To obtain $E\left(\frac{(\bar{X}_m - \bar{X}_n)^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right)$, we need the distribution of $\bar{X}_m - \bar{X}_n$. Since

$$ \bar{X}_m - \bar{X}_n = \bar{X}_m - \frac{1}{n}\left(\sum_{i=1}^{m} X_i + \sum_{j=m+1}^{n} X_j\right) = \bar{X}_m - \frac{1}{n}\left[m\bar{X}_m + (n-m)\bar{X}_{n-m}\right] = \frac{n-m}{n}(\bar{X}_m - \bar{X}_{n-m}) \tag{2-51} $$

$\bar{X}_m$ and $\bar{X}_{n-m}$ are independent and normally distributed, so we have

$$ E(\bar{X}_m - \bar{X}_{n-m}) = E(\bar{X}_m) - E(\bar{X}_{n-m}) = \mu_x - \mu_x = 0 \tag{2-52} $$

$$ \operatorname{Var}(\bar{X}_m - \bar{X}_{n-m}) = \operatorname{Var}(\bar{X}_m) + \operatorname{Var}(\bar{X}_{n-m}) = \frac{1}{m}\sigma_x^2 + \frac{1}{n-m}\sigma_x^2 = \frac{n}{m(n-m)}\sigma_x^2 \tag{2-53} $$

Let

$$ S_{xm}^2 = \frac{1}{m-1}\sum_{j=1}^{m}(X_j - \bar{X}_m)^2 \tag{2-54} $$

It is known that $\bar{X}_m$ and $S_{xm}^2$ are independent, so $(\bar{X}_m - \bar{X}_{n-m})$ and $S_{xm}^2$ are independent too. Hence we have

$$ E\left(\frac{(\bar{X}_m - \bar{X}_n)^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) = \left(\frac{n-m}{n}\right)^2 E\left(\frac{(\bar{X}_m - \bar{X}_{n-m})^2}{(m-1)S_{xm}^2}\right) = \left(\frac{n-m}{n}\right)^2\cdot\frac{n}{m(n-m)}\cdot E\left\{\frac{(\bar{X}_m - \bar{X}_{n-m})^2\Big/\frac{n}{m(n-m)}\sigma_x^2}{(m-1)S_{xm}^2\big/\sigma_x^2}\right\} $$

where

$$ U = \frac{(\bar{X}_m - \bar{X}_{n-m})^2}{\frac{n}{m(n-m)}\sigma_x^2} \sim \chi^2(1), \qquad V = \frac{(m-1)S_{xm}^2}{\sigma_x^2} \sim \chi^2(m-1) $$

Let

$$ F = \frac{U/1}{V/(m-1)} $$

Then $F \sim F(1, m-1)$. So

$$ E\left(\frac{(\bar{X}_m - \bar{X}_n)^2}{\sum_{j=1}^{m}(X_j - \bar{X}_m)^2}\right) = \frac{n-m}{mn}\cdot\frac{1}{m-1}\, E(F) = \frac{n-m}{mn}\cdot\frac{1}{m-1}\cdot\frac{m-1}{m-3} = \frac{n-m}{mn(m-3)} \quad \text{for } m > 3 $$

Hence

$$ E\left(\operatorname{Var}(\hat{\mu}_y|X)\right) = \frac{\sigma_e^2}{m} + \frac{(n-m)\sigma_e^2}{mn(m-3)} = \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] \quad \text{for } m > 3 \tag{2-55} $$

By (2-50) and (2-55), the unconditional variance of $\hat{\mu}_y$ is

$$ \operatorname{Var}(\hat{\mu}_y) = E\left[\operatorname{Var}(\hat{\mu}_y|X)\right] + \operatorname{Var}\left[E(\hat{\mu}_y|X)\right] = \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} \quad \text{for } m > 3 \tag{2-56} $$

$\hat{\mu}_y$ has an asymptotically normal distribution when the sample is large:

$$ \hat{\mu}_y \,\dot{\sim}\, N\left(\mu_y, \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n}\right) \tag{2-57} $$

2.2.4.1 Comparison of the Variances

When we do not consider the extra information $X_{m+1}, X_{m+2}, \ldots, X_n$ and only use the first $m$ observations, the unconditional variance of $\hat{\mu}_y$ is given in Appendix (A-16):

$$ \operatorname{Var}(\hat{\mu}_y)_{no} = \frac{\sigma_e^2}{m} + \frac{\beta^2\sigma_x^2}{m} = \frac{\sigma_y^2}{m} $$

From (2-56), we have

$$ \operatorname{Var}(\hat{\mu}_y) = \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} + \frac{\beta^2\sigma_x^2}{m} - \frac{\beta^2\sigma_x^2}{m} $$
$$ = \frac{\sigma_e^2}{m} + \frac{\beta^2\sigma_x^2}{m} + \frac{n-m}{mn(m-3)}\sigma_e^2 - \frac{n-m}{mn}\beta^2\sigma_x^2 $$
$$ = \frac{\sigma_y^2}{m} + \frac{n-m}{mn(m-3)}\sigma_e^2 - \frac{n-m}{mn}(\sigma_y^2 - \sigma_e^2) $$
$$ = \frac{\sigma_y^2}{m} + \frac{n-m}{mn}\left[\left(1 + \frac{1}{m-3}\right)\sigma_e^2 - \sigma_y^2\right] $$
$$ = \frac{\sigma_y^2}{m} + \frac{n-m}{mn}\left[\left(1 + \frac{1}{m-3}\right)\sigma_y^2(1-\rho^2) - \sigma_y^2\right] \qquad \text{since } \sigma_e^2 = \sigma_y^2(1-\rho^2) $$
$$ = \operatorname{Var}(\hat{\mu}_y)_{no} + \frac{n-m}{mn(m-3)}\sigma_y^2\left[1 - (m-2)\rho^2\right] $$
$$ = \operatorname{Var}(\hat{\mu}_y)_{no}\left\{1 + \frac{n-m}{n(m-3)}\left[1 - (m-2)\rho^2\right]\right\} \tag{2-58} $$

Hence, if $1 - (m-2)\rho^2 < 0$, then $\operatorname{Var}(\hat{\mu}_y) < \operatorname{Var}(\hat{\mu}_y)_{no}$, i.e., when

$$ \rho^2 > \frac{1}{m-2} \tag{2-59} $$

Define

$$ ratio = \frac{\operatorname{Var}(\hat{\mu}_y)}{\operatorname{Var}(\hat{\mu}_y)_{no}} = 1 + \frac{n-m}{n(m-3)}\left[1 - (m-2)\rho^2\right] $$

We have the following two tables:

Table 2-1 Variance Ratio for Sample Size = 100

 ρ²     n=100, m=50   n=100, m=70   n=100, m=90
 0.1      0.9596        0.9740        0.9910
 0.2      0.9085        0.9436        0.9809
 0.3      0.8574        0.9131        0.9708
 0.4      0.8064        0.8827        0.9607
 0.5      0.7553        0.8522        0.9506
 0.6      0.7043        0.8218        0.9405
 0.7      0.6532        0.7913        0.9303
 0.8      0.6021        0.7609        0.9202
 0.9      0.5511        0.7304        0.9101

Table 2-2 Variance Ratio for Sample Size = 200

 ρ²     n=200, m=100   n=200, m=140   n=200, m=180
 0.1      0.9546         0.9720         0.9905
 0.2      0.9041         0.9418         0.9805
 0.3      0.8536         0.9115         0.9704
 0.4      0.8031         0.8813         0.9603
 0.5      0.7526         0.8511         0.9503
 0.6      0.7021         0.8209         0.9402
 0.7      0.6515         0.7907         0.9302
 0.8      0.6010         0.7604         0.9201
 0.9      0.5505         0.7302         0.9101
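Every entry in these tables is the closed-form ratio from (2-58). A small helper (a hypothetical function name, for illustration only) reproduces them; for example, n = 100, m = 50, ρ² = 0.5 gives 0.7553:

```python
def variance_ratio(n, m, rho2):
    """Ratio Var(mu_y hat) / Var(mu_y hat)_no from (2-58)."""
    return 1 + (n - m) / (n * (m - 3)) * (1 - (m - 2) * rho2)

print(round(variance_ratio(100, 50, 0.5), 4))   # 0.7553, as in Table 2-1
print(round(variance_ratio(200, 180, 0.9), 4))  # 0.9101, as in Table 2-2
```

The ratio drops below 1 exactly when $\rho^2 > 1/(m-2)$, which is the condition (2-59).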

We can see that the more correlated the variables and the larger the fraction of missing observations, the smaller the ratio, i.e., the smaller $\operatorname{Var}(\hat{\mu}_y)$ is compared with $\operatorname{Var}(\hat{\mu}_y)_{no}$.

2.2.4.2 Simulation on Variances

From (2-56), we have

$$ \operatorname{Var}(\hat{\mu}_y) = \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} $$

The estimated variance of $\hat{\mu}_y$ is

$$ \widehat{\operatorname{Var}}(\hat{\mu}_y) = \frac{S_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\hat{\beta}^2 S_{xn}^2}{n} \tag{2-60} $$

where

$$ S_e^2 = \frac{1}{m-2}\sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(X_j - \bar{X}_m)\right]^2 \tag{2-61} $$

and $S_{xn}^2$ is given in (2-28).

If we do not consider the extra information $X_{m+1}, X_{m+2}, \ldots, X_n$ and only use the first $m$ observations,

$$ \operatorname{Var}(\hat{\mu}_y)_{no} = \frac{\sigma_e^2}{m} + \frac{\beta^2\sigma_x^2}{m} $$

The estimated variance is

$$ \widehat{\operatorname{Var}}(\hat{\mu}_y)_{no} = \frac{S_e^2}{m} + \frac{\hat{\beta}^2 S_{xm}^2}{m} \tag{2-62} $$

where $S_{xm}^2$ is given in (2-54) and $S_e^2$ is given in (2-61).

Using R Studio and the following bivariate normal distributions, both of which meet (2-59):

$$ \text{Low correlation:}\quad \begin{pmatrix}Y\\X\end{pmatrix} \sim N_2\left(\begin{pmatrix}540\\420\end{pmatrix}, \begin{bmatrix}970 & 600\\600 & 3450\end{bmatrix}\right) \quad (a) $$
$$ \text{High correlation:}\quad \begin{pmatrix}Y\\X\end{pmatrix} \sim N_2\left(\begin{pmatrix}540\\420\end{pmatrix}, \begin{bmatrix}970 & 1200\\1200 & 3450\end{bmatrix}\right) \quad (b) \tag{2-63} $$

with $n = 40$ and $m = 20, 28, 36$, we have the following results from 10,000 simulation runs:

Table 2-3 Simulation Results on Variance

                        Low Correlation                     High Correlation
                  Var̂(μ̂_y)       Var̂(μ̂_y)_no       Var̂(μ̂_y)       Var̂(μ̂_y)_no
 n   m            Mean    SD      Mean    SD        Mean    SD      Mean    SD
 40  20 (miss 50%)  48.40  15.39   50.76  16.48      39.63  11.41   49.85  16.15
 40  28 (miss 30%)  34.76   9.34   35.85   9.82      30.93   7.72   35.41   9.67
 40  36 (miss 10%)  27.41   6.53   27.70   6.66      26.26   6.09   27.41   6.57

Figure 2-2 Comparison of Variance with and without Extra Information

The variance considering all the information is smaller. That means the confidence interval is shorter with the extra information than without it.
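The simulation of this subsection was run in R Studio; the following is a Python/NumPy analogue (illustrative only, with fewer replicates than the 10,000 used above) for the low-correlation distribution (2-63)(a) with n = 40 and m = 20, and it reproduces the ordering seen in Table 2-3:

```python
import numpy as np

# Low-correlation population (2-63)(a); mean vector is (mu_y, mu_x) = (540, 420)
mu = np.array([540.0, 420.0])
cov = np.array([[970.0, 600.0],
                [600.0, 3450.0]])
n, m, reps = 40, 20, 4000

rng = np.random.default_rng(2)
est_with = np.empty(reps)
est_without = np.empty(reps)
for r in range(reps):
    yx = rng.multivariate_normal(mu, cov, n)
    y, x = yx[:m, 0], yx[:, 1]            # Y is missing for the last n - m rows
    xc = x[:m] - x[:m].mean()
    b = np.sum((y - y.mean()) * xc) / np.sum(xc ** 2)
    s2_e = np.sum(((y - y.mean()) - b * xc) ** 2) / (m - 2)    # (2-61)
    s2_xn = np.sum((x - x.mean()) ** 2) / (n - 1)              # (2-28)
    s2_xm = np.sum(xc ** 2) / (m - 1)                          # (2-54)
    est_with[r] = s2_e / m * (1 + (n - m) / (n * (m - 3))) + b ** 2 * s2_xn / n  # (2-60)
    est_without[r] = s2_e / m + b ** 2 * s2_xm / m                               # (2-62)

# On average the estimate using the extra X's is smaller, as in Table 2-3
print(est_with.mean(), est_without.mean())
```

The averages should land near the 48.40 and 50.76 reported in the first row of Table 2-3, up to Monte Carlo error.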

2.2.5 Estimator of the Conditional Variance of Y given x

Since

$$ \hat{\sigma}_e^2 = \frac{1}{m}\sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(X_j - \bar{X}_m)\right]^2 $$

does not involve the extra information $X_{m+1}, X_{m+2}, \ldots, X_n$, it is the same both with and without the extra information. So we consider the situation with no extra information.

By (2-4),

$$ E(Y_j|x_j) = \mu_y + \beta(x_j - \mu_x), \quad j = 1, 2, \ldots, m \tag{2-64} $$

For given $x_j$, we may write (2-64) as

$$ Y_j = \mu_y + \beta(x_j - \mu_x) + \varepsilon_j, \quad j = 1, 2, \ldots, m \tag{2-65} $$

where

$$ \operatorname{Var}(Y_j|x_j) = \operatorname{Var}(\varepsilon_j) = \sigma_e^2, \qquad E(\varepsilon_j) = E(Y_j|x_j) - \mu_y - \beta(x_j - \mu_x) = 0 $$

Rewrite (2-65) as

$$ Y_j = \beta_* + \beta(x_j - \bar{x}_m) + \varepsilon_j, \quad j = 1, 2, \ldots, m \tag{2-66} $$

where $\beta_* = \mu_y + \beta(\bar{x}_m - \mu_x)$. Equation (2-66) is the mean-corrected form of the regression model.

Let

$$ \mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix}, \qquad \mathbf{X}_c = \begin{bmatrix} 1 & x_1 - \bar{x}_m \\ 1 & x_2 - \bar{x}_m \\ \vdots & \vdots \\ 1 & x_m - \bar{x}_m \end{bmatrix}, \qquad \boldsymbol{\beta}_c = \begin{bmatrix} \beta_* \\ \beta \end{bmatrix}, \qquad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{bmatrix} $$

Then (2-66) can be written as

$$ \mathbf{Y} = \mathbf{X}_c\boldsymbol{\beta}_c + \boldsymbol{\varepsilon} \tag{2-67} $$

By Results 7.1, 7.2 and 7.4 in Johnson and Wichern (1998), the least squares estimators are

$$ \hat{\boldsymbol{\beta}}_c = (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{Y} \tag{2-68} $$

$$ \hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \hat{\mathbf{Y}} = \left[\mathbf{I} - \mathbf{X}_c(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\right]\mathbf{Y} \tag{2-69} $$

and

$$ \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} \sim \sigma_e^2\,\chi_{m-2}^2 \tag{2-70} $$

From (2-69), writing $\mathbf{x}_{c2}$ for the second column of $\mathbf{X}_c$, we have

$$ \hat{\boldsymbol{\varepsilon}} = \left\{\mathbf{I} - \begin{bmatrix}\mathbf{1} & \mathbf{x}_{c2}\end{bmatrix}\begin{bmatrix}\frac{1}{m} & \mathbf{0}' \\ \mathbf{0} & \frac{1}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2}\end{bmatrix}\begin{bmatrix}\mathbf{1}' \\ \mathbf{x}_{c2}'\end{bmatrix}\right\}\mathbf{Y} = \left\{\mathbf{I} - \frac{\mathbf{1}\mathbf{1}'}{m} - \frac{\mathbf{x}_{c2}\mathbf{x}_{c2}'}{\sum_{j=1}^{m}(x_j - \bar{x}_m)^2}\right\}\mathbf{Y} $$

$$ = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{bmatrix} - \begin{bmatrix} \bar{Y}_m \\ \bar{Y}_m \\ \vdots \\ \bar{Y}_m \end{bmatrix} - \hat{\beta}\begin{bmatrix} x_1 - \bar{x}_m \\ x_2 - \bar{x}_m \\ \vdots \\ x_m - \bar{x}_m \end{bmatrix} = \begin{bmatrix} Y_1 - \bar{Y}_m - \hat{\beta}(x_1 - \bar{x}_m) \\ Y_2 - \bar{Y}_m - \hat{\beta}(x_2 - \bar{x}_m) \\ \vdots \\ Y_m - \bar{Y}_m - \hat{\beta}(x_m - \bar{x}_m) \end{bmatrix} \tag{2-71} $$

So

$$ \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(x_j - \bar{x}_m)\right]^2 \tag{2-72} $$

The conditional expectation of $\hat{\sigma}_e^2$ given $X = x$ is

$$ E(\hat{\sigma}_e^2|x) = E\left[\frac{1}{m}\sum_{j=1}^{m}\left[(Y_j - \bar{Y}_m) - \hat{\beta}(X_j - \bar{X}_m)\right]^2\Big|x\right] = \frac{1}{m}\, E(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}) = \frac{1}{m}\cdot(m-2)\sigma_e^2 = \frac{m-2}{m}\sigma_e^2 \tag{2-73} $$

Then the unconditional expectation of $\hat{\sigma}_e^2$ is

$$ E(\hat{\sigma}_e^2) = E\left[E(\hat{\sigma}_e^2|X)\right] = E\left[\frac{m-2}{m}\sigma_e^2\right] = \frac{m-2}{m}\sigma_e^2 \tag{2-74} $$

$\hat{\sigma}_e^2$ is a biased estimator. The bias of $\hat{\sigma}_e^2$ is

$$ \operatorname{Bias}(\hat{\sigma}_e^2) = E(\hat{\sigma}_e^2) - \sigma_e^2 = -\frac{2}{m}\sigma_e^2 $$

The bias vanishes as $m \to \infty$.

The conditional variance of $\hat{\sigma}_e^2$ given $X = x$ is

$$ \operatorname{Var}(\hat{\sigma}_e^2|x) = \frac{1}{m^2}\operatorname{Var}(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}) = \frac{1}{m^2}\cdot 2(m-2)\sigma_e^4 = \frac{2(m-2)}{m^2}\sigma_e^4 \tag{2-75} $$

By the Law of Total Variance, the unconditional variance of $\hat{\sigma}_e^2$ is

$$ \operatorname{Var}(\hat{\sigma}_e^2) = E\left(\operatorname{Var}(\hat{\sigma}_e^2|X)\right) + \operatorname{Var}\left(E(\hat{\sigma}_e^2|X)\right) = E\left[\frac{2(m-2)\sigma_e^4}{m^2}\right] + 0 = \frac{2(m-2)}{m^2}\sigma_e^4 \tag{2-76} $$

The mean squared error of $\hat{\sigma}_e^2$ is

$$ \operatorname{MSE}(\hat{\sigma}_e^2) = E(\hat{\sigma}_e^2 - \sigma_e^2)^2 = \operatorname{Var}(\hat{\sigma}_e^2) + \left[\operatorname{Bias}(\hat{\sigma}_e^2)\right]^2 = \frac{2(m-2)}{m^2}\sigma_e^4 + \frac{4}{m^2}\sigma_e^4 = \frac{2}{m}\sigma_e^4 \tag{2-77} $$

Since the conditional distribution

$$ \frac{m\hat{\sigma}_e^2}{\sigma_e^2}\Big|x \sim \chi_{m-2}^2 $$

does not depend on $x$, the unconditional distribution of $\frac{m\hat{\sigma}_e^2}{\sigma_e^2}$ is also $\chi_{m-2}^2$, i.e.,

$$ \frac{m\hat{\sigma}_e^2}{\sigma_e^2} \sim \chi_{m-2}^2 \tag{2-78} $$
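These moments can be verified numerically. The Monte Carlo sketch below (Python/NumPy, illustrative only; parameters echo the example (2-43)) estimates $E(\hat{\sigma}_e^2)$ by simulation and compares it with $(m-2)\sigma_e^2/m$ from (2-74):

```python
import numpy as np

m, reps = 30, 20000
mu_x, mu_y, beta, var_e = 5.0, 2.0, 0.375, 1.75   # from (2-43) via (2-5), (2-6)

rng = np.random.default_rng(3)
x = rng.normal(mu_x, 4.0, (reps, m))
y = mu_y + beta * (x - mu_x) + rng.normal(0, np.sqrt(var_e), (reps, m))

xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
beta_hat = (yc * xc).sum(axis=1) / (xc ** 2).sum(axis=1)
# sigma_e^2 hat per (2-18), computed for each replicate
sig_e2_hat = ((yc - beta_hat[:, None] * xc) ** 2).mean(axis=1)

print(sig_e2_hat.mean(), (m - 2) / m * var_e)   # empirical mean vs (2-74)
```

The downward bias $-2\sigma_e^2/m$ is clearly visible for small $m$ and shrinks as $m$ grows, as stated above.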

2.3 Fisher Information Matrix

Upon taking the negative expectation of the second partial derivatives of (2-8) with respect to the parameters (see Appendix B), we obtain the following Fisher Information Matrix:

$$ \mathbf{I}(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = \begin{bmatrix} I_{\mu_y\mu_y} & I_{\mu_y\mu_x} & I_{\mu_y\beta} & I_{\mu_y\sigma_x^2} & I_{\mu_y\sigma_e^2} \\ I_{\mu_x\mu_y} & I_{\mu_x\mu_x} & I_{\mu_x\beta} & I_{\mu_x\sigma_x^2} & I_{\mu_x\sigma_e^2} \\ I_{\beta\mu_y} & I_{\beta\mu_x} & I_{\beta\beta} & I_{\beta\sigma_x^2} & I_{\beta\sigma_e^2} \\ I_{\sigma_x^2\mu_y} & I_{\sigma_x^2\mu_x} & I_{\sigma_x^2\beta} & I_{\sigma_x^2\sigma_x^2} & I_{\sigma_x^2\sigma_e^2} \\ I_{\sigma_e^2\mu_y} & I_{\sigma_e^2\mu_x} & I_{\sigma_e^2\beta} & I_{\sigma_e^2\sigma_x^2} & I_{\sigma_e^2\sigma_e^2} \end{bmatrix} $$

$$ = \begin{bmatrix} \dfrac{m}{\sigma_e^2} & -\dfrac{m\beta}{\sigma_e^2} & 0 & 0 & 0 \\ -\dfrac{m\beta}{\sigma_e^2} & \dfrac{m\beta^2}{\sigma_e^2} + \dfrac{n}{\sigma_x^2} & 0 & 0 & 0 \\ 0 & 0 & \dfrac{m\sigma_x^2}{\sigma_e^2} & 0 & 0 \\ 0 & 0 & 0 & \dfrac{n}{2\sigma_x^4} & 0 \\ 0 & 0 & 0 & 0 & \dfrac{m}{2\sigma_e^4} \end{bmatrix} \tag{2-79} $$

The inverse of the Fisher Information Matrix is

$$ \mathbf{I}^{-1}(\mu_y, \mu_x, \beta, \sigma_x^2, \sigma_e^2) = \begin{bmatrix} \dfrac{\sigma_e^2}{m} + \dfrac{\beta^2\sigma_x^2}{n} & \dfrac{\beta\sigma_x^2}{n} & 0 & 0 & 0 \\ \dfrac{\beta\sigma_x^2}{n} & \dfrac{\sigma_x^2}{n} & 0 & 0 & 0 \\ 0 & 0 & \dfrac{\sigma_e^2}{m\sigma_x^2} & 0 & 0 \\ 0 & 0 & 0 & \dfrac{2\sigma_x^4}{n} & 0 \\ 0 & 0 & 0 & 0 & \dfrac{2\sigma_e^4}{m} \end{bmatrix} \tag{2-80} $$

Denote the elements on the diagonal of (2-80) as

$$ \operatorname{Var}(\hat{\mu}_y)_{Fisher} = \frac{\sigma_e^2}{m} + \frac{\beta^2\sigma_x^2}{n}, \qquad \operatorname{Var}(\hat{\mu}_x)_{Fisher} = \frac{\sigma_x^2}{n}, \qquad \operatorname{Var}(\hat{\beta})_{Fisher} = \frac{\sigma_e^2}{m\sigma_x^2}, $$
$$ \operatorname{Var}(\hat{\sigma}_x^2)_{Fisher} = \frac{2\sigma_x^4}{n}, \qquad \operatorname{Var}(\hat{\sigma}_e^2)_{Fisher} = \frac{2\sigma_e^4}{m} $$

respectively. We will compare them with the variance of each estimator.

1) $Var(\hat{\mu}_y)$ vs $Var(\hat{\mu}_y)_{Fisher}$

Since

$\frac{Var(\hat{\mu}_y)}{Var(\hat{\mu}_y)_{Fisher}} = \frac{\frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n}}{\frac{\sigma_e^2}{m} + \frac{\beta^2\sigma_x^2}{n}} = 1 + \frac{1-\delta}{m-3}\cdot\frac{\sigma_e^2}{\sigma_e^2 + \delta\beta^2\sigma_x^2} \to 1 \text{ as } m \to \infty$

where $\delta = \frac{m}{n} = constant$, $\hat{\mu}_y$ is an asymptotically efficient estimator.

2) $Var(\hat{\mu}_x)$ vs $Var(\hat{\mu}_x)_{Fisher}$

Since

$\frac{Var(\hat{\mu}_x)}{Var(\hat{\mu}_x)_{Fisher}} = \frac{\sigma_x^2/n}{\sigma_x^2/n} = 1$

$\hat{\mu}_x$ is an efficient estimator.

3) $Var(\hat{\beta})$ vs $Var(\hat{\beta})_{Fisher}$

Since

$\frac{Var(\hat{\beta})}{Var(\hat{\beta})_{Fisher}} = \frac{\frac{\sigma_e^2}{(m-3)\sigma_x^2}}{\frac{\sigma_e^2}{m\sigma_x^2}} = \frac{m}{m-3} \to 1 \text{ as } m \to \infty$

$\hat{\beta}$ is an asymptotically efficient estimator.

4) $Var(\hat{\sigma}_x^2)$ vs $\left[\frac{dE(\hat{\sigma}_x^2)}{d\sigma_x^2}\right]^2 Var(\hat{\sigma}_x^2)_{Fisher}$

The expectation of $\hat{\sigma}_x^2$ is

$E(\hat{\sigma}_x^2) = \frac{n-1}{n}\sigma_x^2$

and its derivative with respect to $\sigma_x^2$ is

$\frac{dE(\hat{\sigma}_x^2)}{d\sigma_x^2} = \frac{n-1}{n}$

We have

$\frac{Var(\hat{\sigma}_x^2)}{\left[\frac{dE(\hat{\sigma}_x^2)}{d\sigma_x^2}\right]^2 Var(\hat{\sigma}_x^2)_{Fisher}} = \frac{\frac{2(n-1)}{n^2}\sigma_x^4}{\left(\frac{n-1}{n}\right)^2\frac{2\sigma_x^4}{n}} = \frac{n}{n-1} \to 1 \text{ as } n \to \infty$

So $\hat{\sigma}_x^2$ is an asymptotically efficient estimator.

5) $Var(\hat{\sigma}_e^2)$ vs $\left[\frac{dE(\hat{\sigma}_e^2)}{d\sigma_e^2}\right]^2 Var(\hat{\sigma}_e^2)_{Fisher}$

Since the expectation of $\hat{\sigma}_e^2$ is

$E(\hat{\sigma}_e^2) = \frac{m-2}{m}\sigma_e^2$

and its derivative with respect to $\sigma_e^2$ is

$\frac{dE(\hat{\sigma}_e^2)}{d\sigma_e^2} = \frac{m-2}{m}$

we have

$\frac{Var(\hat{\sigma}_e^2)}{\left[\frac{dE(\hat{\sigma}_e^2)}{d\sigma_e^2}\right]^2 Var(\hat{\sigma}_e^2)_{Fisher}} = \frac{\frac{2(m-2)}{m^2}\sigma_e^4}{\left(\frac{m-2}{m}\right)^2\frac{2\sigma_e^4}{m}} = \frac{m}{m-2} \to 1 \text{ as } m \to \infty$

So $\hat{\sigma}_e^2$ is an asymptotically efficient estimator.

2.4 Prediction

Suppose we have a future observation $(X_0, Y_0)$ with a bivariate normal distribution

$\begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} \sim N_2\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\sigma_x^2 & \sigma_{xy} \\ \sigma_{yx} & \sigma_y^2\end{bmatrix}\right)$  (2 - 81)

We have the following three kinds of prediction interval for $Y_0$:

1) Usual prediction interval for $Y_0$ – conditioning on $X = x$ and $X_0 = x_0$

2) Prediction interval for $Y_0$ – unconditional on $X$, but conditioning on $X_0 = x_0$

3) Unconditional prediction interval for $Y_0$

2.4.1 Usual prediction interval

– Conditioning on $X = x$ and $X_0 = x_0$

The prediction value of $Y_0$ given $X = x$ and $X_0 = x_0$ is

$\hat{Y}_0|x, x_0 = \hat{\mu}_y + \hat{\beta}(x_0 - \bar{x}_n)$  (2 - 82)

By $(2 - 32)$, $(2 - 35)$ and our assumption $(2 - 81)$, the distribution of $(\hat{\beta}|x)$ is

$(\hat{\beta}|x) \sim N\left(\beta, \frac{\sigma_e^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right)$  (2 - 83)

By $(2 - 44)$, $(2 - 47)$ and our assumption $(2 - 81)$, the distribution of $(\hat{\mu}_y|x)$ is

$(\hat{\mu}_y|x) \sim N\left(\mu_y + \beta(\bar{x}_n - \mu_x),\ \frac{\sigma_e^2}{m} + \frac{\sigma_e^2(\bar{x}_m - \bar{x}_n)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right)$  (2 - 84)

and $(\hat{\mu}_y, \hat{\beta})$ are independent of $(X_0, Y_0)$.

So, by $(2 - 83)$ and $(2 - 84)$, the expectation of $\hat{Y}_0|x, x_0$ is

$E(\hat{Y}_0|x, x_0) = E(\hat{\mu}_y|x, x_0) + E(\hat{\beta}|x, x_0)(x_0 - \bar{x}_n) = E(\hat{\mu}_y|x) + E(\hat{\beta}|x)(x_0 - \bar{x}_n)$

$= \mu_y + \beta(\bar{x}_n - \mu_x) + \beta(x_0 - \bar{x}_n) = \mu_y + \beta(x_0 - \mu_x)$  (2 - 85)

The variance of $\hat{Y}_0|x, x_0$ is

$Var(\hat{Y}_0|x, x_0) = Var(\hat{\mu}_y|x) + (x_0 - \bar{x}_n)^2 Var(\hat{\beta}|x) + 2(x_0 - \bar{x}_n)Cov[(\hat{\mu}_y, \hat{\beta})|x]$

$= \frac{\sigma_e^2}{m} + \frac{\sigma_e^2(\bar{x}_m - \bar{x}_n)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2} + \frac{\sigma_e^2(x_0 - \bar{x}_n)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2} - \frac{2\sigma_e^2(x_0 - \bar{x}_n)(\bar{x}_m - \bar{x}_n)}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}$

$= \frac{\sigma_e^2}{m} + \frac{\sigma_e^2[(x_0 - \bar{x}_n) - (\bar{x}_m - \bar{x}_n)]^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2} = \frac{\sigma_e^2}{m} + \frac{\sigma_e^2(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}$  (2 - 86)

where

$Cov[(\hat{\mu}_y, \hat{\beta})|x] = Cov[(\bar{Y}_m - \hat{\beta}(\bar{x}_m - \bar{x}_n), \hat{\beta})|x] = Cov(\bar{Y}_m, \hat{\beta})|x - (\bar{x}_m - \bar{x}_n)Cov(\hat{\beta}, \hat{\beta})|x = -\frac{\sigma_e^2(\bar{x}_m - \bar{x}_n)}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}$  (2 - 87)

By assumption, we have

$Y_0|x, x_0 \sim N(\mu_y + \beta(x_0 - \mu_x), \sigma_e^2)$  (2 - 88)

By $(2 - 85)$ and $(2 - 88)$, the expectation of $(Y_0 - \hat{Y}_0)|x, x_0$ is

$E(Y_0 - \hat{Y}_0)|x, x_0 = E(Y_0|x, x_0) - E(\hat{Y}_0|x, x_0) = \mu_y + \beta(x_0 - \mu_x) - [\mu_y + \beta(x_0 - \mu_x)] = 0$  (2 - 89)

By $(2 - 86)$ and $(2 - 88)$, the variance of $(Y_0 - \hat{Y}_0)|x, x_0$ is

$Var(Y_0 - \hat{Y}_0)|x, x_0 = Var(Y_0)|x, x_0 + Var(\hat{Y}_0)|x, x_0 - 2Cov(\hat{Y}_0, Y_0)|x, x_0$

$= \sigma_e^2 + \frac{\sigma_e^2}{m} + \frac{\sigma_e^2(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2} - 2\cdot 0 = \sigma_e^2\left[1 + \frac{1}{m} + \frac{(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right]$  (2 - 90)

where

$Cov(\hat{Y}_0, Y_0)|x, x_0 = Cov[\hat{\mu}_y + \hat{\beta}(x_0 - \bar{x}_n), Y_0]|x, x_0 = 0$

Since $Y_0$ and $\hat{Y}_0$ are normal, $Y_0 - \hat{Y}_0$ is normal, so

$Z = \frac{(Y_0 - \hat{Y}_0)|x, x_0 - E(Y_0 - \hat{Y}_0)|x, x_0}{\sqrt{\sigma_e^2\left[1 + \frac{1}{m} + \frac{(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right]}} \sim N(0, 1)$

By $(2 - 78)$,

$U = \frac{m\hat{\sigma}_e^2}{\sigma_e^2} \sim \chi^2_{m-2}$

and $Z$ and $U$ are independent, so

$T = \frac{Z}{\sqrt{U/(m-2)}} \sim t_{m-2}$

i.e.,

$T = \frac{(Y_0 - \hat{Y}_0)|x, x_0}{\sqrt{\frac{m\hat{\sigma}_e^2}{m-2}\left[1 + \frac{1}{m} + \frac{(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right]}} \sim t_{m-2}$

The 95% prediction interval for $Y_0$ given $X = x$ and $X_0 = x_0$ is

$\hat{Y}_0|x, x_0 \pm t_{0.025, m-2}\sqrt{\frac{m\hat{\sigma}_e^2}{m-2}\left[1 + \frac{1}{m} + \frac{(x_0 - \bar{x}_m)^2}{\sum_{j=1}^m (x_j - \bar{x}_m)^2}\right]}$  (2 - 91)
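A minimal numerical sketch of interval $(2 - 91)$ in plain Python (the function name is ours; the caller supplies the $t_{0.025, m-2}$ quantile, and the first $m$ entries of the predictor list are assumed to be the ones paired with observed responses):

```python
import math

def usual_prediction_interval(x_all, y, x0, t_crit):
    # 95% PI for Y0 given X = x and X0 = x0, following (2 - 91).
    # x_all: all n predictor values; y: the m observed responses; t_crit: t_{0.025, m-2}.
    m, n = len(y), len(x_all)
    xm = x_all[:m]
    xbar_m = sum(xm) / m
    xbar_n = sum(x_all) / n
    ybar = sum(y) / m
    sxx = sum((xi - xbar_m) ** 2 for xi in xm)
    beta = sum((xi - xbar_m) * (yi - ybar) for xi, yi in zip(xm, y)) / sxx
    mu_y = ybar - beta * (xbar_m - xbar_n)       # MLE of mu_y
    y0hat = mu_y + beta * (x0 - xbar_n)          # point prediction, (2 - 82)
    se2 = sum((yi - ybar - beta * (xi - xbar_m)) ** 2
              for xi, yi in zip(xm, y)) / m      # MLE of sigma_e^2
    half = t_crit * math.sqrt(m * se2 / (m - 2)
                              * (1 + 1 / m + (x0 - xbar_m) ** 2 / sxx))
    return y0hat - half, y0hat + half
```

Note that $\hat{\mu}_y + \hat{\beta}(x_0 - \bar{x}_n)$ simplifies to $\bar{y}_m + \hat{\beta}(x_0 - \bar{x}_m)$, so the width of the interval depends on the extra $x$'s only through the estimates.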

2.4.2 Prediction interval

– Unconditional on $X$, but conditioning on $X_0 = x_0$

The prediction value of $Y_0$ given $X_0 = x_0$ is

$\hat{Y}_0|x_0 = \hat{\mu}_y + \hat{\beta}(x_0 - \hat{\mu}_x)$  (2 - 92)

By $(2 - 34)$ and $(2 - 46)$, the expectation of $\hat{Y}_0|x_0$ is

$E(\hat{Y}_0|x_0) = E(\hat{\mu}_y|x_0) + E[\hat{\beta}(x_0 - \hat{\mu}_x)|x_0] = E(\hat{\mu}_y) + x_0 E(\hat{\beta}) - E(\hat{\beta}\hat{\mu}_x) = \mu_y + \beta(x_0 - \mu_x)$  (2 - 93)

where

$E(\hat{\beta}\hat{\mu}_x) = E[E(\hat{\beta}\hat{\mu}_x)|x] = E[\hat{\mu}_x E(\hat{\beta})|x] = \beta E(\hat{\mu}_x) = \beta\mu_x$  (2 - 94)

$\hat{\mu}_x$ and $\hat{\beta}$ are uncorrelated, and by assumption $(\hat{\mu}_x, \hat{\beta})$ are independent of $(X_0, Y_0)$.

We derive the variance of $\hat{Y}_0|x_0$ by the ∆-method. Let

$\boldsymbol{Z} = \begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \end{bmatrix} = \begin{bmatrix} \hat{\mu}_y \\ \hat{\mu}_x \\ \hat{\beta} \end{bmatrix}$  (2 - 95)

$\boldsymbol{Z}$ and $X_0$ are independent. The expectation and covariance of $\boldsymbol{Z}$ given $X_0 = x_0$ are

$E(\boldsymbol{Z}|x_0) = E(\boldsymbol{Z}) = \boldsymbol{\mu}_Z = \begin{bmatrix} E(Z_1) \\ E(Z_2) \\ E(Z_3) \end{bmatrix} = \begin{bmatrix} E(\hat{\mu}_y) \\ E(\hat{\mu}_x) \\ E(\hat{\beta}) \end{bmatrix} = \begin{bmatrix} \mu_y \\ \mu_x \\ \beta \end{bmatrix}$  (2 - 96)

and

$Cov(\boldsymbol{Z}|x_0) = Cov(\boldsymbol{Z}) = E[\boldsymbol{Z} - E(\boldsymbol{Z})][\boldsymbol{Z} - E(\boldsymbol{Z})]' = \begin{bmatrix} Var(Z_1) & Cov(Z_1, Z_2) & Cov(Z_1, Z_3) \\ Cov(Z_1, Z_2) & Var(Z_2) & Cov(Z_2, Z_3) \\ Cov(Z_1, Z_3) & Cov(Z_2, Z_3) & Var(Z_3) \end{bmatrix}$

$= \begin{bmatrix} \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} & \frac{\beta\sigma_x^2}{n} & 0 \\ \frac{\beta\sigma_x^2}{n} & \frac{\sigma_x^2}{n} & 0 \\ 0 & 0 & \frac{\sigma_e^2}{(m-3)\sigma_x^2} \end{bmatrix}$  (2 - 97)

where, by $(2 - 56)$, we have

$Var(Z_1) = Var(\hat{\mu}_y) = \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n}$

by $(2 - 25)$,

$Var(Z_2) = Var(\hat{\mu}_x) = \frac{\sigma_x^2}{n}$

by $(2 - 36)$,

$Var(Z_3) = Var(\hat{\beta}) = \frac{\sigma_e^2}{(m-3)\sigma_x^2}$

By Appendix C, we have

$Cov(Z_1, Z_2) = Cov(\hat{\mu}_y, \hat{\mu}_x) = \frac{1}{n}\beta\sigma_x^2$

$Cov(Z_1, Z_3) = Cov(\hat{\mu}_y, \hat{\beta}) = 0$

$Cov(Z_2, Z_3) = Cov(\hat{\mu}_x, \hat{\beta}) = 0$

In terms of $\boldsymbol{z}$, we have

$\hat{y}_0|x_0 = z_1 + z_3(x_0 - z_2)$  (2 - 98)

By the ∆-method,

$\hat{y}_0|x_0 \approx \mu_y + \beta(x_0 - \mu_x) + \sum_{j=1}^3(\hat{y}'_{0j}|x_0)(z_j - \mu_{Z_j})$  (2 - 99)

where

$\hat{\boldsymbol{y}}'_0|x_0 = \begin{bmatrix} \hat{y}'_{01}|x_0 \\ \hat{y}'_{02}|x_0 \\ \hat{y}'_{03}|x_0 \end{bmatrix} = \begin{bmatrix} \partial y_0/\partial z_1 \\ \partial y_0/\partial z_2 \\ \partial y_0/\partial z_3 \end{bmatrix}_{\boldsymbol{z}=\boldsymbol{\mu}_Z} = \begin{bmatrix} 1 \\ -\beta \\ x_0 - \mu_x \end{bmatrix}$  (2 - 100)

Hence, the expectation of $\hat{Y}_0|x_0$ is

$E(\hat{Y}_0|x_0) \approx \mu_y + \beta(x_0 - \mu_x) + \sum_{j=1}^3(\hat{y}'_{0j}|x_0)E(Z_j - \mu_{Z_j}) = \mu_y + \beta(x_0 - \mu_x)$  (2 - 101)

The variance of $\hat{Y}_0|x_0$ is

$Var(\hat{Y}_0|x_0) = E[\hat{Y}_0|x_0 - E(\hat{Y}_0|x_0)]^2 \approx E\left[\sum_{j=1}^3(\hat{y}'_{0j}|x_0)(Z_j - \mu_{Z_j})\right]^2$

$= \sum_{j=1}^3[(\hat{y}'_{0j}|x_0)]^2 Var(Z_j) + 2\sum_{i=1}^3\sum_{j=1, i\neq j}^3(\hat{y}'_{0i}|x_0)(\hat{y}'_{0j}|x_0)Cov(Z_i, Z_j)$

$= 1^2 Var(Z_1) + (-\beta)^2 Var(Z_2) + (x_0 - \mu_x)^2 Var(Z_3) + 2\cdot 1\cdot(-\beta)Cov(Z_1, Z_2)$

$= \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} + \frac{\beta^2\sigma_x^2}{n} + \frac{(x_0 - \mu_x)^2\sigma_e^2}{(m-3)\sigma_x^2} - \frac{2\beta^2\sigma_x^2}{n}$

$= \sigma_e^2\left[\frac{1}{m} + \frac{n-m}{mn(m-3)} + \frac{(x_0 - \mu_x)^2}{(m-3)\sigma_x^2}\right]$  (2 - 102)

By assumptions,

$Y_0|x_0 \sim N(\mu_y + \beta(x_0 - \mu_x), \sigma_e^2)$  (2 - 103)

So, the expectation of $(Y_0 - \hat{Y}_0)|x_0$ is

$E(Y_0 - \hat{Y}_0)|x_0 = E(Y_0|x_0) - E(\hat{Y}_0|x_0) = \mu_y + \beta(x_0 - \mu_x) - [\mu_y + \beta(x_0 - \mu_x)] = 0$  (2 - 104)

The variance of $(Y_0 - \hat{Y}_0)|x_0$ is

$Var(Y_0 - \hat{Y}_0)|x_0 = Var(Y_0)|x_0 + Var(\hat{Y}_0)|x_0 - 2Cov(\hat{Y}_0, Y_0)|x_0$

$= \sigma_e^2 + \sigma_e^2\left[\frac{1}{m} + \frac{n-m}{mn(m-3)} + \frac{(x_0 - \mu_x)^2}{(m-3)\sigma_x^2}\right] - 2\cdot 0$

$= \sigma_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)} + \frac{(x_0 - \mu_x)^2}{(m-3)\sigma_x^2}\right]$  (2 - 105)

where

$Cov(\hat{Y}_0, Y_0)|x_0 = Cov[\hat{\mu}_y + \hat{\beta}(x_0 - \hat{\mu}_x), Y_0]|x_0 = 0$

When the sample is large,

$Z = \frac{(Y_0 - \hat{Y}_0)|x_0 - E(Y_0 - \hat{Y}_0)|x_0}{\sqrt{\sigma_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)} + \frac{(x_0 - \mu_x)^2}{(m-3)\sigma_x^2}\right]}} \mathrel{\dot\sim} N(0, 1)$

The 95% prediction interval for $Y_0$ given $X_0 = x_0$ is

$\hat{Y}_0|x_0 \pm z_{0.025}\sqrt{S_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)} + \frac{(x_0 - \bar{X}_n)^2}{(m-3)S_{xn}^2}\right]}$  (2 - 106)

where $S_{xn}^2$ and $S_e^2$ are given in $(2 - 28)$ and $(2 - 61)$, respectively.
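Interval $(2 - 106)$ is a closed-form function of the estimates, so it is a one-liner to evaluate. A sketch (the function name is ours; the caller supplies $\hat{Y}_0|x_0$ and the unbiased estimates $S_{xn}^2$, $S_e^2$):

```python
import math

def pi_given_x0(y0hat, m, n, x0, xbar_n, sxn2, se2, z=1.959964):
    # Large-sample 95% PI for Y0 given X0 = x0, following (2 - 106)
    half = z * math.sqrt(se2 * (1 + 1 / m + (n - m) / (m * n * (m - 3))
                                + (x0 - xbar_n) ** 2 / ((m - 3) * sxn2)))
    return y0hat - half, y0hat + half
```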

2.4.3 Unconditional prediction interval

By assumptions we have

$Y_0 \sim N(\mu_y, \sigma_y^2)$  (2 - 107)

$X_0 \sim N(\mu_x, \sigma_x^2)$  (2 - 108)

The prediction value of $Y_0$ is

$\hat{Y}_0 = \hat{\mu}_y + \hat{\beta}(X_0 - \hat{\mu}_x)$  (2 - 109)

By $(2 - 34)$ and $(2 - 46)$, the expectation of $\hat{Y}_0$ is

$E(\hat{Y}_0) = E(\hat{\mu}_y) + E[\hat{\beta}(X_0 - \hat{\mu}_x)] = \mu_y$  (2 - 110)

where

$E[\hat{\beta}(X_0 - \hat{\mu}_x)] = E(\hat{\beta}X_0) - E(\hat{\beta}\hat{\mu}_x) = E(\hat{\beta})E(X_0) - E(\hat{\beta})E(\hat{\mu}_x) = \beta\mu_x - \beta\mu_x = 0$

$\hat{\mu}_x$ and $\hat{\beta}$ are uncorrelated by $(2 - 94)$, and by assumption $(\hat{\mu}_x, \hat{\beta})$ are independent of $(X_0, Y_0)$.

We derive the variance of $\hat{Y}_0$ by the ∆-method. Let

$\boldsymbol{Z} = \begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix} = \begin{bmatrix} \hat{\mu}_y \\ \hat{\mu}_x \\ \hat{\beta} \\ X_0 \end{bmatrix}$  (2 - 111)

The expectation and covariance of $\boldsymbol{Z}$ are

$E(\boldsymbol{Z}) = \boldsymbol{\mu}_Z = \begin{bmatrix} E(\hat{\mu}_y) \\ E(\hat{\mu}_x) \\ E(\hat{\beta}) \\ E(X_0) \end{bmatrix} = \begin{bmatrix} \mu_y \\ \mu_x \\ \beta \\ \mu_x \end{bmatrix}$  (2 - 112)

and

$Cov(\boldsymbol{Z}) = E[\boldsymbol{Z} - E(\boldsymbol{Z})][\boldsymbol{Z} - E(\boldsymbol{Z})]' = \begin{bmatrix} \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} & \frac{\beta\sigma_x^2}{n} & 0 & 0 \\ \frac{\beta\sigma_x^2}{n} & \frac{\sigma_x^2}{n} & 0 & 0 \\ 0 & 0 & \frac{\sigma_e^2}{(m-3)\sigma_x^2} & 0 \\ 0 & 0 & 0 & \sigma_x^2 \end{bmatrix}$  (2 - 113)

In terms of $\boldsymbol{z}$, we have

$\hat{y}_0 = z_1 + z_3(z_4 - z_2)$  (2 - 114)

By the ∆-method,

$\hat{y}_0 \approx \mu_y + \sum_{j=1}^4\hat{y}'_{0j}(\boldsymbol{\mu}_Z)(z_j - \mu_{Z_j})$  (2 - 115)

where

$\hat{\boldsymbol{y}}'_0(\boldsymbol{\mu}_Z) = \begin{bmatrix} \hat{y}'_{01}(\boldsymbol{\mu}_Z) \\ \hat{y}'_{02}(\boldsymbol{\mu}_Z) \\ \hat{y}'_{03}(\boldsymbol{\mu}_Z) \\ \hat{y}'_{04}(\boldsymbol{\mu}_Z) \end{bmatrix} = \begin{bmatrix} \partial y_0/\partial z_1 \\ \partial y_0/\partial z_2 \\ \partial y_0/\partial z_3 \\ \partial y_0/\partial z_4 \end{bmatrix}_{\boldsymbol{z}=\boldsymbol{\mu}_Z} = \begin{bmatrix} 1 \\ -\beta \\ 0 \\ \beta \end{bmatrix}$  (2 - 116)

so

$E(\hat{Y}_0) \approx \mu_y + \sum_{j=1}^4\hat{y}'_{0j}(\boldsymbol{\mu}_Z)E(Z_j - \mu_{Z_j}) = \mu_y$  (2 - 117)

The variance of $\hat{Y}_0$ is

$Var(\hat{Y}_0) = E[\hat{Y}_0 - E(\hat{Y}_0)]^2 \approx E\left[\sum_{j=1}^4\hat{y}'_{0j}(\boldsymbol{\mu}_Z)(Z_j - \mu_{Z_j})\right]^2$

$= \sum_{j=1}^4[\hat{y}'_{0j}(\boldsymbol{\mu}_Z)]^2 Var(Z_j) + 2\sum_{i=1}^4\sum_{j=1, i\neq j}^4\hat{y}'_{0i}(\boldsymbol{\mu}_Z)\hat{y}'_{0j}(\boldsymbol{\mu}_Z)Cov(Z_i, Z_j)$

$= 1^2 Var(Z_1) + (-\beta)^2 Var(Z_2) + 0\cdot Var(Z_3) + \beta^2 Var(Z_4) + 2\cdot 1\cdot(-\beta)Cov(Z_1, Z_2)$

$= \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \frac{\beta^2\sigma_x^2}{n} + \frac{\beta^2\sigma_x^2}{n} + \beta^2\sigma_x^2 - \frac{2\beta^2\sigma_x^2}{n}$

$= \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \beta^2\sigma_x^2$  (2 - 118)

The expectation and variance of $(Y_0 - \hat{Y}_0)$ are

$E(Y_0 - \hat{Y}_0) = E(Y_0) - E(\hat{Y}_0) = \mu_y - \mu_y = 0$  (2 - 119)

$Var(Y_0 - \hat{Y}_0) = Var(Y_0) + Var(\hat{Y}_0) - 2Cov(\hat{Y}_0, Y_0)$

$= \sigma_y^2 + \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] + \beta^2\sigma_x^2 - 2\beta^2\sigma_x^2 = \sigma_y^2 + \frac{\sigma_e^2}{m}\left[1 + \frac{n-m}{n(m-3)}\right] - \beta^2\sigma_x^2$

$= \sigma_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)}\right]$  (2 - 120)

where

$Cov(\hat{Y}_0, Y_0) = Cov[\hat{\mu}_y + \hat{\beta}(X_0 - \hat{\mu}_x), Y_0] = Cov(\hat{\mu}_y, Y_0) + Cov(\hat{\beta}X_0, Y_0) - Cov(\hat{\beta}\hat{\mu}_x, Y_0)$

$= 0 + E(\hat{\beta}X_0 Y_0) - E(\hat{\beta}X_0)E(Y_0) - 0 = E(\hat{\beta})E(X_0 Y_0) - E(\hat{\beta})E(X_0)E(Y_0)$

$= \beta[E(X_0 Y_0) - E(X_0)E(Y_0)] = \beta\sigma_{xy} = \beta^2\sigma_x^2$  (2 - 121)

When the sample is large,

$Z = \frac{(Y_0 - \hat{Y}_0) - E(Y_0 - \hat{Y}_0)}{\sqrt{\sigma_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)}\right]}} \mathrel{\dot\sim} N(0, 1)$

The 95% prediction interval for $Y_0$ is

$\hat{Y}_0 \pm z_{0.025}\sqrt{S_e^2\left[1 + \frac{1}{m} + \frac{n-m}{mn(m-3)}\right]}$  (2 - 122)

where $S_e^2$ is given in $(2 - 61)$.

This is the unconditional (not dependent on $X_0$) prediction interval for $Y_0$ of a future observation.
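The unconditional interval $(2 - 122)$ drops the $(x_0 - \bar{X}_n)^2$ term, so its width depends only on $m$, $n$, and $S_e^2$. A sketch (our function name):

```python
import math

def pi_unconditional(y0hat, m, n, se2, z=1.959964):
    # Large-sample 95% unconditional PI for Y0, following (2 - 122);
    # se2 is the unbiased estimate S_e^2 of (2 - 61)
    half = z * math.sqrt(se2 * (1 + 1 / m + (n - m) / (m * n * (m - 3))))
    return y0hat - half, y0hat + half
```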

2.5 An Example for the Bivariate Situation

The College Admission data (Table 1-1) is from Han and Li (2011). In this example, we take the TOEFL score as $Y$ and the GRE Verbal, Quantitative, and Analytic scores as $X$, respectively, and compute the five maximum likelihood estimators above. A normality test shows that $Y$ and $X$ are normally distributed.

If we consider all information $X_1, X_2, \ldots, X_n$, the five estimators are

Table 2-4 Estimators with All Information

Considering the regression of Y on GRE Verbal:
  X = Verbal: $\hat{\mu}_x$ = 419.50, $\hat{\mu}_y$ = 552.44, $\hat{\beta}$ = 0.1727, $\hat{\sigma}_x^2$ = 11614.75, $\hat{\sigma}_e^2$ = 826.19

Considering the regression of Y on GRE Quantitative:
  X = Quantitative: $\hat{\mu}_x$ = 646.50, $\hat{\mu}_y$ = 540.98, $\hat{\beta}$ = -0.0301, $\hat{\sigma}_x^2$ = 11707.75, $\hat{\sigma}_e^2$ = 918.78

Considering the regression of Y on GRE Analytic:
  X = Analytic: $\hat{\mu}_x$ = 523.00, $\hat{\mu}_y$ = 541.62, $\hat{\beta}$ = 0.0471, $\hat{\sigma}_x^2$ = 12239.75, $\hat{\sigma}_e^2$ = 898.47

If we do not consider the extra information $X_{m+1}, X_{m+2}, \ldots, X_n$ and only consider the first $m$ observations, the five estimators are

Table 2-5 Estimators without Extra Information

Considering the regression of Y on GRE Verbal:
  X = Verbal: $\hat{\mu}_{x\_no}$ = 342.00, $\hat{\mu}_{y\_no}$ = 539.05, $\hat{\beta}_{no}$ = 0.1727, $\hat{\sigma}_{x\_no}^2$ = 3276.00, $\hat{\sigma}_{e\_no}^2$ = 826.19

Considering the regression of Y on GRE Quantitative:
  X = Quantitative: $\hat{\mu}_{x\_no}$ = 710.50, $\hat{\mu}_{y\_no}$ = 539.05, $\hat{\beta}_{no}$ = -0.0301, $\hat{\sigma}_{x\_no}^2$ = 5694.75, $\hat{\sigma}_{e\_no}^2$ = 918.78

Considering the regression of Y on GRE Analytic:
  X = Analytic: $\hat{\mu}_{x\_no}$ = 468.50, $\hat{\mu}_{y\_no}$ = 539.05, $\hat{\beta}_{no}$ = 0.0471, $\hat{\sigma}_{x\_no}^2$ = 11500.25, $\hat{\sigma}_{e\_no}^2$ = 898.47

Chapter 3

STATISTICAL ESTIMATION IN MULTIPLE REGRESSION MODEL

WITH A BLOCK OF MISSING OBSERVATIONS

Let $\begin{bmatrix}\boldsymbol{X} \\ Y\end{bmatrix} = [X_1, X_2, \ldots, X_p, Y]^T$ have a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$:

$\boldsymbol{\mu} = \begin{bmatrix}\boldsymbol{\mu}_x \\ \mu_y\end{bmatrix} = [\mu_{x1}, \mu_{x2}, \ldots, \mu_{xp}, \mu_y]^T, \quad \boldsymbol{\Sigma} = \begin{bmatrix}\boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \sigma_y^2\end{bmatrix}$

Suppose we have the following random sample with a block of missing Y values:

푋1,1 푋1,2 ⋯ 푋1,푝 푌1

푋2,1 푋2,2 ⋯ 푋2,푝 푌2

⋮ ⋮ ⋮ ⋮ ⋮

푋푚,1 푋푚,2 ⋯ 푋푚,푝 푌푚

푋푚+1,1 푋푚+1,2 ⋯ 푋푚+1,푝

⋮ ⋮ ⋮ ⋮

푋푛,1 푋푛,2 ⋯ 푋푛,푝

Based on the data, we want to estimate the parameters. We can write the multivariate normal probability density function (pdf) as

$f(\mathbf{x}, y) = g(y|\mathbf{x})h(\mathbf{x})$  (3 - 1)

where $g(y|\mathbf{x})$ is the conditional pdf of $Y$ given $\boldsymbol{X} = \mathbf{x}$, and $h(\mathbf{x})$ is the marginal pdf of $\boldsymbol{X}$:

$g_{Y|\boldsymbol{X}}(y_j|\mathbf{x}_j; \mu_y, \boldsymbol{\mu}_x, \boldsymbol{\beta}, \sigma_e^2) = \frac{1}{\sqrt{2\pi}\sigma_e}\exp\left\{-\frac{1}{2\sigma_e^2}[y_j - E(y_j|\mathbf{x}_j)]^2\right\}$

$= \frac{1}{\sqrt{2\pi}\sigma_e}\exp\left\{-\frac{1}{2\sigma_e^2}[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2\right\}, \quad j = 1, 2, \ldots, m$  (3 - 2)

$h_{\boldsymbol{X}}(\mathbf{x}_i; \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}_{xx}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)\right\}, \quad i = 1, 2, \ldots, n$  (3 - 3)

where

$E(y_j|\mathbf{x}_j) = \mu_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_x) = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)$  (3 - 4)

$\boldsymbol{\beta} = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$  (3 - 5)

$\sigma_e^2 = \sigma_y^2 - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy} = \sigma_y^2 - \boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta}$  (3 - 6)

The joint likelihood function is

$L(\mu_y, \boldsymbol{\mu}_x, \boldsymbol{\beta}, \boldsymbol{\Sigma}_{xx}, \sigma_e^2) = \prod_{j=1}^m g_{Y|\boldsymbol{X}}(y_j|\mathbf{x}_j; \mu_y, \boldsymbol{\mu}_x, \boldsymbol{\beta}, \sigma_e^2)\prod_{i=1}^n h_{\boldsymbol{X}}(\mathbf{x}_i; \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx})$

$= (2\pi)^{-\frac{m+np}{2}}\sigma_e^{-m}\exp\left\{-\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2\right\} \cdot |\boldsymbol{\Sigma}_{xx}|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)\right\}$  (3 - 7)

We will derive the maximum likelihood estimators by maximizing the likelihood function in the following section.

3.1 Maximum Likelihood Estimators

To obtain the maximum likelihood estimators, we need to maximize the following $(3 - 8)$ and $(3 - 9)$ simultaneously:

$|\boldsymbol{\Sigma}_{xx}|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)\right\}$  (3 - 8)

$\sigma_e^{-m}\exp\left\{-\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2\right\}$  (3 - 9)

Let us consider the exponent and find the MLEs of $\mu_y$, $\boldsymbol{\mu}_x$, and $\boldsymbol{\beta}$ that minimize

$\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2 + \frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)$  (3 - 10)

Since a scalar equals its own trace and the sum of traces equals the trace of the sum, we have

$\frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x) = \frac{1}{2}\sum_{i=1}^n tr[(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)]$

$= \frac{1}{2}\sum_{i=1}^n tr[\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T] = \frac{1}{2}tr\left\{\boldsymbol{\Sigma}_{xx}^{-1}\left[\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\right]\right\}$  (3 - 11)

Let

$\bar{\mathbf{x}}_n = \frac{1}{n}\left[\sum_{i=1}^n x_{i1}, \sum_{i=1}^n x_{i2}, \ldots, \sum_{i=1}^n x_{ip}\right]^T = [\bar{x}_{n1}, \bar{x}_{n2}, \ldots, \bar{x}_{np}]^T$

and rewrite each $(\mathbf{x}_i - \boldsymbol{\mu}_x)$ as

$\mathbf{x}_i - \boldsymbol{\mu}_x = (\mathbf{x}_i - \bar{\mathbf{x}}_n) + (\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)$

We have

$\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T = \sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T + n(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)^T$  (3 - 12)

where the cross-product terms vanish:

$\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)^T = \left[\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)\right](\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)^T = \mathbf{0}$

and

$\sum_{i=1}^n(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T = (\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)\left[\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T\right] = \mathbf{0}$

Replacing $\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)(\mathbf{x}_i - \boldsymbol{\mu}_x)^T$ in $(3 - 11)$ with $(3 - 12)$, we obtain

$\frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x) = \frac{1}{2}tr\left\{\boldsymbol{\Sigma}_{xx}^{-1}\left[\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T\right]\right\} + \frac{n}{2}(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)$  (3 - 13)

Similarly, let

$\bar{y}_m = \frac{1}{m}\sum_{j=1}^m y_j$

$\bar{\mathbf{x}}_m = \frac{1}{m}\left[\sum_{i=1}^m x_{i1}, \sum_{i=1}^m x_{i2}, \ldots, \sum_{i=1}^m x_{ip}\right]^T = [\bar{x}_{m1}, \bar{x}_{m2}, \ldots, \bar{x}_{mp}]^T$

Each $[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]$ can be written as

$y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x) = [y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)] + [\bar{y}_m - \mu_y - \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)]$

Then we get

$\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2 = \frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]^2 + \frac{m}{2\sigma_e^2}[\bar{y}_m - \mu_y - \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)]^2$  (3 - 14)

where the cross-product term

$\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)][\bar{y}_m - \mu_y - \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)] = 0$

So, if we minimize $(3 - 13)$ and $(3 - 14)$ simultaneously, $(3 - 10)$ will be minimized.

First, let us consider $(3 - 13)$. Since $\boldsymbol{\Sigma}_{xx}^{-1}$ is positive definite, each term in $(3 - 13)$ is greater than or equal to zero. The second term $n(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)/2$ is minimized if we set $\boldsymbol{\mu}_x = \bar{\mathbf{x}}_n$, so we have the MLE for $\boldsymbol{\mu}_x$:

$\hat{\boldsymbol{\mu}}_x = \bar{\mathbf{x}}_n$  (3 - 15)

Second, let us consider $(3 - 14)$. Both terms in it are non-negative. To minimize the first term in $(3 - 14)$, i.e.,

$\min\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]^2$

we take the derivative with respect to $\boldsymbol{\beta}$, set it to zero, and solve for the minimizing $\boldsymbol{\beta}$. By the method in Petersen and Pedersen (2012),

$\frac{\partial\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]^2}{\partial\boldsymbol{\beta}} = -2\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]\frac{\partial}{\partial\boldsymbol{\beta}}\{\boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)\}$

$= -2\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)](\mathbf{x}_j - \bar{\mathbf{x}}_m)$

Setting the above to $\mathbf{0}$, we have

$\sum_{j=1}^m[y_j - \bar{y}_m - \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)](\mathbf{x}_j - \bar{\mathbf{x}}_m) = \mathbf{0}$

Solving this equation gives

$\hat{\boldsymbol{\beta}} = \mathbf{S}_{xxm}^{-1}\mathbf{S}_{xym}$  (3 - 16)

where

$\mathbf{S}_{xxm} = \sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)(\mathbf{x}_j - \bar{\mathbf{x}}_m)^T$  (3 - 17)

$\mathbf{S}_{xym} = \sum_{j=1}^m(y_j - \bar{y}_m)(\mathbf{x}_j - \bar{\mathbf{x}}_m)$  (3 - 18)

Minimizing the second term in $(3 - 14)$, i.e., setting

$[\bar{y}_m - \mu_y - \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)]^2 = 0$

gives the MLE for $\mu_y$:

$\hat{\mu}_y = \bar{y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \hat{\boldsymbol{\mu}}_x)$  (3 - 19)

Now back to maximizing $(3 - 8)$ and $(3 - 9)$ simultaneously. When $\hat{\boldsymbol{\mu}}_x = \bar{\mathbf{x}}_n$, $(3 - 8)$ reduces to

$|\boldsymbol{\Sigma}_{xx}|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n(\mathbf{x}_i - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_x)\right\} = |\boldsymbol{\Sigma}_{xx}|^{-\frac{n}{2}}\exp\left\{-\frac{1}{2}tr\left\{\boldsymbol{\Sigma}_{xx}^{-1}\left[\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T\right]\right\}\right\}$  (3 - 20)

By Result 4.10 in Johnson and Wichern (1998), $(3 - 20)$ reaches its maximum when

$\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\sum_{i=1}^n(\mathbf{x}_i - \bar{\mathbf{x}}_n)(\mathbf{x}_i - \bar{\mathbf{x}}_n)^T$  (3 - 21)

Similarly, when $\hat{\mu}_y = \bar{y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \hat{\boldsymbol{\mu}}_x)$ and $\hat{\boldsymbol{\mu}}_x = \bar{\mathbf{x}}_n$, $(3 - 9)$ reduces to

$\sigma_e^{-m}\exp\left\{-\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)]^2\right\} = \sigma_e^{-m}\exp\left\{-\frac{1}{2\sigma_e^2}\sum_{j=1}^m[y_j - \bar{y}_m - \hat{\boldsymbol{\beta}}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]^2\right\}$  (3 - 22)

So, by Result 4.10 in Johnson and Wichern (1998), $(3 - 22)$ reaches its maximum when

$\hat{\sigma}_e^2 = \frac{1}{m}\sum_{j=1}^m[y_j - \bar{y}_m - \hat{\boldsymbol{\beta}}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m)]^2$  (3 - 23)

In summary, we have the following five maximum likelihood estimators:

$\hat{\boldsymbol{\mu}}_x = \bar{\boldsymbol{X}}_n$  (3 - 24)

$\hat{\mu}_y = \bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)$  (3 - 25)

$\hat{\boldsymbol{\beta}} = \mathbf{S}_{xxm}^{-1}\mathbf{S}_{xym}$  (3 - 26)

$\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\sum_{i=1}^n(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)^T$  (3 - 27)

$\hat{\sigma}_e^2 = \frac{1}{m}\sum_{j=1}^m[Y_j - \bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_m)]^2$  (3 - 28)

Similarly, if we do not consider the extra information $\boldsymbol{X}_{m+1}, \boldsymbol{X}_{m+2}, \ldots, \boldsymbol{X}_n$ and only use the first $m$ observations, we have

$\hat{\boldsymbol{\mu}}_{x\_no} = \bar{\boldsymbol{X}}_m$  (3 - 29)

$\hat{\mu}_{y\_no} = \bar{Y}_m$  (3 - 30)

$\hat{\boldsymbol{\beta}}_{no} = \mathbf{S}_{xxm}^{-1}\mathbf{S}_{xym}$  (3 - 31)

$\hat{\boldsymbol{\Sigma}}_{xx\_no} = \frac{1}{m}\sum_{i=1}^m(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_m)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_m)^T$  (3 - 32)

$\hat{\sigma}_{e\_no}^2 = \frac{1}{m}\sum_{j=1}^m[Y_j - \bar{Y}_m - \hat{\boldsymbol{\beta}}_{no}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_m)]^2$  (3 - 33)
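The five estimators $(3 - 24)$-$(3 - 28)$ translate directly into NumPy. A sketch (the function name is ours; $X$ holds all $n$ predictor rows, $y$ the first $m$ responses):

```python
import numpy as np

def mle_block_missing(X, y):
    # MLEs (3 - 24)-(3 - 28) from an n x p predictor matrix X and a length-m
    # response vector y; the last n-m rows of X have missing Y values.
    n, p = X.shape
    m = len(y)
    Xm = X[:m]
    mu_x = X.mean(axis=0)                         # (3 - 24)
    xbar_m = Xm.mean(axis=0)
    Sxxm = (Xm - xbar_m).T @ (Xm - xbar_m)        # (3 - 17)
    Sxym = (Xm - xbar_m).T @ (y - y.mean())       # (3 - 18)
    beta = np.linalg.solve(Sxxm, Sxym)            # (3 - 26)
    mu_y = y.mean() - beta @ (xbar_m - mu_x)      # (3 - 25)
    Sigma_xx = (X - mu_x).T @ (X - mu_x) / n      # (3 - 27)
    resid = y - y.mean() - (Xm - xbar_m) @ beta
    sigma_e2 = resid @ resid / m                  # (3 - 28)
    return mu_x, mu_y, beta, Sigma_xx, sigma_e2
```

Dropping the extra rows of $X$ before the call reproduces the "no extra information" estimators $(3 - 29)$-$(3 - 33)$.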

3.2 Properties of the Maximum Likelihood Estimators

3.2.1 Estimator of the Mean Vector of 푿

The expectation of $\hat{\boldsymbol{\mu}}_x$ is

$E(\hat{\boldsymbol{\mu}}_x) = E(\bar{\boldsymbol{X}}_n) = E[\bar{X}_{n1}, \bar{X}_{n2}, \ldots, \bar{X}_{np}]^T = [E(\bar{X}_{n1}), E(\bar{X}_{n2}), \ldots, E(\bar{X}_{np})]^T = [\mu_{x1}, \mu_{x2}, \ldots, \mu_{xp}]^T = \boldsymbol{\mu}_x$  (3 - 34)

So $\hat{\boldsymbol{\mu}}_x$ is an unbiased estimator. The covariance of $\hat{\boldsymbol{\mu}}_x$ is

$Cov(\hat{\boldsymbol{\mu}}_x) = Cov(\bar{\boldsymbol{X}}_n) = E(\bar{\boldsymbol{X}}_n - \boldsymbol{\mu}_x)(\bar{\boldsymbol{X}}_n - \boldsymbol{\mu}_x)^T = \frac{1}{n^2}\sum_{j=1}^n\sum_{l=1}^n E(\boldsymbol{X}_j - \boldsymbol{\mu}_x)(\boldsymbol{X}_l - \boldsymbol{\mu}_x)^T$

$= \frac{1}{n^2}\sum_{j=1}^n E(\boldsymbol{X}_j - \boldsymbol{\mu}_x)(\boldsymbol{X}_j - \boldsymbol{\mu}_x)^T = \frac{1}{n}\boldsymbol{\Sigma}_{xx}$  (3 - 35)

By assumptions, $\boldsymbol{X} \sim N_p(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{xx})$, so $\hat{\boldsymbol{\mu}}_x$ has a $p$-variate normal distribution too, i.e.,

$\hat{\boldsymbol{\mu}}_x \sim N_p\left(\boldsymbol{\mu}_x, \frac{1}{n}\boldsymbol{\Sigma}_{xx}\right)$

3.2.2 Estimator of the Covariance Matrix of 푿

Since $\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots, \boldsymbol{X}_n$ is a random sample of size $n$ from a $p$-variate normal distribution with mean $\boldsymbol{\mu}_x$ and covariance matrix $\boldsymbol{\Sigma}_{xx}$,

$\sum_{i=1}^n(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)^T \sim W_p(\boldsymbol{\Sigma}_{xx}, n-1)$

where $W_p(\boldsymbol{\Sigma}_{xx}, n-1)$ is the Wishart distribution with $(n-1)$ degrees of freedom.

We have

$\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n}\sum_{i=1}^n(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)^T$

so

$n\hat{\boldsymbol{\Sigma}}_{xx} \sim W_p(\boldsymbol{\Sigma}_{xx}, n-1)$

By Nydick (2012), we have

$E(n\hat{\boldsymbol{\Sigma}}_{xx}) = (n-1)\boldsymbol{\Sigma}_{xx}$

$Var(n\hat{\Sigma}_{ij}) = (n-1)(\Sigma_{ij}^2 + \Sigma_{ii}\Sigma_{jj}), \quad i, j = 1, 2, \ldots, p$

So, the expectation of $\hat{\boldsymbol{\Sigma}}_{xx}$ is

$E(\hat{\boldsymbol{\Sigma}}_{xx}) = \frac{n-1}{n}\boldsymbol{\Sigma}_{xx}$  (3 - 36)

$\hat{\boldsymbol{\Sigma}}_{xx}$ is a biased estimator, with

$Var(\hat{\Sigma}_{ij}) = \frac{n-1}{n^2}(\Sigma_{ij}^2 + \Sigma_{ii}\Sigma_{jj})$  (3 - 37)

If we define

$\mathbf{S}_{xn} = \frac{n}{n-1}\hat{\boldsymbol{\Sigma}}_{xx} = \frac{1}{n-1}\sum_{i=1}^n(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)^T$  (3 - 38)

then we have

$E(\mathbf{S}_{xn}) = \frac{n}{n-1}E(\hat{\boldsymbol{\Sigma}}_{xx}) = \boldsymbol{\Sigma}_{xx}$

$\mathbf{S}_{xn}$ is an unbiased estimator of $\boldsymbol{\Sigma}_{xx}$.

3.2.3 Estimator of the Regression Coefficient Vector 휷̂

In this section, we first derive the conditional expectation and covariance matrix of $\hat{\boldsymbol{\beta}}$ given $\boldsymbol{X} = \mathbf{x}$, then derive the unconditional expectation and covariance matrix of the estimator.

$E(\hat{\boldsymbol{\beta}}|\mathbf{x}) = E(\mathbf{S}_{xxm}^{-1}\mathbf{S}_{xym}|\mathbf{x}) = \mathbf{S}_{xxm}^{-1}E(\mathbf{S}_{xym}|\mathbf{x}) = \mathbf{S}_{xxm}^{-1}E\left\{\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)(Y_j - \bar{Y}_m)\Big|\mathbf{x}\right\}$

$= \mathbf{S}_{xxm}^{-1}\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)E(Y_j|\mathbf{x}) = \mathbf{S}_{xxm}^{-1}\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)[\mu_y + (\mathbf{x}_j - \boldsymbol{\mu}_x)^T\boldsymbol{\beta}]$

$= \mathbf{S}_{xxm}^{-1}\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)\mathbf{x}_j^T\boldsymbol{\beta} = \mathbf{S}_{xxm}^{-1}\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)(\mathbf{x}_j - \bar{\mathbf{x}}_m)^T\boldsymbol{\beta} = \mathbf{S}_{xxm}^{-1}\mathbf{S}_{xxm}\boldsymbol{\beta} = \boldsymbol{\beta}$  (3 - 39)

So the unconditional expectation of $\hat{\boldsymbol{\beta}}$ is

$E(\hat{\boldsymbol{\beta}}) = E[E(\hat{\boldsymbol{\beta}}|\mathbf{x})] = E(\boldsymbol{\beta}) = \boldsymbol{\beta}$  (3 - 40)

$\hat{\boldsymbol{\beta}}$ is an unbiased estimator.

Similarly, we derive the conditional covariance first, then use the Law of Total Covariance to obtain the unconditional covariance.

$Cov(\hat{\boldsymbol{\beta}}|\mathbf{x}) = Cov(\mathbf{S}_{xxm}^{-1}\mathbf{S}_{xym}|\mathbf{x}) = \mathbf{S}_{xxm}^{-1}Cov(\mathbf{S}_{xym}|\mathbf{x})(\mathbf{S}_{xxm}^{-1})^T = \mathbf{S}_{xxm}^{-1}Cov\left\{\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)(Y_j - \bar{Y}_m)\Big|\mathbf{x}\right\}\mathbf{S}_{xxm}^{-1}$

$= \mathbf{S}_{xxm}^{-1}\left\{\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)Var(Y_j|\mathbf{x})(\mathbf{x}_j - \bar{\mathbf{x}}_m)^T\right\}\mathbf{S}_{xxm}^{-1} = \sigma_e^2\mathbf{S}_{xxm}^{-1}\left\{\sum_{j=1}^m(\mathbf{x}_j - \bar{\mathbf{x}}_m)(\mathbf{x}_j - \bar{\mathbf{x}}_m)^T\right\}\mathbf{S}_{xxm}^{-1} = \sigma_e^2\mathbf{S}_{xxm}^{-1}$  (3 - 41)

By the Law of Total Covariance and by Nydick (2012), we have

$Cov(\hat{\boldsymbol{\beta}}) = E[Cov(\hat{\boldsymbol{\beta}}|\mathbf{x})] + Cov[E(\hat{\boldsymbol{\beta}}|\mathbf{x})] = E[\sigma_e^2\mathbf{S}_{xxm}^{-1}] + Cov(\boldsymbol{\beta}) = \sigma_e^2 E[\mathbf{S}_{xxm}^{-1}] = \frac{\sigma_e^2}{m-p-2}\boldsymbol{\Sigma}_{xx}^{-1}$  (3 - 42)

where $\mathbf{S}_{xxm}^{-1}$ has an inverse Wishart distribution

$\mathbf{S}_{xxm}^{-1} \sim InvW_p(\boldsymbol{\Sigma}_{xx}, m-1)$  (3 - 43)

When the sample is large, $\hat{\boldsymbol{\beta}}$ has an asymptotically $p$-variate normal distribution.

3.2.4 Estimator of the Mean of 푌

As in 3.2.3, in this section we first derive the conditional expectation and variance of $\hat{\mu}_y$ given $\boldsymbol{X} = \mathbf{x}$, then derive the unconditional expectation and variance of the estimator.

$E(\hat{\mu}_y|\mathbf{x}) = E\{[\bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)]|\mathbf{x}\} = E(\bar{Y}_m|\mathbf{x}) - E(\hat{\boldsymbol{\beta}}^T|\mathbf{x})(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)$

$= \mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x) - \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n) = \mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x)$  (3 - 44)

where

$E(\bar{Y}_m|\mathbf{x}) = E\left(\frac{1}{m}\sum_{j=1}^m Y_j\Big|\mathbf{x}_j\right) = \frac{1}{m}\sum_{j=1}^m[\mu_y + \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x)] = \mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)$  (3 - 45)

So, we have

$E(\hat{\mu}_y) = E(E(\hat{\mu}_y|\boldsymbol{X})) = E[\mu_y + \boldsymbol{\beta}^T(\bar{\boldsymbol{X}}_n - \boldsymbol{\mu}_x)] = \mu_y + \boldsymbol{\beta}^T(\boldsymbol{\mu}_x - \boldsymbol{\mu}_x) = \mu_y$  (3 - 46)

$\hat{\mu}_y$ is an unbiased estimator.

Since

$Var(\bar{Y}_m|\mathbf{x}) = \frac{1}{m^2}\sum_{j=1}^m Var(Y_j|\mathbf{x}_j) = \frac{\sigma_e^2}{m}$  (3 - 47)

$Var[\hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)|\mathbf{x}] = (\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T Var(\hat{\boldsymbol{\beta}}|\mathbf{x})(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n) = \sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)$  (3 - 48)

$Cov[\bar{Y}_m, \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)]|\mathbf{x} = [Cov(\bar{Y}_m, \hat{\boldsymbol{\beta}}^T)|\mathbf{x}](\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n) = 0$

the conditional variance of $\hat{\mu}_y$ is

$Var(\hat{\mu}_y|\mathbf{x}) = Var(\bar{Y}_m|\mathbf{x}) + Var[\hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)|\mathbf{x}] - 2Cov[\bar{Y}_m, \hat{\boldsymbol{\beta}}^T(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)]|\mathbf{x}$

$= \frac{\sigma_e^2}{m} + \sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)$  (3 - 49)

To obtain the unconditional variance of $\hat{\mu}_y$, we use the Law of Total Variance

$Var(\hat{\mu}_y) = Var[E(\hat{\mu}_y|\mathbf{x})] + E[Var(\hat{\mu}_y|\mathbf{x})]$

Now

$Var[E(\hat{\mu}_y|\mathbf{x})] = Var[\mu_y + \boldsymbol{\beta}^T(\bar{\boldsymbol{X}}_n - \boldsymbol{\mu}_x)] = \boldsymbol{\beta}^T Var(\bar{\boldsymbol{X}}_n)\boldsymbol{\beta} = \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta}$  (3 - 50)

To obtain $E[Var(\hat{\mu}_y|\mathbf{x})]$, we first find the distribution of $(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)$:

$\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n = \bar{\boldsymbol{X}}_m - \frac{1}{n}[m\bar{\boldsymbol{X}}_m + (n-m)\bar{\boldsymbol{X}}_{n-m}] = \frac{n-m}{n}(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m})$  (3 - 51)

$\bar{\boldsymbol{X}}_m$ and $\bar{\boldsymbol{X}}_{n-m}$ are independent and normally distributed, and since

$E(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m}) = E(\bar{\boldsymbol{X}}_m) - E(\bar{\boldsymbol{X}}_{n-m}) = \boldsymbol{\mu}_x - \boldsymbol{\mu}_x = \mathbf{0}$  (3 - 52)

$Cov(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m}) = Cov(\bar{\boldsymbol{X}}_m) + Cov(\bar{\boldsymbol{X}}_{n-m}) = \frac{n}{m(n-m)}\boldsymbol{\Sigma}_{xx}$  (3 - 53)

we have

$\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m} \sim N_p\left(\mathbf{0}, \frac{n}{m(n-m)}\boldsymbol{\Sigma}_{xx}\right)$

Replacing $(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)$ with $\frac{n-m}{n}(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m})$,

$(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n) = \left(\frac{n-m}{n}\right)^2(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m})^T\mathbf{S}_{xxm}^{-1}(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m}) = \frac{n-m}{mn(m-1)}T^2_{p,m-1}$  (3 - 54)

where

$T^2_{p,m-1} = \frac{(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m})^T}{\sqrt{\frac{n}{m(n-m)}}}\left(\frac{\mathbf{S}_{xxm}}{m-1}\right)^{-1}\frac{(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_{n-m})}{\sqrt{\frac{n}{m(n-m)}}}$

Hence,

$E[Var(\hat{\mu}_y|\mathbf{x})] = E\left[\frac{\sigma_e^2}{m} + \sigma_e^2(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)\right] = \frac{\sigma_e^2}{m} + \frac{(n-m)\sigma_e^2}{mn(m-1)}E(T^2_{p,m-1})$

$= \frac{\sigma_e^2}{m} + \frac{(n-m)\sigma_e^2}{mn(m-1)}\cdot\frac{(m-1)p}{m-p-2} = \frac{\sigma_e^2}{m}\left[1 + \frac{(n-m)p}{n(m-p-2)}\right]$  (3 - 55)

where

$E(T^2_{p,m-1}) = \frac{(m-1)p}{m-p}E(F_{p,m-p}) = \frac{(m-1)p}{m-p}\cdot\frac{m-p}{m-p-2} = \frac{(m-1)p}{m-p-2}$

Using $(3 - 50)$ and $(3 - 55)$, we have the unconditional variance of $\hat{\mu}_y$:

$Var(\hat{\mu}_y) = \frac{\sigma_e^2}{m}\left[1 + \frac{(n-m)p}{n(m-p-2)}\right] + \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta}$  (3 - 56)

$\hat{\mu}_y$ has an asymptotically normal distribution when the sample is large.
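The unconditional variance $(3 - 56)$ is a closed-form expression, so it can be evaluated directly. A sketch (our function name; $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}_{xx}$ are passed as arrays):

```python
import numpy as np

def var_mu_y_hat(m, n, p, se2, beta, Sigma_xx):
    # Unconditional variance of mu_y-hat, following (3 - 56)
    beta = np.asarray(beta, dtype=float)
    Sigma_xx = np.asarray(Sigma_xx, dtype=float)
    return se2 / m * (1 + (n - m) * p / (n * (m - p - 2))) \
        + beta @ Sigma_xx @ beta / n
```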

3.2.5 Estimator of the Conditional Variance of Y given x

We use the same idea as for the bivariate normal distribution. Since

$E(Y_j|\mathbf{x}_j) = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x), \quad j = 1, 2, \ldots, m$

for given $\mathbf{x}_j$ we may write

$Y_j = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x) + \varepsilon_j, \quad j = 1, 2, \ldots, m$  (3 - 57)

where

$E(\varepsilon_j) = E(Y_j|\mathbf{x}_j) - \mu_y - \boldsymbol{\beta}^T(\mathbf{x}_j - \boldsymbol{\mu}_x) = 0$

$Var(Y_j|\mathbf{x}_j) = Var(\varepsilon_j) = \sigma_e^2$

Then we have

$\hat{\varepsilon}_j = Y_j - \hat{\mu}_y - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \hat{\boldsymbol{\mu}}_x) = Y_j - [\bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)] - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_n) = Y_j - \bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_m)$

Hence

$\hat{\sigma}_e^2|\mathbf{x} = \frac{1}{m}\sum_{j=1}^m[Y_j - \bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_m)]^2\Big|\mathbf{x} = \frac{1}{m}\hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}}$  (3 - 58)

We may rewrite $(3 - 57)$ as

$Y_j = \beta_* + \boldsymbol{\beta}^T(\mathbf{x}_j - \bar{\mathbf{x}}_m) + \varepsilon_j, \quad j = 1, 2, \ldots, m$  (3 - 59)

where

$\beta_* = \mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_m - \boldsymbol{\mu}_x)$

Equation $(3 - 59)$ is the mean-corrected form of the multiple regression model. By Results 7.2 and 7.4 in Johnson and Wichern (1998),

$\hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}} \sim \sigma_e^2\chi^2_{m-p-1}$

So

$E(\hat{\sigma}_e^2|\mathbf{x}) = \frac{m-p-1}{m}\sigma_e^2, \quad Var(\hat{\sigma}_e^2|\mathbf{x}) = \frac{2(m-p-1)}{m^2}\sigma_e^4$

Hence, we have

$E(\hat{\sigma}_e^2) = E[E(\hat{\sigma}_e^2|\mathbf{x})] = \frac{m-p-1}{m}\sigma_e^2$  (3 - 60)

$\hat{\sigma}_e^2$ is a biased estimator. The bias of $\hat{\sigma}_e^2$ is

$Bias(\hat{\sigma}_e^2, \sigma_e^2) = E(\hat{\sigma}_e^2) - \sigma_e^2 = -\frac{p+1}{m}\sigma_e^2$

The bias vanishes as $m \to \infty$.

The unconditional variance of $\hat{\sigma}_e^2$ is

$Var(\hat{\sigma}_e^2) = Var[E(\hat{\sigma}_e^2|\mathbf{x})] + E[Var(\hat{\sigma}_e^2|\mathbf{x})] = \frac{2(m-p-1)}{m^2}\sigma_e^4$  (3 - 61)

Since $\hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}}/\sigma_e^2$ does not depend on $\mathbf{x}$, the unconditional $\hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}}/\sigma_e^2$ also has the $\chi^2_{m-p-1}$ distribution, i.e.,

$\frac{m\hat{\sigma}_e^2}{\sigma_e^2} \sim \chi^2_{m-p-1}$

The mean squared error of $\hat{\sigma}_e^2$ is

$MSE(\hat{\sigma}_e^2) = E(\hat{\sigma}_e^2 - \sigma_e^2)^2 = Var(\hat{\sigma}_e^2) + [Bias(\hat{\sigma}_e^2, \sigma_e^2)]^2$

$= \frac{2(m-p-1)\sigma_e^4}{m^2} + \frac{(p+1)^2\sigma_e^4}{m^2} = \frac{2m + (p+1)(p-1)}{m^2}\sigma_e^4$  (3 - 62)
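The multivariate MSE $(3 - 62)$ generalizes the bivariate case $(2 - 77)$, which it recovers at $p = 1$. A sketch (our function name) computing it from the variance and bias:

```python
def sigma_e2_mse_multi(m, p, sigma_e2):
    # MSE of the MLE of sigma_e^2 in the p-predictor model, following (3 - 62):
    # Var + Bias^2 = [2(m-p-1) + (p+1)^2]/m^2 * s^4 = [2m + (p+1)(p-1)]/m^2 * s^4
    var = 2.0 * (m - p - 1) / m**2 * sigma_e2**2
    bias = -(p + 1) / m * sigma_e2
    return var + bias**2
```

Setting $p = 1$ gives $2m/m^2\,\sigma_e^4 = 2\sigma_e^4/m$, the bivariate result.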

3.3 Prediction

Suppose we have a future observation $(X_{01}, X_{02}, \ldots, X_{0p}, Y_0)$,

$\begin{pmatrix}\boldsymbol{X}_0 \\ Y_0\end{pmatrix} \sim N_{p+1}\left(\begin{bmatrix}\boldsymbol{\mu}_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \sigma_y^2\end{bmatrix}\right)$  (3 - 63)

where $\boldsymbol{X}_0$ is a $p$-dimensional vector.

We have the following three kinds of prediction interval for $Y_0$:

1) Usual prediction interval for $Y_0$ – conditioning on $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$

2) Prediction interval for $Y_0$ – unconditional on $\boldsymbol{X}$, but conditioning on $\boldsymbol{X}_0 = \mathbf{x}_0$

3) Unconditional prediction interval for $Y_0$

3.3.1 Usual prediction interval

– Conditioning on $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$

The prediction value of $Y_0$ given $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$\hat{Y}_0|\mathbf{x}, \mathbf{x}_0 = \hat{\mu}_y + \hat{\boldsymbol{\beta}}^T(\mathbf{x}_0 - \bar{\mathbf{x}}_n)$  (3 - 64)

By our assumption, $(\hat{\mu}_y, \hat{\boldsymbol{\beta}})$ are independent of $(\boldsymbol{X}_0, Y_0)$. By equations $(3 - 26)$, $(3 - 39)$, $(3 - 41)$, and since $(Y|\mathbf{x})$ is normal, the distribution of $(\hat{\boldsymbol{\beta}}|\mathbf{x})$ is

$(\hat{\boldsymbol{\beta}}|\mathbf{x}) \sim N_p(\boldsymbol{\beta}, \sigma_e^2\mathbf{S}_{xxm}^{-1})$  (3 - 65)

By equations $(3 - 25)$, $(3 - 44)$, $(3 - 49)$, and since $(Y|\mathbf{x})$ and $(\hat{\boldsymbol{\beta}}|\mathbf{x})$ are normal, the distribution of $(\hat{\mu}_y|\mathbf{x})$ is

$(\hat{\mu}_y|\mathbf{x}) \sim N\left(\mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x),\ \frac{\sigma_e^2}{m} + \sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)\right)$  (3 - 66)

So, we have

$E(\hat{Y}_0|\mathbf{x}, \mathbf{x}_0) = E(\hat{\mu}_y|\mathbf{x}) + E(\hat{\boldsymbol{\beta}}^T|\mathbf{x})(\mathbf{x}_0 - \bar{\mathbf{x}}_n) = \mu_y + \boldsymbol{\beta}^T(\bar{\mathbf{x}}_n - \boldsymbol{\mu}_x) + \boldsymbol{\beta}^T(\mathbf{x}_0 - \bar{\mathbf{x}}_n) = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x)$  (3 - 67)

By our assumption,

$Y_0|\mathbf{x}, \mathbf{x}_0 \sim N(\mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x), \sigma_e^2)$  (3 - 68)

Then we have the conditional variance of $\hat{Y}_0$ as follows:

$Var(\hat{Y}_0|\mathbf{x}, \mathbf{x}_0) = Var(\hat{\mu}_y|\mathbf{x}) + (\mathbf{x}_0 - \bar{\mathbf{x}}_n)^T Cov(\hat{\boldsymbol{\beta}}|\mathbf{x})(\mathbf{x}_0 - \bar{\mathbf{x}}_n) + 2Cov[(\hat{\mu}_y, \hat{\boldsymbol{\beta}}^T)|\mathbf{x}](\mathbf{x}_0 - \bar{\mathbf{x}}_n)$

$= \frac{\sigma_e^2}{m} + \sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n) + \sigma_e^2(\mathbf{x}_0 - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_n) - 2\sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_n)$

$= \frac{\sigma_e^2}{m} + \sigma_e^2(\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m)$  (3 - 69)

where

$Cov[(\hat{\mu}_y, \hat{\boldsymbol{\beta}}^T)|\mathbf{x}] = Cov[(\bar{Y}_m - (\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\beta}}^T)|\mathbf{x}] = -(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T Cov(\hat{\boldsymbol{\beta}}|\mathbf{x}) = -\sigma_e^2(\bar{\mathbf{x}}_m - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xxm}^{-1}$  (3 - 70)

By $(3 - 67)$ and $(3 - 68)$, the expectation of $(Y_0 - \hat{Y}_0)$ given $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$E(Y_0 - \hat{Y}_0)|\mathbf{x}, \mathbf{x}_0 = E(Y_0|\mathbf{x}, \mathbf{x}_0) - E(\hat{Y}_0|\mathbf{x}, \mathbf{x}_0) = 0$  (3 - 71)

By $(3 - 68)$ and $(3 - 69)$, the variance of $(Y_0 - \hat{Y}_0)$ given $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$Var(Y_0 - \hat{Y}_0)|\mathbf{x}, \mathbf{x}_0 = Var(Y_0)|\mathbf{x}, \mathbf{x}_0 + Var(\hat{Y}_0)|\mathbf{x}, \mathbf{x}_0 - 2Cov(\hat{Y}_0, Y_0)|\mathbf{x}, \mathbf{x}_0$

$= \sigma_e^2 + \frac{\sigma_e^2}{m} + \sigma_e^2(\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m) - 2\cdot 0 = \sigma_e^2\left[1 + \frac{1}{m} + (\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m)\right]$  (3 - 72)

where

$Cov(\hat{Y}_0, Y_0)|\mathbf{x}, \mathbf{x}_0 = Cov[\hat{\mu}_y + \hat{\boldsymbol{\beta}}^T(\mathbf{x}_0 - \bar{\mathbf{x}}_n), Y_0]|\mathbf{x}, \mathbf{x}_0 = 0$

Since $Y_0$ and $\hat{Y}_0$ are normal, $Y_0 - \hat{Y}_0$ is normal, so

$Z = \frac{(Y_0 - \hat{Y}_0)|\mathbf{x}, \mathbf{x}_0 - 0}{\sqrt{\sigma_e^2\left[1 + \frac{1}{m} + (\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m)\right]}} \sim N(0, 1)$

Since

$U = \frac{m\hat{\sigma}_e^2}{\sigma_e^2} \sim \chi^2_{m-p-1}$

and $Z$ and $U$ are independent,

$T = \frac{Z}{\sqrt{U/(m-p-1)}} \sim t_{m-p-1}$

i.e.,

$T = \frac{(Y_0 - \hat{Y}_0)|\mathbf{x}, \mathbf{x}_0}{\sqrt{\frac{m\hat{\sigma}_e^2}{m-p-1}\left[1 + \frac{1}{m} + (\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m)\right]}} \sim t_{m-p-1}$

Hence, the 95% prediction interval for $Y_0$ given $\boldsymbol{X} = \mathbf{x}$ and $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$\hat{Y}_0|\mathbf{x}, \mathbf{x}_0 \pm t_{0.025, m-p-1}\sqrt{\frac{m\hat{\sigma}_e^2}{m-p-1}\left[1 + \frac{1}{m} + (\mathbf{x}_0 - \bar{\mathbf{x}}_m)^T\mathbf{S}_{xxm}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_m)\right]}$  (3 - 73)
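Interval $(3 - 73)$ involves only a quadratic form in $\mathbf{S}_{xxm}^{-1}$, which NumPy evaluates via a linear solve. A sketch (our function name; the caller supplies the point prediction, the cross-product matrix $(3 - 17)$, the MLE $(3 - 28)$, and the $t_{0.025, m-p-1}$ quantile):

```python
import numpy as np

def usual_pi_multivariate(y0hat, x0, xbar_m, Sxxm, se2_mle, m, p, t_crit):
    # 95% PI for Y0 given X = x and X0 = x0, following (3 - 73)
    d = np.asarray(x0, dtype=float) - np.asarray(xbar_m, dtype=float)
    quad = d @ np.linalg.solve(np.asarray(Sxxm, dtype=float), d)
    half = t_crit * np.sqrt(m * se2_mle / (m - p - 1) * (1 + 1 / m + quad))
    return y0hat - half, y0hat + half
```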

3.3.2 Prediction interval

– Unconditional on 푿, but conditioning on 푿0 = 풙0

In this situation, the prediction value of $Y_0$ given $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$\hat{Y}_0|\mathbf{x}_0 = \hat{\mu}_y + \hat{\boldsymbol{\beta}}^T(\mathbf{x}_0 - \hat{\boldsymbol{\mu}}_x)$  (3 - 74)

By $(3 - 42)$ and $(3 - 46)$, the expectation of $\hat{Y}_0$ given $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$E(\hat{Y}_0|\mathbf{x}_0) = E(\hat{\mu}_y|\mathbf{x}_0) + E[\hat{\boldsymbol{\beta}}^T(\mathbf{x}_0 - \hat{\boldsymbol{\mu}}_x)|\mathbf{x}_0] = E(\hat{\mu}_y) + E(\hat{\boldsymbol{\beta}}^T)\mathbf{x}_0 - E(\hat{\boldsymbol{\beta}}^T\hat{\boldsymbol{\mu}}_x) = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x)$  (3 - 75)

where

$E(\hat{\boldsymbol{\beta}}^T\hat{\boldsymbol{\mu}}_x) = E[E(\hat{\boldsymbol{\mu}}_x^T\hat{\boldsymbol{\beta}})|\mathbf{x}] = E[\hat{\boldsymbol{\mu}}_x^T E(\hat{\boldsymbol{\beta}})|\mathbf{x}] = \boldsymbol{\beta}^T E(\hat{\boldsymbol{\mu}}_x) = \boldsymbol{\beta}^T\boldsymbol{\mu}_x$  (3 - 76)

$\hat{\boldsymbol{\mu}}_x$ and $\hat{\boldsymbol{\beta}}^T$ are uncorrelated. By our assumption,

$Y_0|\mathbf{x}_0 \sim N(\mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x), \sigma_e^2)$  (3 - 77)

We derive the variance of $\hat{Y}_0|\mathbf{x}_0$ by the ∆-method. Let

$\boldsymbol{Z} = \begin{bmatrix} Z_1 \\ \boldsymbol{Z}_2 \\ \boldsymbol{Z}_3 \end{bmatrix} = \begin{bmatrix} \hat{\mu}_y \\ \hat{\boldsymbol{\mu}}_x \\ \hat{\boldsymbol{\beta}} \end{bmatrix}$  (3 - 78)

$\boldsymbol{Z}$ and $\boldsymbol{X}_0$ are independent. So

$E(\boldsymbol{Z}|\mathbf{x}_0) = E(\boldsymbol{Z}) = \boldsymbol{\mu}_Z = \begin{bmatrix} E(\hat{\mu}_y) \\ E(\hat{\boldsymbol{\mu}}_x) \\ E(\hat{\boldsymbol{\beta}}) \end{bmatrix} = \begin{bmatrix} \mu_y \\ \boldsymbol{\mu}_x \\ \boldsymbol{\beta} \end{bmatrix}$  (3 - 79)

$Cov(\boldsymbol{Z}|\mathbf{x}_0) = Cov(\boldsymbol{Z}) = E[\boldsymbol{Z} - E(\boldsymbol{Z})][\boldsymbol{Z} - E(\boldsymbol{Z})]' = \begin{bmatrix} \frac{\sigma_e^2}{m}\left[1 + \frac{(n-m)p}{n(m-p-2)}\right] + \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta} & \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx} & \mathbf{0} \\ \frac{1}{n}\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta} & \frac{1}{n}\boldsymbol{\Sigma}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \frac{\sigma_e^2}{m-p-2}\boldsymbol{\Sigma}_{xx}^{-1} \end{bmatrix}$  (3 - 80)

where, by $(3 - 56)$, we have

$Var(Z_1) = Var(\hat{\mu}_y) = \frac{\sigma_e^2}{m}\left[1 + \frac{(n-m)p}{n(m-p-2)}\right] + \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta}$

By $(3 - 35)$, we have

$Var(\boldsymbol{Z}_2) = Cov(\hat{\boldsymbol{\mu}}_x) = \frac{1}{n}\boldsymbol{\Sigma}_{xx}$

By $(3 - 42)$, we have

$Var(\boldsymbol{Z}_3) = Cov(\hat{\boldsymbol{\beta}}) = \frac{\sigma_e^2}{m-p-2}\boldsymbol{\Sigma}_{xx}^{-1}$

$Cov(Z_1, \boldsymbol{Z}_2) = Cov(\hat{\mu}_y, \hat{\boldsymbol{\mu}}_x) = Cov[\bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n), \bar{\boldsymbol{X}}_n] = Cov(\bar{Y}_m, \bar{\boldsymbol{X}}_n) - Cov[\hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n), \bar{\boldsymbol{X}}_n]$

$= E(\bar{Y}_m\bar{\boldsymbol{X}}_n^T) - E(\bar{Y}_m)E(\bar{\boldsymbol{X}}_n^T) - E[\hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)\bar{\boldsymbol{X}}_n^T] + E[\hat{\boldsymbol{\beta}}^T(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)]E(\bar{\boldsymbol{X}}_n^T)$

$= E\{[\mu_y + \boldsymbol{\beta}^T(\bar{\boldsymbol{X}}_m - \boldsymbol{\mu}_x)]\bar{\boldsymbol{X}}_n^T\} - \mu_y\boldsymbol{\mu}_x^T - \boldsymbol{\beta}^T E[(\bar{\boldsymbol{X}}_m - \bar{\boldsymbol{X}}_n)\bar{\boldsymbol{X}}_n^T] + \mathbf{0}$

$= -\boldsymbol{\beta}^T\boldsymbol{\mu}_x\boldsymbol{\mu}_x^T + \boldsymbol{\beta}^T E(\bar{\boldsymbol{X}}_n\bar{\boldsymbol{X}}_n^T) = -\boldsymbol{\beta}^T\boldsymbol{\mu}_x\boldsymbol{\mu}_x^T + \boldsymbol{\beta}^T[Var(\bar{\boldsymbol{X}}_n) + \boldsymbol{\mu}_x\boldsymbol{\mu}_x^T] = \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}$

$Cov(\hat{\mu}_y, \hat{\boldsymbol{\beta}}) = \mathbf{0} \;\Rightarrow\; Cov(Z_1, \boldsymbol{Z}_3) = Cov(\hat{\mu}_y, \hat{\boldsymbol{\beta}}^T) = \mathbf{0} \quad (1 \times p)$

$Cov(\hat{\boldsymbol{\mu}}_x, \hat{\boldsymbol{\beta}}) = \mathbf{0} \;\Rightarrow\; Cov(\boldsymbol{Z}_2, \boldsymbol{Z}_3) = Cov(\hat{\boldsymbol{\mu}}_x, \hat{\boldsymbol{\beta}}^T) = \mathbf{0} \quad (p \times p)$

In terms of $\boldsymbol{z}$, we have

$\hat{y}_0|\mathbf{x}_0 = z_1 + \mathbf{z}_3^T(\mathbf{x}_0 - \mathbf{z}_2)$  (3 - 81)

By the ∆-method,

$\hat{y}_0|\mathbf{x}_0 \approx \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x) + \sum_{j=1}^3[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)](z_j - \mu_{Z_j})$

where

$\hat{\boldsymbol{y}}'_0(\boldsymbol{\mu}_Z|\mathbf{x}_0) = \begin{bmatrix} \hat{y}'_{01}(\boldsymbol{\mu}_Z|\mathbf{x}_0) \\ \hat{y}'_{02}(\boldsymbol{\mu}_Z|\mathbf{x}_0) \\ \hat{y}'_{03}(\boldsymbol{\mu}_Z|\mathbf{x}_0) \end{bmatrix} = \begin{bmatrix} \partial y_0/\partial z_1 \\ \partial y_0/\partial\mathbf{z}_2 \\ \partial y_0/\partial\mathbf{z}_3 \end{bmatrix}_{\boldsymbol{z}=\boldsymbol{\mu}_Z} = \begin{bmatrix} 1 \\ -\boldsymbol{\beta} \\ \mathbf{x}_0 - \boldsymbol{\mu}_x \end{bmatrix}$  (3 - 82)

Hence the expectation of $\hat{Y}_0|\mathbf{x}_0$ is

$E(\hat{Y}_0|\mathbf{x}_0) \approx \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x) + \sum_{j=1}^3[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)]E(Z_j - \mu_{Z_j}) = \mu_y + \boldsymbol{\beta}^T(\mathbf{x}_0 - \boldsymbol{\mu}_x)$  (3 - 83)

The variance of $\hat{Y}_0|\mathbf{x}_0$ is

$Var(\hat{Y}_0|\mathbf{x}_0) = E[\hat{Y}_0|\mathbf{x}_0 - E(\hat{Y}_0|\mathbf{x}_0)]^2 \approx E\left[\sum_{j=1}^3[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)](Z_j - \mu_{Z_j})\right]^2$

$= \sum_{j=1}^3[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)]^T Var(Z_j)[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)] + 2\sum_{i=1}^3\sum_{j=1, i\neq j}^3[\hat{y}'_{0i}(\boldsymbol{\mu}_Z|\mathbf{x}_0)]^T Cov(Z_i, Z_j)[\hat{y}'_{0j}(\boldsymbol{\mu}_Z|\mathbf{x}_0)]$

$= Var(Z_1) + (-\boldsymbol{\beta}^T)Var(\boldsymbol{Z}_2)(-\boldsymbol{\beta}) + (\mathbf{x}_0 - \boldsymbol{\mu}_x)^T Var(\boldsymbol{Z}_3)(\mathbf{x}_0 - \boldsymbol{\mu}_x) + 2\cdot 1\cdot Cov(Z_1, \boldsymbol{Z}_2)(-\boldsymbol{\beta})$

$= \frac{\sigma_e^2}{m}\left[1 + \frac{(n-m)p}{n(m-p-2)}\right] + \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta} + \frac{1}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta} + \frac{\sigma_e^2}{m-p-2}(\mathbf{x}_0 - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_0 - \boldsymbol{\mu}_x) - \frac{2}{n}\boldsymbol{\beta}^T\boldsymbol{\Sigma}_{xx}\boldsymbol{\beta}$

$= \sigma_e^2\left[\frac{1}{m} + \frac{(n-m)p}{mn(m-p-2)} + \frac{(\mathbf{x}_0 - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_0 - \boldsymbol{\mu}_x)}{m-p-2}\right]$  (3 - 84)

By $(3 - 77)$ and $(3 - 83)$, the expectation of $(Y_0 - \hat{Y}_0)$ given $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$E(Y_0 - \hat{Y}_0)|\mathbf{x}_0 = E(Y_0|\mathbf{x}_0) - E(\hat{Y}_0|\mathbf{x}_0) = 0$  (3 - 85)

The variance of $(Y_0 - \hat{Y}_0)$ given $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$Var(Y_0 - \hat{Y}_0)|\mathbf{x}_0 = Var(Y_0)|\mathbf{x}_0 + Var(\hat{Y}_0)|\mathbf{x}_0 - 2Cov(\hat{Y}_0, Y_0)|\mathbf{x}_0$

$= \sigma_e^2 + \sigma_e^2\left[\frac{1}{m} + \frac{(n-m)p}{mn(m-p-2)} + \frac{(\mathbf{x}_0 - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_0 - \boldsymbol{\mu}_x)}{m-p-2}\right] - 2\cdot 0$

$= \sigma_e^2\left[1 + \frac{1}{m} + \frac{(n-m)p}{mn(m-p-2)} + \frac{(\mathbf{x}_0 - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_0 - \boldsymbol{\mu}_x)}{m-p-2}\right]$  (3 - 86)

where

$Cov(\hat{Y}_0, Y_0)|\mathbf{x}_0 = Cov[\hat{\mu}_y + \hat{\boldsymbol{\beta}}^T(\mathbf{x}_0 - \hat{\boldsymbol{\mu}}_x), Y_0]|\mathbf{x}_0 = 0$

When the sample is large, by $(3 - 85)$ and $(3 - 86)$,

$Z = \frac{(Y_0 - \hat{Y}_0)|\mathbf{x}_0 - 0}{\sqrt{\sigma_e^2\left[1 + \frac{1}{m} + \frac{(n-m)p}{mn(m-p-2)} + \frac{(\mathbf{x}_0 - \boldsymbol{\mu}_x)^T\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x}_0 - \boldsymbol{\mu}_x)}{m-p-2}\right]}} \mathrel{\dot\sim} N(0, 1)$

The 95% prediction interval for $Y_0$ given $\boldsymbol{X}_0 = \mathbf{x}_0$ is

$\hat{Y}_0|\mathbf{x}_0 \pm z_{0.025}\sqrt{S_e^2\left[1 + \frac{1}{m} + \frac{(n-m)p}{mn(m-p-2)} + \frac{(\mathbf{x}_0 - \bar{\mathbf{x}}_n)^T\mathbf{S}_{xn}^{-1}(\mathbf{x}_0 - \bar{\mathbf{x}}_n)}{m-p-2}\right]}$  (3 - 87)

where

$S_e^2 = \frac{1}{m-p-1}\sum_{j=1}^m[Y_j - \bar{Y}_m - \hat{\boldsymbol{\beta}}^T(\boldsymbol{X}_j - \bar{\boldsymbol{X}}_m)]^2$  (3 - 88)

and $\mathbf{S}_{xn}$ is given in $(3 - 38)$ as

$\mathbf{S}_{xn} = \frac{1}{n-1}\sum_{i=1}^n(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \bar{\boldsymbol{X}}_n)^T$

3.3.3 Unconditional prediction interval

By the assumptions in (3-63), we have

\[
Y_0\sim N(\mu_y,\sigma_y^2)\tag{3-89}
\]
\[
\mathbf X_0\sim N_p(\boldsymbol\mu_x,\boldsymbol\Sigma_{xx})\tag{3-90}
\]

The prediction value of \(Y_0\) is

\[
\hat Y_0=\hat\mu_y+\hat{\boldsymbol\beta}^T(\mathbf X_0-\hat{\boldsymbol\mu}_x)\tag{3-91}
\]

Let

\[
\mathbf Z=\begin{bmatrix}Z_1\\ \mathbf Z_2\\ \mathbf Z_3\\ \mathbf Z_4\end{bmatrix}=\begin{bmatrix}\hat\mu_y\\ \hat{\boldsymbol\mu}_x\\ \hat{\boldsymbol\beta}\\ \mathbf X_0\end{bmatrix}\tag{3-92}
\]

We have

\[
E(\mathbf Z)=\boldsymbol\mu_Z=\begin{bmatrix}E(Z_1)\\ E(\mathbf Z_2)\\ E(\mathbf Z_3)\\ E(\mathbf Z_4)\end{bmatrix}=\begin{bmatrix}E(\hat\mu_y)\\ E(\hat{\boldsymbol\mu}_x)\\ E(\hat{\boldsymbol\beta})\\ E(\mathbf X_0)\end{bmatrix}=\begin{bmatrix}\mu_y\\ \boldsymbol\mu_x\\ \boldsymbol\beta\\ \boldsymbol\mu_x\end{bmatrix}\tag{3-93}
\]

\[
\operatorname{Cov}(\mathbf Z)=E[\mathbf Z-E(\mathbf Z)][\mathbf Z-E(\mathbf Z)]^T=
\begin{bmatrix}
\operatorname{Var}(Z_1)&\operatorname{Cov}(Z_1,Z_2)&\operatorname{Cov}(Z_1,Z_3)&\operatorname{Cov}(Z_1,Z_4)\\
\operatorname{Cov}(Z_2,Z_1)&\operatorname{Var}(Z_2)&\operatorname{Cov}(Z_2,Z_3)&\operatorname{Cov}(Z_2,Z_4)\\
\operatorname{Cov}(Z_3,Z_1)&\operatorname{Cov}(Z_3,Z_2)&\operatorname{Var}(Z_3)&\operatorname{Cov}(Z_3,Z_4)\\
\operatorname{Cov}(Z_4,Z_1)&\operatorname{Cov}(Z_4,Z_2)&\operatorname{Cov}(Z_4,Z_3)&\operatorname{Var}(Z_4)
\end{bmatrix}
\]
\[
=\begin{bmatrix}
\dfrac{\sigma_e^2}{m}\left[1+\dfrac{(n-m)p}{n(m-p-2)}\right]+\dfrac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta & \dfrac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx} & \mathbf 0 & \mathbf 0\\[2mm]
\dfrac1n\boldsymbol\Sigma_{xx}\boldsymbol\beta & \dfrac1n\boldsymbol\Sigma_{xx} & \mathbf 0 & \mathbf 0\\[2mm]
\mathbf 0 & \mathbf 0 & \dfrac{\sigma_e^2}{m-p-2}\boldsymbol\Sigma_{xx}^{-1} & \mathbf 0\\[2mm]
\mathbf 0 & \mathbf 0 & \mathbf 0 & \boldsymbol\Sigma_{xx}
\end{bmatrix}\tag{3-94}
\]

In terms of \(\mathbf z\), we have

\[
\hat y_0=z_1+\mathbf z_3^T(\mathbf z_4-\mathbf z_2)\tag{3-95}
\]

By the delta method,

\[
\hat y_0\approx\mu_y+\sum_{j=1}^{4}\left[\hat y'_{0j}(\boldsymbol\mu_Z)\right]^T(z_j-\mu_{Z_j})
\]

where

\[
\hat{\mathbf y}'_0(\boldsymbol\mu_Z)=\begin{bmatrix}\hat y'_{01}(\boldsymbol\mu_Z)\\\hat y'_{02}(\boldsymbol\mu_Z)\\\hat y'_{03}(\boldsymbol\mu_Z)\\\hat y'_{04}(\boldsymbol\mu_Z)\end{bmatrix}=\left.\begin{bmatrix}\partial y_0/\partial z_1\\\partial y_0/\partial\mathbf z_2\\\partial y_0/\partial\mathbf z_3\\\partial y_0/\partial\mathbf z_4\end{bmatrix}\right|_{\mathbf z=\boldsymbol\mu_Z}=\begin{bmatrix}1\\-\boldsymbol\beta\\\mathbf 0\\\boldsymbol\beta\end{bmatrix}\tag{3-96}
\]

So the expectation of \(\hat Y_0\) is

\[
E(\hat Y_0)\approx\mu_y+\sum_{j=1}^{4}\left[\hat y'_{0j}(\boldsymbol\mu_Z)\right]^T E(Z_j-\mu_{Z_j})=\mu_y\tag{3-97}
\]

The variance of \(\hat Y_0\) is

\[
\operatorname{Var}(\hat Y_0)=E[\hat Y_0-E(\hat Y_0)]^2\approx E\left[\sum_{j=1}^{4}\left[\hat y'_{0j}(\boldsymbol\mu_Z)\right]^T(Z_j-\mu_{Z_j})\right]^2
\]
\[
=\sum_{j=1}^{4}[\hat y'_{0j}(\boldsymbol\mu_Z)]^T\operatorname{Var}(Z_j)[\hat y'_{0j}(\boldsymbol\mu_Z)]+2\sum_{i=1}^{4}\sum_{j=1,\,i\neq j}^{4}[\hat y'_{0i}(\boldsymbol\mu_Z)]^T\operatorname{Cov}(Z_i,Z_j)[\hat y'_{0j}(\boldsymbol\mu_Z)]
\]
\[
=\operatorname{Var}(Z_1)+(-\boldsymbol\beta^T)\operatorname{Var}(Z_2)(-\boldsymbol\beta)+\mathbf 0\cdot\operatorname{Var}(Z_3)+\boldsymbol\beta^T\operatorname{Var}(Z_4)\boldsymbol\beta+2\cdot1\cdot\operatorname{Cov}(Z_1,Z_2)(-\boldsymbol\beta)
\]
\[
=\operatorname{Var}(\hat\mu_y)+(-\boldsymbol\beta^T)\operatorname{Cov}(\hat{\boldsymbol\mu}_x)(-\boldsymbol\beta)+\mathbf 0\cdot\operatorname{Cov}(\hat{\boldsymbol\beta})+\boldsymbol\beta^T\operatorname{Cov}(\mathbf X_0)\boldsymbol\beta+2\cdot1\cdot\operatorname{Cov}(\hat\mu_y,\hat{\boldsymbol\mu}_x)(-\boldsymbol\beta)
\]
\[
=\frac{\sigma_e^2}{m}\left[1+\frac{(n-m)p}{n(m-p-2)}\right]+\frac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta+\frac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta+0+\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta-\frac2n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta
\]
\[
=\frac{\sigma_e^2}{m}\left[1+\frac{(n-m)p}{n(m-p-2)}\right]+\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta\tag{3-98}
\]

By (3-89) and (3-97), the expectation of \((Y_0-\hat Y_0)\) is

\[
E(Y_0-\hat Y_0)=E(Y_0)-E(\hat Y_0)=0\tag{3-99}
\]

By (3-6), (3-63), (3-89) and (3-98), the variance of \((Y_0-\hat Y_0)\) is

\[
\operatorname{Var}(Y_0-\hat Y_0)=\operatorname{Var}(Y_0)+\operatorname{Var}(\hat Y_0)-2\operatorname{Cov}(\hat Y_0,Y_0)
\]
\[
=\sigma_y^2+\frac{\sigma_e^2}{m}\left[1+\frac{(n-m)p}{n(m-p-2)}\right]+\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta-2\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta
\]
\[
=\sigma_y^2+\frac{\sigma_e^2}{m}\left[1+\frac{(n-m)p}{n(m-p-2)}\right]-\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta
=\sigma_e^2\left[1+\frac1m+\frac{(n-m)p}{nm(m-p-2)}\right]\tag{3-100}
\]

where

\[
\operatorname{Cov}(\hat Y_0,Y_0)=\operatorname{Cov}[\hat\mu_y+\hat{\boldsymbol\beta}^T(\mathbf X_0-\hat{\boldsymbol\mu}_x),\,Y_0]
=\operatorname{Cov}(\hat\mu_y,Y_0)+\operatorname{Cov}(\hat{\boldsymbol\beta}^T\mathbf X_0,Y_0)-\operatorname{Cov}(\hat{\boldsymbol\beta}^T\hat{\boldsymbol\mu}_x,Y_0)
\]
\[
=0+E(\hat{\boldsymbol\beta}^T\mathbf X_0Y_0)-E(\hat{\boldsymbol\beta}^T\mathbf X_0)E(Y_0)-0
=E(\hat{\boldsymbol\beta}^T)E(\mathbf X_0Y_0)-E(\hat{\boldsymbol\beta}^T)E(\mathbf X_0)E(Y_0)
\]
\[
=\boldsymbol\beta^T[E(\mathbf X_0Y_0)-E(\mathbf X_0)E(Y_0)]=\boldsymbol\beta^T\boldsymbol\Sigma_{xy}=\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta\tag{3-101}
\]

When the sample is large,

\[
Z=\frac{(Y_0-\hat Y_0)-0}{\sqrt{\sigma_e^2\left[1+\dfrac1m+\dfrac{(n-m)p}{nm(m-p-2)}\right]}}\;\dot\sim\;N(0,1)
\]

The 95% prediction interval for \(Y_0\) is

\[
\hat Y_0\pm z_{0.025}\sqrt{S_e^2\left[1+\frac1m+\frac{(n-m)p}{nm(m-p-2)}\right]}\tag{3-102}
\]

where \(S_e^2\) is given in (3-88) as

\[
S_e^2=\frac{1}{m-p-1}\sum_{j=1}^{m}\left\{Y_j-\bar Y_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)\right\}^2
\]

This is the unconditional (not dependent on \(\mathbf X_0\)) prediction interval for \(Y_0\) of a future observation.
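The unconditional half-width depends only on \(S_e^2\), \(m\), \(n\) and \(p\). A one-line sketch (the function name is ours):

```python
import math

def unconditional_pi_halfwidth(s2e, m, n, p, z=1.959963984540054):
    """Half-width of the unconditional 95% prediction interval (3-102).
    Unlike (3-87), no term involving x0 appears, because X0 is random."""
    return z * math.sqrt(s2e * (1 + 1/m + (n - m)*p/(n*m*(m - p - 2))))
```

The middle term \(1/m\) reflects estimation of \(\mu_y\); the last term is the price of estimating \(\boldsymbol\beta\) from only \(m\) complete cases, and it vanishes when \(n=m\) (no extra observations).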

3.4 An example for the multiple regression model

We use the data from Han and Li (2011) to compute the five maximum likelihood estimators (data in Table 1-1). In this example, we take the TOEFL score as \(Y\), and GRE Verbal, GRE Quantitative and GRE Analytic as \(X_1\), \(X_2\) and \(X_3\), respectively.

\[
\mathbf x=\begin{bmatrix}\text{GRE Verbal}\\\text{GRE Quantitative}\\\text{GRE Analytic}\end{bmatrix},\qquad y=\text{TOEFL}
\]

A normality test shows that \(Y\) and \(\mathbf x=[x_1,x_2,x_3]^T\) are normally distributed.

The five estimators are:

\[
\hat{\boldsymbol\mu}_x=\begin{bmatrix}419.5\\646.5\\523\end{bmatrix},\qquad
\hat\mu_y=563,\qquad
\hat{\boldsymbol\beta}=\begin{bmatrix}0.1776\\-0.1122\\0.0513\end{bmatrix}
\]
\[
\hat{\boldsymbol\Sigma}_{xx}=\begin{bmatrix}11614.75&-4656.75&5825.25\\-4656.75&11707.75&1510.5\\5825.25&1510.5&12239.75\end{bmatrix},\qquad
\hat\sigma_e^2=776
\]

If we do not consider the extra information, we have

\[
\hat{\boldsymbol\mu}_x=\begin{bmatrix}342\\710.5\\468.5\end{bmatrix},\qquad
\hat\mu_y=539.05,\qquad
\hat{\boldsymbol\beta}=\begin{bmatrix}0.1776\\-0.1122\\0.0513\end{bmatrix}
\]
\[
\hat{\boldsymbol\Sigma}_{xx}=\begin{bmatrix}3276&1324&2585.5\\2585.5&5694.75&4525.75\\5825.25&4525.75&11500.25\end{bmatrix},\qquad
\hat\sigma_e^2=776
\]
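The pattern in the example — \(\hat{\boldsymbol\beta}\) and \(\hat\sigma_e^2\) unchanged, the mean and covariance estimators shifted by the extra rows — can be reproduced on synthetic data. The data below are hypothetical stand-ins (the real sample is in Table 1-1):

```python
import numpy as np

# Hypothetical data mimicking the setup: n = 40 students, TOEFL observed
# for the first m = 20 only. Coefficients are illustrative.
rng = np.random.default_rng(42)
n, m, p = 40, 20, 3
X = rng.normal(loc=[420, 646, 523], scale=100, size=(n, p))   # GRE scores
y_full = 500 + X @ np.array([0.18, -0.11, 0.05]) + rng.normal(scale=28, size=n)
y = y_full[:m]                            # block of missing TOEFL values

xbar_m, xbar_n = X[:m].mean(0), X.mean(0)
Xc = X[:m] - xbar_m
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ (y - y.mean()))  # identical either way
mu_y_with = y.mean() - beta @ (xbar_m - xbar_n)           # uses extra x-rows
mu_y_without = y.mean()                                   # first m rows only
Sigma_xx_with = (X - xbar_n).T @ (X - xbar_n) / n
Sigma_xx_without = Xc.T @ Xc / m
```

Only the estimators involving \(\mathbf X\)-information (\(\hat{\boldsymbol\mu}_x\), \(\hat\mu_y\), \(\hat{\boldsymbol\Sigma}_{xx}\)) change; \(\hat{\boldsymbol\beta}\) is computed from the complete cases in both settings.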


Chapter 4

STATISTICAL ESTIMATION IN MULTIVARIATE REGRESSION MODEL

WITH A BLOCK OF MISSING OBSERVATIONS

Let \(\begin{bmatrix}\mathbf X\\\mathbf Y\end{bmatrix}\) have a multivariate normal distribution with mean vector \(\begin{bmatrix}\boldsymbol\mu_x\\\boldsymbol\mu_y\end{bmatrix}\) and covariance matrix

\[
\boldsymbol\Sigma=\begin{bmatrix}\boldsymbol\Sigma_{xx}&\boldsymbol\Sigma_{xy}\\\boldsymbol\Sigma_{yx}&\boldsymbol\Sigma_{yy}\end{bmatrix}
\]

where \(\mathbf X\) is a \(p\times1\) vector and \(\mathbf Y\) is a \(q\times1\) vector.

Suppose the following random sample with a block of missing \(\mathbf Y\) values is obtained:

\[
\begin{matrix}
X_{1,1}&X_{1,2}&\cdots&X_{1,p}&\quad&Y_{1,1}&Y_{1,2}&\cdots&Y_{1,q}\\
X_{2,1}&X_{2,2}&\cdots&X_{2,p}&&Y_{2,1}&Y_{2,2}&\cdots&Y_{2,q}\\
\vdots&\vdots&&\vdots&&\vdots&\vdots&&\vdots\\
X_{m,1}&X_{m,2}&\cdots&X_{m,p}&&Y_{m,1}&Y_{m,2}&\cdots&Y_{m,q}\\
X_{m+1,1}&X_{m+1,2}&\cdots&X_{m+1,p}&&&&&\\
\vdots&\vdots&&\vdots&&&&&\\
X_{n,1}&X_{n,2}&\cdots&X_{n,p}&&&&&
\end{matrix}
\]

Based on the data, we want to estimate the parameters. We can write the multivariate normal probability density function (pdf) as

\[
f(\mathbf x,\mathbf y)=g(\mathbf y\mid\mathbf x)\,h(\mathbf x)\tag{4-1}
\]

where \(g(\mathbf y\mid\mathbf x)\) is the conditional pdf of \(\mathbf Y\) given \(\mathbf X=\mathbf x\), and \(h(\mathbf x)\) is the marginal pdf of \(\mathbf X\):

\[
g_{\mathbf Y\mid\mathbf X}(\mathbf y_j\mid\mathbf x_j;\boldsymbol\mu_y,\boldsymbol\mu_x,\boldsymbol\beta,\boldsymbol\Sigma_e)
=\frac{1}{(2\pi)^{q/2}|\boldsymbol\Sigma_e|^{1/2}}\exp\left\{-\frac12[\mathbf y_j-E(\mathbf y_j\mid\mathbf x_j)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-E(\mathbf y_j\mid\mathbf x_j)]\right\}
\]
\[
=\frac{1}{(2\pi)^{q/2}|\boldsymbol\Sigma_e|^{1/2}}\exp\left\{-\frac12[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]\right\}\tag{4-2}
\]

\[
h_{\mathbf X}(\mathbf x_i;\boldsymbol\mu_x,\boldsymbol\Sigma_{xx})=\frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma_{xx}|^{1/2}}\exp\left\{-\frac12(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\right\}\tag{4-3}
\]

where

\[
E(\mathbf y_j\mid\mathbf x_j)=\boldsymbol\mu_y+\boldsymbol\Sigma_{yx}\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_j-\boldsymbol\mu_x)=\boldsymbol\mu_y+\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)\tag{4-4}
\]
\[
\boldsymbol\beta=\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\Sigma_{xy}\tag{4-5}
\]
\[
\boldsymbol\Sigma_e=\boldsymbol\Sigma_{yy}-\boldsymbol\Sigma_{yx}\boldsymbol\Sigma_{xx}^{-1}\boldsymbol\Sigma_{xy}=\boldsymbol\Sigma_{yy}-\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta\tag{4-6}
\]

for \(i=1,2,\ldots,n\) and \(j=1,2,\ldots,m\).

The joint likelihood function is

\[
L(\boldsymbol\mu_y,\boldsymbol\mu_x,\boldsymbol\beta,\boldsymbol\Sigma_{xx},\boldsymbol\Sigma_e)=\prod_{j=1}^{m}g_{\mathbf Y\mid\mathbf X}(\mathbf y_j\mid\mathbf x_j;\boldsymbol\mu_y,\boldsymbol\mu_x,\boldsymbol\beta,\boldsymbol\Sigma_e)\prod_{i=1}^{n}h_{\mathbf X}(\mathbf x_i;\boldsymbol\mu_x,\boldsymbol\Sigma_{xx})
\]
\[
L(\boldsymbol\mu_y,\boldsymbol\mu_x,\boldsymbol\beta,\boldsymbol\Sigma_{xx},\boldsymbol\Sigma_e)
=(2\pi)^{-(mq+np)/2}\,|\boldsymbol\Sigma_e|^{-m/2}\exp\left\{-\frac12\sum_{j=1}^{m}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]\right\}
\]
\[
\cdot|\boldsymbol\Sigma_{xx}|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\right\}\tag{4-7}
\]

4.1 Maximum Likelihood Estimators

To obtain the maximum likelihood estimators, we need to maximize the following (4-8) and (4-9) simultaneously:

\[
|\boldsymbol\Sigma_{xx}|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\right\}\tag{4-8}
\]
\[
|\boldsymbol\Sigma_e|^{-m/2}\exp\left\{-\frac12\sum_{j=1}^{m}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]\right\}\tag{4-9}
\]

Let us consider the exponent first and find the MLEs of \(\boldsymbol\mu_y\), \(\boldsymbol\mu_x\) and \(\boldsymbol\beta\) that minimize

\[
\frac12\sum_{j=1}^{m}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]
+\frac12\sum_{i=1}^{n}(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\tag{4-10}
\]

Since the sum of traces of matrices is equal to the trace of the sum, we have

\[
\frac12\sum_{i=1}^{n}(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)=\frac12\sum_{i=1}^{n}\operatorname{tr}\left[(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\right]
=\frac12\operatorname{tr}\left[\sum_{i=1}^{n}\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)(\mathbf x_i-\boldsymbol\mu_x)^T\right]
\]
\[
=\frac12\operatorname{tr}\left[\sum_{i=1}^{n}\boldsymbol\Sigma_{xx}^{-1}[(\mathbf x_i-\bar{\mathbf x}_n)+(\bar{\mathbf x}_n-\boldsymbol\mu_x)][(\mathbf x_i-\bar{\mathbf x}_n)+(\bar{\mathbf x}_n-\boldsymbol\mu_x)]^T\right]
\]
\[
=\frac12\operatorname{tr}\left[\boldsymbol\Sigma_{xx}^{-1}\sum_{i=1}^{n}(\mathbf x_i-\bar{\mathbf x}_n)(\mathbf x_i-\bar{\mathbf x}_n)^T\right]+\frac n2(\bar{\mathbf x}_n-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\bar{\mathbf x}_n-\boldsymbol\mu_x)\tag{4-11}
\]

where

\[
\bar{\mathbf x}_n=\frac1n\left[\sum_{i=1}^{n}x_{i1},\sum_{i=1}^{n}x_{i2},\ldots,\sum_{i=1}^{n}x_{ip}\right]^T=[\bar x_{n1},\bar x_{n2},\ldots,\bar x_{np}]^T\tag{4-12}
\]

and the cross-product terms vanish:

\[
\sum_{i=1}^{n}(\mathbf x_i-\bar{\mathbf x}_n)(\bar{\mathbf x}_n-\boldsymbol\mu_x)^T=\left[\sum_{i=1}^{n}(\mathbf x_i-\bar{\mathbf x}_n)\right](\bar{\mathbf x}_n-\boldsymbol\mu_x)^T=\mathbf 0
\]
\[
\sum_{i=1}^{n}(\bar{\mathbf x}_n-\boldsymbol\mu_x)(\mathbf x_i-\bar{\mathbf x}_n)^T=(\bar{\mathbf x}_n-\boldsymbol\mu_x)\left[\sum_{i=1}^{n}(\mathbf x_i-\bar{\mathbf x}_n)^T\right]=\mathbf 0
\]

Similarly, we have

\[
\frac12\sum_{j=1}^{m}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]
\]
\[
=\frac12\operatorname{tr}\left\{\sum_{j=1}^{m}\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)][\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)]^T\right\}
\]
\[
+\frac m2[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]\tag{4-13}
\]

where

\[
\bar{\mathbf x}_m=\frac1m\left[\sum_{j=1}^{m}x_{j1},\sum_{j=1}^{m}x_{j2},\ldots,\sum_{j=1}^{m}x_{jp}\right]^T=[\bar x_{m1},\bar x_{m2},\ldots,\bar x_{mp}]^T\tag{4-14}
\]
\[
\bar{\mathbf y}_m=\frac1m\left[\sum_{j=1}^{m}y_{j1},\sum_{j=1}^{m}y_{j2},\ldots,\sum_{j=1}^{m}y_{jq}\right]^T=[\bar y_{m1},\bar y_{m2},\ldots,\bar y_{mq}]^T\tag{4-15}
\]

and the cross-product terms vanish because

\[
\sum_{j=1}^{m}[\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)]=\mathbf 0
\]

Hence, minimizing (4-11) and (4-13) simultaneously will minimize (4-10).

First, consider (4-11). Since \(\boldsymbol\Sigma_{xx}^{-1}\) is positive definite, each term in (4-11) is greater than or equal to zero. The second term \(n(\bar{\mathbf x}_n-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\bar{\mathbf x}_n-\boldsymbol\mu_x)/2\) is minimized if we set

\[
\frac n2(\bar{\mathbf x}_n-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\bar{\mathbf x}_n-\boldsymbol\mu_x)=0\tag{4-16}
\]

Second, consider (4-13). Similarly, since \(\boldsymbol\Sigma_e^{-1}\) is positive definite, each term in (4-13) is greater than or equal to zero. To minimize the first term in (4-13), i.e.,

\[
\min_{\boldsymbol\beta}\;\frac12\operatorname{tr}\left\{\sum_{j=1}^{m}\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)][\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)]^T\right\}
\]

we take the derivative with respect to \(\boldsymbol\beta\), set it to zero, and obtain the MLE for \(\boldsymbol\beta\). We use the first- and second-order derivatives of the trace from Petersen and Pedersen (2012):

\[
\frac{\partial}{\partial\boldsymbol\beta}\left\{\frac12\operatorname{tr}\,\boldsymbol\Sigma_e^{-1}\sum_{j=1}^{m}[\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)][\mathbf y_j-\bar{\mathbf y}_m-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)]^T\right\}
\]
\[
=\frac12\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\sum_{j=1}^{m}\boldsymbol\Sigma_e^{-1}\Big[(\mathbf y_j-\bar{\mathbf y}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T-\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T
-(\mathbf y_j-\bar{\mathbf y}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta+\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\Big]
\]
\[
=\frac12\sum_{j=1}^{m}\Big\{0-(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}-(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}
+(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}+[(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T]^T\boldsymbol\beta[\boldsymbol\Sigma_e^{-1}]^T\Big\}
\]
\[
=\sum_{j=1}^{m}\left[-(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}+(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}\right]\tag{4-17}
\]

By equations (102) and (117) in Petersen and Pedersen (2012),

\[
\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\left[\boldsymbol\Sigma_e^{-1}\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\right]=(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}
\]

Using \(\operatorname{tr}(\mathbf A)=\operatorname{tr}(\mathbf A^T)\),

\[
\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\left[\boldsymbol\Sigma_e^{-1}(\mathbf y_j-\bar{\mathbf y}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\right]
=\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\left[\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}\right]
=(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}
\]

and

\[
\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\left[\boldsymbol\Sigma_e^{-1}\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\right]
=\frac{\partial}{\partial\boldsymbol\beta}\operatorname{tr}\left[\boldsymbol\beta^T(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}\right]
\]
\[
=(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}+[(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T]^T\boldsymbol\beta[\boldsymbol\Sigma_e^{-1}]^T
=2(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}
\]

Setting (4-17) equal to \(\mathbf 0\) gives

\[
\sum_{j=1}^{m}\left[-(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf y_j-\bar{\mathbf y}_m)^T\boldsymbol\Sigma_e^{-1}+(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta\boldsymbol\Sigma_e^{-1}\right]=\mathbf 0\tag{4-18}
\]

The second term \(m[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]/2\) is minimized if we set

\[
\frac m2[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\bar{\mathbf y}_m-\boldsymbol\mu_y-\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)]=0\tag{4-19}
\]

Simultaneously solving (4-16), (4-18) and (4-19), we obtain the MLEs of \(\boldsymbol\mu_y\), \(\boldsymbol\mu_x\) and \(\boldsymbol\beta\) as follows:

\[
\hat{\boldsymbol\mu}_x=\bar{\mathbf X}_n\tag{4-20}
\]
\[
\hat{\boldsymbol\mu}_y=\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf X}_m-\bar{\mathbf X}_n)\tag{4-21}
\]
\[
\hat{\boldsymbol\beta}=\mathbf S_{xxm}^{-1}\mathbf S_{xym}\tag{4-22}
\]

where

\[
\mathbf S_{xxm}=\sum_{j=1}^{m}(\mathbf X_j-\bar{\mathbf X}_m)(\mathbf X_j-\bar{\mathbf X}_m)^T\tag{4-23}
\]
\[
\mathbf S_{xym}=\sum_{j=1}^{m}(\mathbf X_j-\bar{\mathbf X}_m)(\mathbf Y_j-\bar{\mathbf Y}_m)^T\tag{4-24}
\]

Now we return to maximizing (4-8) and (4-9) simultaneously. When \(\hat{\boldsymbol\mu}_x=\bar{\mathbf X}_n\), (4-8) reduces to

\[
|\boldsymbol\Sigma_{xx}|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf x_i-\boldsymbol\mu_x)^T\boldsymbol\Sigma_{xx}^{-1}(\mathbf x_i-\boldsymbol\mu_x)\right\}
=|\boldsymbol\Sigma_{xx}|^{-n/2}\exp\left\{-\frac12\operatorname{tr}\left(\boldsymbol\Sigma_{xx}^{-1}\sum_{i=1}^{n}(\mathbf x_i-\bar{\mathbf x}_n)(\mathbf x_i-\bar{\mathbf x}_n)^T\right)\right\}\tag{4-25}
\]

By Result 4.10 in Johnson and Wichern (1998), (4-25) reaches its maximum when

\[
\hat{\boldsymbol\Sigma}_{xx}=\frac1n\sum_{i=1}^{n}(\mathbf X_i-\bar{\mathbf X}_n)(\mathbf X_i-\bar{\mathbf X}_n)^T\tag{4-26}
\]

Similarly, when \(\hat{\boldsymbol\mu}_y=\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf X}_m-\bar{\mathbf X}_n)\), \(\hat{\boldsymbol\mu}_x=\bar{\mathbf X}_n\) and \(\hat{\boldsymbol\beta}=\mathbf S_{xxm}^{-1}\mathbf S_{xym}\), (4-9) reduces to

\[
|\boldsymbol\Sigma_e|^{-m/2}\exp\left\{-\frac12\sum_{j=1}^{m}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T\boldsymbol\Sigma_e^{-1}[\mathbf y_j-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]\right\}
\]
\[
=|\boldsymbol\Sigma_e|^{-m/2}\exp\left\{-\frac12\operatorname{tr}\,\boldsymbol\Sigma_e^{-1}\sum_{j=1}^{m}[\mathbf y_j-\bar{\mathbf y}_m-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_m)][\mathbf y_j-\bar{\mathbf y}_m-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_m)]^T\right\}\tag{4-27}
\]

Again, by Result 4.10 in Johnson and Wichern (1998), (4-27) reaches its maximum when

\[
\hat{\boldsymbol\Sigma}_e=\frac1m\sum_{j=1}^{m}[\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)][\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)]^T\tag{4-28}
\]

In summary, we have the following five maximum likelihood estimators:

\[
\hat{\boldsymbol\mu}_x=\bar{\mathbf X}_n\tag{4-29}
\]
\[
\hat{\boldsymbol\mu}_y=\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf X}_m-\bar{\mathbf X}_n)\tag{4-30}
\]
\[
\hat{\boldsymbol\beta}=\mathbf S_{xxm}^{-1}\mathbf S_{xym}\tag{4-31}
\]
\[
\hat{\boldsymbol\Sigma}_{xx}=\frac1n\sum_{i=1}^{n}(\mathbf X_i-\bar{\mathbf X}_n)(\mathbf X_i-\bar{\mathbf X}_n)^T\tag{4-32}
\]
\[
\hat{\boldsymbol\Sigma}_e=\frac1m\sum_{j=1}^{m}[\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)][\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)]^T\tag{4-33}
\]
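The five estimators (4-29) through (4-33) translate directly into matrix code. The sketch below is a minimal implementation under the chapter's setup; the function name and test data are our own:

```python
import numpy as np

def mle_with_missing_block(X, Y):
    """The five MLEs (4-29)-(4-33): X is n x p (fully observed),
    Y is m x q, observed only for the first m rows of X."""
    n, p = X.shape
    m, q = Y.shape
    mu_x = X.mean(axis=0)                                # (4-29)
    Xm = X[:m]
    Xc = Xm - Xm.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxxm = Xc.T @ Xc                                     # (4-23)
    Sxym = Xc.T @ Yc                                     # (4-24)
    beta = np.linalg.solve(Sxxm, Sxym)                   # (4-31), p x q
    mu_y = Y.mean(axis=0) - beta.T @ (Xm.mean(axis=0) - mu_x)  # (4-30)
    Sigma_xx = (X - mu_x).T @ (X - mu_x) / n             # (4-32), biased
    resid = Yc - Xc @ beta
    Sigma_e = resid.T @ resid / m                        # (4-33), biased
    return mu_x, mu_y, beta, Sigma_xx, Sigma_e
```

Note that \(\hat{\boldsymbol\Sigma}_{xx}\) and \(\hat{\boldsymbol\Sigma}_e\) use the divisors \(n\) and \(m\) of the MLEs; the unbiased versions with divisors \(n-1\) and \(m-p-1\) appear below as \(\mathbf S_{xn}\) and \(\mathbf S_e\).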

Similarly, if we do not consider the extra information \(\mathbf X_{m+1},\mathbf X_{m+2},\ldots,\mathbf X_n\) and only use the first \(m\) observations, we have

\[
\hat{\boldsymbol\mu}_{x\_no}=\bar{\mathbf X}_m\tag{4-34}
\]
\[
\hat{\boldsymbol\mu}_{y\_no}=\bar{\mathbf Y}_m\tag{4-35}
\]
\[
\hat{\boldsymbol\beta}_{no}=\mathbf S_{xxm}^{-1}\mathbf S_{xym}\tag{4-36}
\]
\[
\hat{\boldsymbol\Sigma}_{xx\_no}=\frac1m\sum_{i=1}^{m}(\mathbf X_i-\bar{\mathbf X}_m)(\mathbf X_i-\bar{\mathbf X}_m)^T\tag{4-37}
\]
\[
\hat{\boldsymbol\Sigma}_{e\_no}=\frac1m\sum_{j=1}^{m}[\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)][\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)]^T\tag{4-38}
\]

4.2 Properties of the Maximum Likelihood Estimators

4.2.1 Estimator of the Mean Vector of 푿

The expectation of \(\hat{\boldsymbol\mu}_x\) is

\[
E(\hat{\boldsymbol\mu}_x)=E(\bar{\mathbf X}_n)=E[\bar X_{n1},\bar X_{n2},\ldots,\bar X_{np}]^T=[E(\bar X_{n1}),E(\bar X_{n2}),\ldots,E(\bar X_{np})]^T=[\mu_{x1},\mu_{x2},\ldots,\mu_{xp}]^T=\boldsymbol\mu_x\tag{4-39}
\]

So \(\hat{\boldsymbol\mu}_x\) is an unbiased estimator. The covariance of \(\hat{\boldsymbol\mu}_x\) is

\[
\operatorname{Cov}(\hat{\boldsymbol\mu}_x)=\operatorname{Cov}(\bar{\mathbf X}_n)=E(\bar{\mathbf X}_n-\boldsymbol\mu_x)(\bar{\mathbf X}_n-\boldsymbol\mu_x)^T=\frac1{n^2}\sum_{j=1}^{n}\sum_{l=1}^{n}E(\mathbf X_j-\boldsymbol\mu_x)(\mathbf X_l-\boldsymbol\mu_x)^T
\]
\[
=\frac1{n^2}\sum_{j=1}^{n}E(\mathbf X_j-\boldsymbol\mu_x)(\mathbf X_j-\boldsymbol\mu_x)^T=\frac1n\boldsymbol\Sigma_{xx}\tag{4-40}
\]

By our assumptions, \(\mathbf X\sim N_p(\boldsymbol\mu_x,\boldsymbol\Sigma_{xx})\). Hence \(\hat{\boldsymbol\mu}_x=\bar{\mathbf X}_n\) is distributed as

\[
\hat{\boldsymbol\mu}_x\sim N_p\!\left(\boldsymbol\mu_x,\;\frac1n\boldsymbol\Sigma_{xx}\right)
\]

4.2.2 Estimator of the Covariance Matrix of 푿

Since \(\mathbf X_1,\mathbf X_2,\ldots,\mathbf X_n\) is a random sample of size \(n\) from a \(p\)-variate normal distribution with mean \(\boldsymbol\mu_x\) and covariance matrix \(\boldsymbol\Sigma_{xx}\),

\[
\sum_{i=1}^{n}(\mathbf X_i-\bar{\mathbf X}_n)(\mathbf X_i-\bar{\mathbf X}_n)^T\sim W_p(\boldsymbol\Sigma_{xx},\,n-1)
\]

where \(W_p(\boldsymbol\Sigma_{xx},n-1)\) is the Wishart distribution with \(n-1\) degrees of freedom.

We have

\[
\hat{\boldsymbol\Sigma}_{xx}=\frac1n\sum_{i=1}^{n}(\mathbf X_i-\bar{\mathbf X}_n)(\mathbf X_i-\bar{\mathbf X}_n)^T
\]

so

\[
n\hat{\boldsymbol\Sigma}_{xx}\sim W_p(\boldsymbol\Sigma_{xx},\,n-1)
\]

Then by Nydick (2012), we have

\[
E(n\hat{\boldsymbol\Sigma}_{xx})=(n-1)\boldsymbol\Sigma_{xx},\qquad
\operatorname{Var}(n\hat\Sigma_{ij})=(n-1)(\Sigma_{ij}^2+\Sigma_{ii}\Sigma_{jj})
\]

The expectation of \(\hat{\boldsymbol\Sigma}_{xx}\) is

\[
E(\hat{\boldsymbol\Sigma}_{xx})=\frac{n-1}{n}\boldsymbol\Sigma_{xx}\tag{4-41}
\]

So \(\hat{\boldsymbol\Sigma}_{xx}\) is a biased estimator, and

\[
\operatorname{Var}(\hat\Sigma_{ij})=\frac{n-1}{n^2}(\Sigma_{ij}^2+\Sigma_{ii}\Sigma_{jj})\tag{4-42}
\]

If we define

\[
\mathbf S_{xn}=\frac{n}{n-1}\hat{\boldsymbol\Sigma}_{xx}=\frac1{n-1}\sum_{i=1}^{n}(\mathbf X_i-\bar{\mathbf X}_n)(\mathbf X_i-\bar{\mathbf X}_n)^T\tag{4-43}
\]

then we have

\[
E(\mathbf S_{xn})=\frac{n}{n-1}E(\hat{\boldsymbol\Sigma}_{xx})=\boldsymbol\Sigma_{xx}
\]

so \(\mathbf S_{xn}\) is an unbiased estimator of \(\boldsymbol\Sigma_{xx}\).
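The two divisors can be checked on a small hypothetical sample: the \(n-1\) convention of \(\mathbf S_{xn}\) in (4-43) is the same one NumPy's `np.cov` uses by default.

```python
import numpy as np

# Small hypothetical sample; S_xn in (4-43) uses the n-1 divisor,
# while the MLE Sigma_hat in (4-32) uses n.
X = np.array([[1.0, 2.0], [3.0, 5.0], [4.0, 4.0], [6.0, 9.0]])
n = X.shape[0]
Xc = X - X.mean(axis=0)
S_xn = Xc.T @ Xc / (n - 1)       # unbiased estimator of Sigma_xx
Sigma_hat = Xc.T @ Xc / n        # the MLE, biased by the factor (n-1)/n
```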

4.2.3 Estimator of the Regression Coefficient Matrix

As in Chapter 3, we first derive the conditional expectation and covariance matrix of \(\hat{\boldsymbol\beta}\) given \(\mathbf X=\mathbf x\), and then the unconditional expectation and covariance matrix of the estimator.

The conditional expectation of \(\hat{\boldsymbol\beta}\) given \(\mathbf X=\mathbf x\) is

\[
E(\hat{\boldsymbol\beta}\mid\mathbf x)=E(\mathbf S_{xxm}^{-1}\mathbf S_{xym}\mid\mathbf x)=\mathbf S_{xxm}^{-1}E(\mathbf S_{xym}\mid\mathbf x)
=\mathbf S_{xxm}^{-1}E\left\{\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf Y_j-\bar{\mathbf Y}_m)^T\,\middle|\,\mathbf x\right\}
\]
\[
=\mathbf S_{xxm}^{-1}\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)E(\mathbf Y_j^T\mid\mathbf x)
=\mathbf S_{xxm}^{-1}\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)[\boldsymbol\mu_y+\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]^T
\]
\[
=\mathbf S_{xxm}^{-1}\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)[\boldsymbol\mu_y^T+(\mathbf x_j-\boldsymbol\mu_x)^T\boldsymbol\beta]
=\mathbf S_{xxm}^{-1}\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)\mathbf x_j^T\boldsymbol\beta
=\mathbf S_{xxm}^{-1}\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\boldsymbol\beta
\]
\[
=\mathbf S_{xxm}^{-1}\mathbf S_{xxm}\boldsymbol\beta=\boldsymbol\beta\tag{4-44}
\]

So we have the unconditional expectation of \(\hat{\boldsymbol\beta}\):

\[
E(\hat{\boldsymbol\beta})=E[E(\hat{\boldsymbol\beta}\mid\mathbf x)]=E(\boldsymbol\beta)=\boldsymbol\beta\tag{4-45}
\]

\(\hat{\boldsymbol\beta}\) is an unbiased estimator.

We use the vec-operator to obtain the conditional covariance matrix of \(\hat{\boldsymbol\beta}\) given \(\mathbf X=\mathbf x\). Since

\[
\boldsymbol\beta=\begin{bmatrix}\beta_{11}&\beta_{12}&\cdots&\beta_{1q}\\\beta_{21}&\beta_{22}&\cdots&\beta_{2q}\\\vdots&\vdots&&\vdots\\\beta_{p1}&\beta_{p2}&\cdots&\beta_{pq}\end{bmatrix}
=[\boldsymbol\beta_{(1)}\ \vdots\ \boldsymbol\beta_{(2)}\ \vdots\ \cdots\ \vdots\ \boldsymbol\beta_{(q)}]\tag{4-46}
\]

we have

\[
\operatorname{vec}(\boldsymbol\beta)=[\beta_{11},\beta_{21},\ldots,\beta_{p1},\beta_{12},\ldots,\beta_{p2},\ldots,\beta_{1q},\ldots,\beta_{pq}]^T=[\boldsymbol\beta_{(1)}^T,\boldsymbol\beta_{(2)}^T,\ldots,\boldsymbol\beta_{(q)}^T]^T\tag{4-47}
\]

Then by Loan (2009), we have

\[
\operatorname{Cov}[\operatorname{vec}(\hat{\boldsymbol\beta}\mid\mathbf x)]=\operatorname{Cov}[\operatorname{vec}(\mathbf S_{xxm}^{-1}\mathbf S_{xym}\mid\mathbf x)]
=\operatorname{Cov}\left\{\operatorname{vec}\sum_{j=1}^{m}\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf Y_j^T\mid\mathbf x)\right\}
\]
\[
=\operatorname{Cov}\left\{\sum_{j=1}^{m}[\mathbf I\otimes\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)]\operatorname{vec}(\mathbf Y_j^T\mid\mathbf x)\right\}
\]
\[
=\sum_{j=1}^{m}[\mathbf I\otimes\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)]\operatorname{Cov}(\mathbf Y_j\mid\mathbf x)[\mathbf I\otimes(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}]\qquad(\text{the }\mathbf Y_j\text{ are independent})
\]
\[
=\sum_{j=1}^{m}[\mathbf I\otimes\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)][\boldsymbol\Sigma_e\otimes(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}]
=\sum_{j=1}^{m}\boldsymbol\Sigma_e\otimes\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}
\]
\[
=\boldsymbol\Sigma_e\otimes\mathbf S_{xxm}^{-1}\left[\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\right]\mathbf S_{xxm}^{-1}
=\boldsymbol\Sigma_e\otimes\mathbf S_{xxm}^{-1}\tag{4-48}
\]

where \(\otimes\) denotes the Kronecker product.

By the Law of Total Covariance and by Nydick (2012), the unconditional covariance matrix of \(\operatorname{vec}(\hat{\boldsymbol\beta})\) is

\[
\operatorname{Cov}[\operatorname{vec}(\hat{\boldsymbol\beta})]=E\{\operatorname{Cov}[\operatorname{vec}(\hat{\boldsymbol\beta}\mid\mathbf x)]\}+\operatorname{Cov}\{E[\operatorname{vec}(\hat{\boldsymbol\beta}\mid\mathbf x)]\}
=E[\boldsymbol\Sigma_e\otimes\mathbf S_{xxm}^{-1}]+\operatorname{Cov}[\operatorname{vec}(\boldsymbol\beta)]
\]
\[
=\boldsymbol\Sigma_e\otimes E(\mathbf S_{xxm}^{-1})+\mathbf 0=\frac1{m-p-2}\boldsymbol\Sigma_e\otimes\boldsymbol\Sigma_{xx}^{-1}\tag{4-49}
\]

Then we have

\[
\operatorname{Cov}(\boldsymbol\beta_{(i)},\boldsymbol\beta_{(j)})=\frac{\boldsymbol\Sigma_{e(i,j)}}{m-p-2}\boldsymbol\Sigma_{xx}^{-1},\qquad i,j=1,2,\ldots,q\tag{4-50}
\]

When the sample is large, \(\hat{\boldsymbol\beta}\) is asymptotically normally distributed.
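The block structure of (4-49)-(4-50) maps directly onto a Kronecker product in code. The parameter values below are hypothetical, chosen only to make the structure visible:

```python
import numpy as np

# Structure of (4-49)-(4-50) with hypothetical Sigma_e, Sigma_xx, m, p
# (q = 2 responses, p = 3 predictors, m = 10 complete cases).
m, p = 10, 3
Sigma_e = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_xx_inv = np.linalg.inv(np.diag([1.0, 2.0, 4.0]))
cov_vec_beta = np.kron(Sigma_e, Sigma_xx_inv) / (m - p - 2)   # (4-49)
# Block (i, j) of this pq x pq matrix is Sigma_e[i, j]/(m-p-2) * Sigma_xx^{-1},
# which is exactly Cov(beta_(i), beta_(j)) in (4-50).
block_12 = cov_vec_beta[0:p, p:2*p]
```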

4.2.4 Estimator of the Mean Vector of 풀

As in Section 4.2.3, we first derive the conditional expectation and covariance matrix of \(\hat{\boldsymbol\mu}_y\) given \(\mathbf X=\mathbf x\), and then the unconditional ones. The conditional expectation of \(\hat{\boldsymbol\mu}_y\) is

\[
E(\hat{\boldsymbol\mu}_y\mid\mathbf x)=E\{[\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)]\mid\mathbf x\}=E(\bar{\mathbf Y}_m\mid\mathbf x)-E(\hat{\boldsymbol\beta}^T\mid\mathbf x)(\bar{\mathbf x}_m-\bar{\mathbf x}_n)
\]
\[
=\boldsymbol\mu_y+\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)-\boldsymbol\beta^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)=\boldsymbol\mu_y+\boldsymbol\beta^T(\bar{\mathbf x}_n-\boldsymbol\mu_x)\tag{4-51}
\]

where

\[
E(\bar{\mathbf Y}_m\mid\mathbf x)=E\left(\frac1m\sum_{j=1}^{m}\mathbf Y_j\,\middle|\,\mathbf x\right)=\frac1m\sum_{j=1}^{m}E(\mathbf Y_j\mid\mathbf x_j)=\frac1m\sum_{j=1}^{m}[\boldsymbol\mu_y+\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)]
\]
\[
=\frac1m\left[m\boldsymbol\mu_y+\boldsymbol\beta^T\left(\sum_{j=1}^{m}\mathbf x_j-m\boldsymbol\mu_x\right)\right]=\boldsymbol\mu_y+\boldsymbol\beta^T(\bar{\mathbf x}_m-\boldsymbol\mu_x)\tag{4-52}
\]

So the expectation of \(\hat{\boldsymbol\mu}_y\) is

\[
E(\hat{\boldsymbol\mu}_y)=E[E(\hat{\boldsymbol\mu}_y\mid\mathbf X)]=E[\boldsymbol\mu_y+\boldsymbol\beta^T(\bar{\mathbf X}_n-\boldsymbol\mu_x)]=\boldsymbol\mu_y+\boldsymbol\beta^T(\boldsymbol\mu_x-\boldsymbol\mu_x)=\boldsymbol\mu_y\tag{4-53}
\]

\(\hat{\boldsymbol\mu}_y\) is an unbiased estimator.

The conditional covariance matrix of \(\hat{\boldsymbol\mu}_y\) is

\[
\operatorname{Cov}(\hat{\boldsymbol\mu}_y\mid\mathbf x)=\operatorname{Cov}\{[\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)]\mid\mathbf x\}
\]
\[
=\operatorname{Cov}(\bar{\mathbf Y}_m\mid\mathbf x)+\operatorname{Cov}[\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]-2\operatorname{Cov}[\bar{\mathbf Y}_m,\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]
\]
\[
=\frac1m\boldsymbol\Sigma_e+\boldsymbol\Sigma_e\left[(\bar{\mathbf x}_m-\bar{\mathbf x}_n)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\right]\tag{4-54}
\]

where

\[
\operatorname{Cov}(\bar{\mathbf Y}_m\mid\mathbf x)=\frac1m\boldsymbol\Sigma_e\tag{4-55}
\]

\[
\operatorname{Cov}[\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]=\operatorname{Cov}[(\mathbf S_{xxm}^{-1}\mathbf S_{xym})^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]=\operatorname{Cov}[\mathbf S_{yxm}\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]
\]
\[
=\operatorname{Cov}\left\{\sum_{j=1}^{m}(\mathbf Y_j-\bar{\mathbf Y}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\,\middle|\,\mathbf x\right\}
=\sum_{j=1}^{m}\operatorname{Cov}\left[(\mathbf Y_j\mid\mathbf x)(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\right]
\]

using the independence of the \(\mathbf Y_j\). The quantity \((\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\) is a scalar. Let

\[
a_j=(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)
\]

Then

\[
a_j^2=\left[(\bar{\mathbf x}_m-\bar{\mathbf x}_n)^T\mathbf S_{xxm}^{-1}(\mathbf x_j-\bar{\mathbf x}_m)\right]\left[(\mathbf x_j-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\right]
\]

Hence we have

\[
\operatorname{Cov}[\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\mid\mathbf x]=\sum_{j=1}^{m}\operatorname{Cov}[(\mathbf Y_j\mid\mathbf x)a_j]=\sum_{j=1}^{m}a_j^2\operatorname{Var}(\mathbf Y_j\mid\mathbf x)=\boldsymbol\Sigma_e\sum_{j=1}^{m}a_j^2
\]
\[
=\boldsymbol\Sigma_e(\bar{\mathbf x}_m-\bar{\mathbf x}_n)^T\mathbf S_{xxm}^{-1}\left[\sum_{j=1}^{m}(\mathbf x_j-\bar{\mathbf x}_m)(\mathbf x_j-\bar{\mathbf x}_m)^T\right]\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)
=\boldsymbol\Sigma_e(\bar{\mathbf x}_m-\bar{\mathbf x}_n)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf x}_m-\bar{\mathbf x}_n)\tag{4-56}
\]

\[
\operatorname{Cov}[\bar{\mathbf Y}_m,\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)]\mid\mathbf x=\operatorname{Cov}[(\bar{\mathbf Y}_m,\hat{\boldsymbol\beta}^T)\mid\mathbf x](\bar{\mathbf x}_m-\bar{\mathbf x}_n)=\mathbf 0\tag{4-57}
\]

To obtain the unconditional covariance matrix of \(\hat{\boldsymbol\mu}_y\), we use the Law of Total Covariance,

\[
\operatorname{Cov}(\hat{\boldsymbol\mu}_y)=\operatorname{Cov}[E(\hat{\boldsymbol\mu}_y\mid\mathbf x)]+E[\operatorname{Cov}(\hat{\boldsymbol\mu}_y\mid\mathbf x)]
\]

Now

\[
\operatorname{Cov}[E(\hat{\boldsymbol\mu}_y\mid\mathbf x)]=\operatorname{Cov}[\boldsymbol\mu_y+\boldsymbol\beta^T(\bar{\mathbf X}_n-\boldsymbol\mu_x)]=\boldsymbol\beta^T\operatorname{Cov}(\bar{\mathbf X}_n)\boldsymbol\beta=\frac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta\tag{4-58}
\]

To obtain \(E[(\bar{\mathbf X}_m-\bar{\mathbf X}_n)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf X}_m-\bar{\mathbf X}_n)]\), we need the distribution of \(\bar{\mathbf X}_m-\bar{\mathbf X}_n\). Since

\[
\bar{\mathbf X}_m-\bar{\mathbf X}_n=\frac{n-m}{n}(\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m})\tag{4-59}
\]

and \(\bar{\mathbf X}_m\) and \(\bar{\mathbf X}_{n-m}\) are independent and normally distributed, with

\[
E(\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m})=\boldsymbol\mu_x-\boldsymbol\mu_x=\mathbf 0\tag{4-60}
\]
\[
\operatorname{Cov}(\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m})=\operatorname{Cov}(\bar{\mathbf X}_m)+\operatorname{Cov}(\bar{\mathbf X}_{n-m})=\frac1m\boldsymbol\Sigma_{xx}+\frac1{n-m}\boldsymbol\Sigma_{xx}=\frac{n}{m(n-m)}\boldsymbol\Sigma_{xx}\tag{4-61}
\]

we have

\[
\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m}\sim N_p\!\left(\mathbf 0,\;\frac{n}{m(n-m)}\boldsymbol\Sigma_{xx}\right)
\]

Hence

\[
(\bar{\mathbf X}_m-\bar{\mathbf X}_n)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf X}_m-\bar{\mathbf X}_n)=\left(\frac{n-m}{n}\right)^2(\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m})^T\mathbf S_{xxm}^{-1}(\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m})
=\left(\frac{n-m}{n}\right)^2\cdot\frac{n}{m(n-m)}\cdot\frac1{m-1}\,T^2_{p,m-1}
\]

where

\[
T^2_{p,m-1}=\left(\frac{\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m}}{\sqrt{\tfrac{n}{m(n-m)}}}\right)^T\left(\frac{\mathbf S_{xxm}}{m-1}\right)^{-1}\left(\frac{\bar{\mathbf X}_m-\bar{\mathbf X}_{n-m}}{\sqrt{\tfrac{n}{m(n-m)}}}\right)\tag{4-62}
\]

and

\[
E(T^2_{p,m-1})=\frac{(m-1)p}{m-p}E(F_{p,m-p})=\frac{(m-1)p}{m-p}\cdot\frac{m-p}{m-p-2}=\frac{(m-1)p}{m-p-2}
\]

The expectation of the conditional covariance matrix of \(\hat{\boldsymbol\mu}_y\) is

\[
E[\operatorname{Cov}(\hat{\boldsymbol\mu}_y\mid\mathbf x)]=E\left\{\frac1m\boldsymbol\Sigma_e+\boldsymbol\Sigma_e[(\bar{\mathbf X}_m-\bar{\mathbf X}_n)^T\mathbf S_{xxm}^{-1}(\bar{\mathbf X}_m-\bar{\mathbf X}_n)]\right\}
\]
\[
=\frac1m\boldsymbol\Sigma_e+\boldsymbol\Sigma_e\left(\frac{n-m}{n}\right)^2\cdot\frac{n}{m(n-m)}\cdot\frac1{m-1}\cdot\frac{(m-1)p}{m-p-2}
=\frac1m\left[1+\frac{(n-m)p}{n(m-p-2)}\right]\boldsymbol\Sigma_e\tag{4-63}
\]

Using (4-58) and (4-63), the unconditional covariance matrix of \(\hat{\boldsymbol\mu}_y\) is

\[
\operatorname{Cov}(\hat{\boldsymbol\mu}_y)=\frac1m\left[1+\frac{(n-m)p}{n(m-p-2)}\right]\boldsymbol\Sigma_e+\frac1n\boldsymbol\beta^T\boldsymbol\Sigma_{xx}\boldsymbol\beta\tag{4-64}
\]

When the sample is large, \(\hat{\boldsymbol\mu}_y\) is asymptotically normally distributed.

4.2.5 Estimator of the Conditional Covariance Matrix of Y given x

We use the same idea as for the multiple regression model in Chapter 3. For given \(\mathbf x_j\),

\[
\mathbf Y_j=\boldsymbol\mu_y+\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)+\boldsymbol\varepsilon_j,\qquad j=1,2,\ldots,m\tag{4-65}
\]

where

\[
\mathbf Y_j=[Y_{j1},Y_{j2},\ldots,Y_{jq}]^T,\qquad
\boldsymbol\mu_y=[\mu_{y1},\mu_{y2},\ldots,\mu_{yq}]^T,\qquad
\boldsymbol\varepsilon_j=[\varepsilon_{j1},\varepsilon_{j2},\ldots,\varepsilon_{jq}]^T
\]
\[
\boldsymbol\beta^T=\begin{bmatrix}\beta_{11}&\beta_{21}&\cdots&\beta_{p1}\\\beta_{12}&\beta_{22}&\cdots&\beta_{p2}\\\vdots&\vdots&&\vdots\\\beta_{1q}&\beta_{2q}&\cdots&\beta_{pq}\end{bmatrix}
\]
\[
\operatorname{Cov}(\mathbf Y_j\mid\mathbf x)=\operatorname{Cov}(\boldsymbol\varepsilon_j)=\boldsymbol\Sigma_e\tag{4-66}
\]
\[
E(\boldsymbol\varepsilon_j)=E(\mathbf Y_j\mid\mathbf x)-\boldsymbol\mu_y-\boldsymbol\beta^T(\mathbf x_j-\boldsymbol\mu_x)=\mathbf 0\tag{4-67}
\]

We have

\[
\hat{\boldsymbol\varepsilon}_j=\mathbf Y_j-\hat{\boldsymbol\mu}_y-\hat{\boldsymbol\beta}^T(\mathbf x_j-\hat{\boldsymbol\mu}_x)
=\mathbf Y_j-[\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\bar{\mathbf x}_m-\bar{\mathbf x}_n)]-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_n)
=\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_m)\tag{4-68}
\]

Hence

\[
\hat{\boldsymbol\Sigma}_e\mid\mathbf x=\frac1m\sum_{j=1}^{m}[\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_m)][\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf x_j-\bar{\mathbf x}_m)]^T=\frac1m\hat{\boldsymbol\varepsilon}^T\hat{\boldsymbol\varepsilon}\tag{4-69}
\]

By Result 7.10 in Johnson and Wichern (1998),

\[
m\hat{\boldsymbol\Sigma}_e\mid\mathbf x=\hat{\boldsymbol\varepsilon}^T\hat{\boldsymbol\varepsilon}\sim W_{q,m-p-1}(\boldsymbol\Sigma_e)
\]

where \(W_{q,m-p-1}(\boldsymbol\Sigma_e)\) is the Wishart distribution with \(m-p-1\) degrees of freedom.

The conditional expectation of \(\hat{\boldsymbol\Sigma}_e\) is

\[
E(\hat{\boldsymbol\Sigma}_e\mid\mathbf x)=\frac{m-p-1}{m}\boldsymbol\Sigma_e\tag{4-70}
\]

and the conditional variance of \(\hat\Sigma_{e(ij)}\) is

\[
\operatorname{Var}(\hat\Sigma_{e(ij)}\mid\mathbf x)=\frac{m-p-1}{m^2}\left[\Sigma_{e(ij)}^2+\Sigma_{e(ii)}\Sigma_{e(jj)}\right]\tag{4-71}
\]

Since neither \(E(\hat{\boldsymbol\Sigma}_e\mid\mathbf x)\) nor \(\operatorname{Var}(\hat\Sigma_{e(ij)}\mid\mathbf x)\) involves \(\mathbf X\),

\[
E(\hat{\boldsymbol\Sigma}_e)=E[E(\hat{\boldsymbol\Sigma}_e\mid\mathbf x)]=\frac{m-p-1}{m}\boldsymbol\Sigma_e\tag{4-72}
\]

so \(\hat{\boldsymbol\Sigma}_e\) is a biased estimator of \(\boldsymbol\Sigma_e\). If we define

\[
\mathbf S_e=\frac1{m-p-1}\sum_{j=1}^{m}[\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)][\mathbf Y_j-\bar{\mathbf Y}_m-\hat{\boldsymbol\beta}^T(\mathbf X_j-\bar{\mathbf X}_m)]^T\tag{4-73}
\]

then \(E(\mathbf S_e)=\boldsymbol\Sigma_e\), so \(\mathbf S_e\) is an unbiased estimator of \(\boldsymbol\Sigma_e\).

By the Law of Total Variance, the unconditional variance of \(\hat\Sigma_{e(ij)}\) is

\[
\operatorname{Var}(\hat\Sigma_{e(ij)})=E[\operatorname{Var}(\hat\Sigma_{e(ij)}\mid\mathbf x)]+\operatorname{Var}[E(\hat\Sigma_{e(ij)}\mid\mathbf x)]=\frac{m-p-1}{m^2}\left[\Sigma_{e(ij)}^2+\Sigma_{e(ii)}\Sigma_{e(jj)}\right]\tag{4-74}
\]

Since \(m\hat{\boldsymbol\Sigma}_e\mid\mathbf x=\hat{\boldsymbol\varepsilon}^T\hat{\boldsymbol\varepsilon}\) does not depend on \(\mathbf X\),

\[
m\hat{\boldsymbol\Sigma}_e\sim W_{q,m-p-1}(\boldsymbol\Sigma_e)\tag{4-75}
\]

4.3 Prediction

Suppose we have a future observation \(\mathbf X_0=[X_{0,1},X_{0,2},\ldots,X_{0,p}]^T\), \(\mathbf Y_0=[Y_{0,1},Y_{0,2},\ldots,Y_{0,q}]^T\) with mean vector \(\boldsymbol\mu\) and covariance matrix \(\boldsymbol\Sigma\):

\[
\boldsymbol\mu=\begin{bmatrix}\boldsymbol\mu_x\\\boldsymbol\mu_y\end{bmatrix},\qquad
\boldsymbol\Sigma=\begin{bmatrix}\boldsymbol\Sigma_{xx}&\boldsymbol\Sigma_{xy}\\\boldsymbol\Sigma_{yx}&\boldsymbol\Sigma_{yy}\end{bmatrix}
\]

As in Chapter 3, we have the following three kinds of prediction interval for \(\mathbf Y_0\):

1) the usual prediction interval for \(\mathbf Y_0\), conditioning on \(\mathbf X=\mathbf x\) and \(\mathbf X_0=\mathbf x_0\);

2) a prediction interval for \(\mathbf Y_0\), unconditional on \(\mathbf X\) but conditioning on \(\mathbf X_0=\mathbf x_0\);

3) the unconditional prediction interval for \(\mathbf Y_0\).

4.3.1 Usual prediction interval

– Conditioning on \(\mathbf X=\mathbf x\) and \(\mathbf X_0=\mathbf x_0\)

The prediction value of \(\mathbf Y_0\) given \(\mathbf X=\mathbf x\) and \(\mathbf X_0=\mathbf x_0\) is

\[
\hat{\mathbf Y}_0\mid\mathbf x,\mathbf x_0=\hat{\boldsymbol\mu}_y+\hat{\boldsymbol\beta}^T(\mathbf x_0-\bar{\mathbf x}_n)\tag{4-76}
\]

The \(i\)th response follows the multiple regression model in (3-64) in Chapter 3:

\[
\hat Y_{0(i)}\mid\mathbf x,\mathbf x_0=\hat\mu_{y(i)}+\hat{\boldsymbol\beta}_{(i)}^T(\mathbf x_0-\bar{\mathbf x}_n),\qquad i=1,2,\ldots,q\tag{4-77}
\]

Hence, the 95% prediction interval for \(Y_{0(i)}\) given \(\mathbf X=\mathbf x\) and \(\mathbf X_0=\mathbf x_0\) follows (3-73) too:

\[
\hat Y_{0(i)}\mid\mathbf x,\mathbf x_0\pm t_{0.025,\,m-p-1}\sqrt{\frac{m\hat\Sigma_{e(ii)}}{m-p-1}\left[1+\frac1m+(\mathbf x_0-\bar{\mathbf x}_m)^T\mathbf S_{xxm}^{-1}(\mathbf x_0-\bar{\mathbf x}_m)\right]},\qquad i=1,2,\ldots,q\tag{4-78}
\]

4.3.2 Prediction interval

– Unconditional on \(\mathbf X\), but conditioning on \(\mathbf X_0=\mathbf x_0\)

The prediction value of \(\mathbf Y_0\) given \(\mathbf X_0=\mathbf x_0\) is

\[
\hat{\mathbf Y}_0\mid\mathbf x_0=\hat{\boldsymbol\mu}_y+\hat{\boldsymbol\beta}^T(\mathbf x_0-\hat{\boldsymbol\mu}_x)\tag{4-79}
\]

The \(i\)th response follows the multiple regression model in (3-74) in Chapter 3:

\[
\hat Y_{0(i)}\mid\mathbf x_0=\hat\mu_{y(i)}+\hat{\boldsymbol\beta}_{(i)}^T(\mathbf x_0-\hat{\boldsymbol\mu}_x),\qquad i=1,2,\ldots,q\tag{4-80}
\]

The 95% prediction interval for \(Y_{0(i)}\) given \(\mathbf X_0=\mathbf x_0\) follows (3-87) as

\[
\hat Y_{0(i)}\mid\mathbf x_0\pm z_{0.025}\sqrt{S_{e(i)}\left[1+\frac1m+\frac{(n-m)p}{mn(m-p-2)}+\frac{(\mathbf x_0-\bar{\mathbf x}_n)^T\mathbf S_{xn}^{-1}(\mathbf x_0-\bar{\mathbf x}_n)}{m-p-2}\right]}\tag{4-81}
\]

where

\[
S_{e(i)}=\frac1{m-p-1}\sum_{j=1}^{m}\left\{Y_{j(i)}-\bar Y_{m(i)}-\hat{\boldsymbol\beta}_{(i)}^T(\mathbf X_j-\bar{\mathbf X}_m)\right\}^2\tag{4-82}
\]

and \(\mathbf S_{xn}\) is given in (4-43).

4.3.3 Unconditional prediction interval

The prediction value of \(\mathbf Y_0\) is

\[
\hat{\mathbf Y}_0=\hat{\boldsymbol\mu}_y+\hat{\boldsymbol\beta}^T(\mathbf X_0-\hat{\boldsymbol\mu}_x)\tag{4-83}
\]

The \(i\)th response follows the multiple regression model in (3-91) in Chapter 3:

\[
\hat Y_{0(i)}=\hat\mu_{y(i)}+\hat{\boldsymbol\beta}_{(i)}^T(\mathbf X_0-\hat{\boldsymbol\mu}_x),\qquad i=1,2,\ldots,q\tag{4-84}
\]

Hence, the 95% prediction interval for \(Y_{0(i)}\) follows (3-102) too:

\[
\hat Y_{0(i)}\pm z_{0.025}\sqrt{S_{e(i)}\left[1+\frac1m+\frac{(n-m)p}{nm(m-p-2)}\right]}\tag{4-85}
\]

where \(S_{e(i)}\) is given in (4-82).


Appendix A

Statistical Estimation in Bivariate Normal Distribution

without Missing Observations


In this appendix, we derive the maximum likelihood estimators and prediction intervals for the bivariate normal distribution without missing observations.

A.1 Maximum Likelihood Estimators

If we do not consider the extra information \(X_{m+1},X_{m+2},\ldots,X_n\) and only use the first \(m\) observations in Chapter 2, then the joint likelihood function is

\[
L(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)=(2\pi)^{-m}\sigma_e^{-m}\sigma_x^{-m}\exp\left\{-\frac1{2\sigma_e^2}\sum_{j=1}^{m}\left(y_j-\mu_y-\beta(x_j-\mu_x)\right)^2-\frac1{2\sigma_x^2}\sum_{j=1}^{m}(x_j-\mu_x)^2\right\}
\]

The log of the joint likelihood function is

\[
l(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)=-m\ln(2\pi)-m\ln(\sigma_e)-m\ln(\sigma_x)
-\frac1{2\sigma_e^2}\sum_{j=1}^{m}\left(y_j-\mu_y-\beta(x_j-\mu_x)\right)^2-\frac1{2\sigma_x^2}\sum_{j=1}^{m}(x_j-\mu_x)^2\tag{A-1}
\]

Similarly, taking the derivative of the log-likelihood (A-1) with respect to each parameter and setting it to zero, we obtain the following estimating equations:

\[
\sum_{j=1}^{m}[y_j-\mu_y-\beta(x_j-\mu_x)]=0\tag{A-2}
\]
\[
-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[y_j-\mu_y-\beta(x_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{j=1}^{m}(x_j-\mu_x)=0\tag{A-3}
\]
\[
\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(y_j-\mu_y-\beta(x_j-\mu_x)\right)(x_j-\mu_x)=0\tag{A-4}
\]
\[
-\frac{m}{2\sigma_x^2}+\frac1{2\sigma_x^4}\sum_{j=1}^{m}(x_j-\mu_x)^2=0\tag{A-5}
\]
\[
-\frac{m}{2\sigma_e^2}+\frac1{2\sigma_e^4}\sum_{j=1}^{m}\left(y_j-\mu_y-\beta(x_j-\mu_x)\right)^2=0\tag{A-6}
\]


Simultaneously solving the estimating equations (A-2) to (A-6), we obtain the following maximum likelihood estimators:

\[
\hat\mu_{x\_no}=\frac1m\sum_{j=1}^{m}X_j=\bar X_m\tag{A-7}
\]
\[
\hat\mu_{y\_no}=\bar Y_m\tag{A-8}
\]
\[
\hat\beta_{\_no}=\frac{\sum_{j=1}^{m}(Y_j-\bar Y_m)(X_j-\bar X_m)}{\sum_{j=1}^{m}(X_j-\bar X_m)^2}\tag{A-9}
\]
\[
\hat\sigma^2_{x\_no}=\frac1m\sum_{j=1}^{m}(X_j-\bar X_m)^2\tag{A-10}
\]
\[
\hat\sigma^2_{e\_no}=\frac1m\sum_{j=1}^{m}\left[(Y_j-\bar Y_m)-\hat\beta(X_j-\bar X_m)\right]^2\tag{A-11}
\]

Since only (A-7), (A-8) and (A-10) differ from the corresponding estimators with extra information, and (A-7) and (A-10) are straightforward to derive, here we give the derivation for (A-8).

The conditional expectation of \(\hat\mu_y\) given \(x\) is

\[
E(\hat\mu_y\mid x)=E(\bar Y_m\mid x)=E\left(\frac1m\sum_{j=1}^{m}Y_j\,\middle|\,x_j\right)=\frac1m\sum_{j=1}^{m}[\mu_y+\beta(x_j-\mu_x)]
=\frac1m\left[m\mu_y+\beta\sum_{j=1}^{m}x_j-m\beta\mu_x\right]=\mu_y+\beta(\bar x_m-\mu_x)\tag{A-12}
\]

Then we have

\[
E(\hat\mu_y)=E(E(\hat\mu_y\mid X))=E[\mu_y+\beta(\bar X_m-\mu_x)]=\mu_y+\beta(\mu_x-\mu_x)=\mu_y\tag{A-13}
\]

So \(\hat\mu_y\) is an unbiased estimator of \(\mu_y\).

Similarly, the conditional variance of \(\hat\mu_y\) given \(x\) is

\[
\operatorname{Var}(\hat\mu_y\mid x)=\operatorname{Var}(\bar Y_m\mid x)=\operatorname{Var}\left(\frac1m\sum_{j=1}^{m}Y_j\,\middle|\,x_j\right)=\frac1{m^2}\sum_{j=1}^{m}\operatorname{Var}(Y_j\mid x_j)=\frac1{m^2}\cdot m\sigma_e^2=\frac{\sigma_e^2}{m}\tag{A-14}
\]

By the Law of Total Variance,

\[
\operatorname{Var}(\hat\mu_y)=E(\operatorname{Var}(\hat\mu_y\mid X))+\operatorname{Var}(E(\hat\mu_y\mid X))
\]

where

\[
\operatorname{Var}(E(\hat\mu_y\mid X))=\operatorname{Var}(\mu_y+\beta(\bar X_m-\mu_x))=\beta^2\operatorname{Var}(\bar X_m)=\frac{\beta^2\sigma_x^2}{m}\tag{A-15}
\]
\[
E(\operatorname{Var}(\hat\mu_y\mid X))=E\left(\frac{\sigma_e^2}{m}\right)=\frac{\sigma_e^2}{m}
\]

Hence

\[
\operatorname{Var}(\hat\mu_y)=\frac{\sigma_e^2}{m}+\frac{\beta^2\sigma_x^2}{m}\tag{A-16}
\]

A.2 The Prediction Interval

A.2.1 Usual prediction interval

The prediction value of \(Y_0\) is

\[
\hat Y_0\mid x,x_0=\hat\mu_y+\hat\beta(x_0-\bar x_m)\tag{A-17}
\]

By (A-12) and (A-14),

\[
(\hat\mu_y\mid x)\sim N\!\left(\mu_y+\beta(\bar x_m-\mu_x),\;\frac{\sigma_e^2}{m}\right)\tag{A-18}
\]

So the expectation of \(\hat Y_0\mid x,x_0\) is

\[
E(\hat Y_0\mid x,x_0)=E(\hat\mu_y\mid x,x_0)+E(\hat\beta\mid x,x_0)(x_0-\bar x_m)
=E(\hat\mu_y\mid x)+E(\hat\beta\mid x)(x_0-\bar x_m)
\]
\[
=\mu_y+\beta(\bar x_m-\mu_x)+\beta(x_0-\bar x_m)
=\mu_y+\beta(x_0-\mu_x)\tag{A-19}
\]

The variance of \(\hat Y_0\mid x,x_0\) is

\[
\operatorname{Var}(\hat Y_0\mid x,x_0)=\operatorname{Var}(\hat\mu_y\mid x)+(x_0-\bar x_m)^2\operatorname{Var}(\hat\beta\mid x)+2(x_0-\bar x_m)\operatorname{Cov}[(\hat\mu_y,\hat\beta)\mid x]
\]
\[
=\frac{\sigma_e^2}{m}+\frac{\sigma_e^2(x_0-\bar x_m)^2}{\sum_{j=1}^{m}(x_j-\bar x_m)^2}-0
=\frac{\sigma_e^2}{m}+\frac{\sigma_e^2(x_0-\bar x_m)^2}{\sum_{j=1}^{m}(x_j-\bar x_m)^2}\tag{A-20}
\]

Hence, the expectation of \((Y_0-\hat Y_0)\mid x,x_0\) is

\[
E(Y_0-\hat Y_0)\mid x,x_0=E(Y_0\mid x,x_0)-E(\hat Y_0\mid x,x_0)
=\mu_y+\beta(x_0-\mu_x)-[\mu_y+\beta(x_0-\mu_x)]=0\tag{A-21}
\]

And the variance of \((Y_0-\hat Y_0)\mid x,x_0\) is

\[
\operatorname{Var}(Y_0-\hat Y_0)\mid x,x_0=\operatorname{Var}(Y_0)\mid x,x_0+\operatorname{Var}(\hat Y_0)\mid x,x_0-2\operatorname{Cov}(\hat Y_0,Y_0)\mid x,x_0
\]
\[
=\sigma_e^2+\frac{\sigma_e^2}{m}+\frac{\sigma_e^2(x_0-\bar x_m)^2}{\sum_{j=1}^{m}(x_j-\bar x_m)^2}-2\cdot0
=\sigma_e^2\left[1+\frac1m+\frac{(x_0-\bar x_m)^2}{\sum_{j=1}^{m}(x_j-\bar x_m)^2}\right]\tag{A-22}
\]

Hence, the 95% prediction interval for \(Y_0\) given \(X=x\) and \(X_0=x_0\) is

\[
\hat Y_0\mid x,x_0\pm t_{0.025,\,m-2}\sqrt{\frac{m\hat\sigma_e^2}{m-2}\left[1+\frac1m+\frac{(x_0-\bar x_m)^2}{\sum_{j=1}^{m}(x_j-\bar x_m)^2}\right]}\tag{A-23}
\]
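The interval (A-23) is the familiar simple-regression prediction interval; a minimal sketch (the helper name is ours, and \(t_{0.025,m-2}\) is passed in so no scipy dependency is needed):

```python
import numpy as np

def usual_pi_bivariate(x, y, x0, t_crit):
    """Sketch of the usual 95% interval (A-23); t_crit is t_{0.025, m-2}."""
    m = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar)**2).sum()
    beta = ((x - xbar)*(y - ybar)).sum() / sxx         # (A-9)
    y0_hat = ybar + beta*(x0 - xbar)                   # (A-17), mu_hat_y = ybar
    s2 = ((y - ybar - beta*(x - xbar))**2).sum() / m   # MLE sigma_e^2, (A-11)
    half = t_crit*np.sqrt(m*s2/(m - 2)*(1 + 1/m + (x0 - xbar)**2/sxx))
    return y0_hat - half, y0_hat + half
```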


A.2.2 Prediction interval

Let

\[
\mathbf Z=\begin{bmatrix}Z_1\\Z_2\\Z_3\end{bmatrix}=\begin{bmatrix}\hat\mu_y\\\hat\mu_x\\\hat\beta\end{bmatrix}
\]

Then

\[
E(\mathbf Z\mid x_0)=\begin{bmatrix}\mu_y\\\mu_x\\\beta\end{bmatrix},\qquad
\operatorname{Cov}(\mathbf Z\mid x_0)=\begin{bmatrix}
\dfrac{\sigma_e^2}{m}+\dfrac{\beta^2\sigma_x^2}{m}&\dfrac{\beta\sigma_x^2}{m}&0\\[2mm]
\dfrac{\beta\sigma_x^2}{m}&\dfrac{\sigma_x^2}{m}&0\\[2mm]
0&0&\dfrac{\sigma_e^2}{(m-3)\sigma_x^2}
\end{bmatrix}\tag{A-24}
\]

The variance of \(\hat Y_0\mid x_0\) is

\[
\operatorname{Var}(\hat Y_0\mid x_0)=E[\hat Y_0\mid x_0-E(\hat Y_0\mid x_0)]^2\approx E\left[\sum_{j=1}^{3}(\hat y'_{0j}\mid x_0)(Z_j-\mu_{Z_j})\right]^2
\]
\[
=\sum_{j=1}^{3}[(\hat y'_{0j}\mid x_0)]^2\operatorname{Var}(Z_j)+2\sum_{i=1}^{3}\sum_{j=1,\,i\neq j}^{3}(\hat y'_{0i}\mid x_0)(\hat y'_{0j}\mid x_0)\operatorname{Cov}(Z_i,Z_j)
\]
\[
=1^2\operatorname{Var}(Z_1)+(-\beta)^2\operatorname{Var}(Z_2)+(x_0-\mu_x)^2\operatorname{Var}(Z_3)+2\cdot1\cdot(-\beta)\operatorname{Cov}(Z_1,Z_2)
\]
\[
=1^2\operatorname{Var}(\hat\mu_y)+(-\beta)^2\operatorname{Var}(\hat\mu_x)+(x_0-\mu_x)^2\operatorname{Var}(\hat\beta)+2\cdot1\cdot(-\beta)\operatorname{Cov}(\hat\mu_y,\hat\mu_x)
\]
\[
=\frac{\sigma_e^2}{m}+\frac{\beta^2\sigma_x^2}{m}+\frac{\beta^2\sigma_x^2}{m}+(x_0-\mu_x)^2\frac{\sigma_e^2}{(m-3)\sigma_x^2}-\frac{2\beta^2\sigma_x^2}{m}
=\sigma_e^2\left[\frac1m+\frac{(x_0-\mu_x)^2}{(m-3)\sigma_x^2}\right]\tag{A-25}
\]

Hence, the variance of \((Y_0-\hat Y_0)\mid x_0\) is

\[
\operatorname{Var}(Y_0-\hat Y_0)\mid x_0=\operatorname{Var}(Y_0)\mid x_0+\operatorname{Var}(\hat Y_0)\mid x_0-2\operatorname{Cov}(\hat Y_0,Y_0)\mid x_0
=\sigma_e^2+\sigma_e^2\left[\frac1m+\frac{(x_0-\mu_x)^2}{(m-3)\sigma_x^2}\right]-2\cdot0
\]
\[
=\sigma_e^2\left[1+\frac1m+\frac{(x_0-\mu_x)^2}{(m-3)\sigma_x^2}\right]\tag{A-26}
\]

A.2.3 Unconditional prediction interval

Let

\[
\mathbf Z=\begin{bmatrix}Z_1\\Z_2\\Z_3\\Z_4\end{bmatrix}=\begin{bmatrix}\hat\mu_y\\\hat\mu_x\\\hat\beta\\X_0\end{bmatrix}
\]

Then

\[
E(\mathbf Z)=\boldsymbol\mu_Z=\begin{bmatrix}\mu_y\\\mu_x\\\beta\\\mu_x\end{bmatrix},\qquad
\operatorname{Cov}(\mathbf Z)=\begin{bmatrix}
\dfrac{\sigma_e^2}{m}+\dfrac{\beta^2\sigma_x^2}{m}&\dfrac{\beta\sigma_x^2}{m}&0&0\\[2mm]
\dfrac{\beta\sigma_x^2}{m}&\dfrac{\sigma_x^2}{m}&0&0\\[2mm]
0&0&\dfrac{\sigma_e^2}{(m-3)\sigma_x^2}&0\\[2mm]
0&0&0&\sigma_x^2
\end{bmatrix}\tag{A-27}
\]

The variance of \(\hat Y_0\) is

\[
\operatorname{Var}(\hat Y_0)=E[\hat Y_0-E(\hat Y_0)]^2\approx E\left[\sum_{j=1}^{4}\hat y'_{0j}(\boldsymbol\mu_Z)(Z_j-\mu_{Z_j})\right]^2
\]
\[
=\sum_{j=1}^{4}[\hat y'_{0j}(\boldsymbol\mu_Z)]^2\operatorname{Var}(Z_j)+2\sum_{i=1}^{4}\sum_{j=1,\,i\neq j}^{4}\hat y'_{0i}(\boldsymbol\mu_Z)\hat y'_{0j}(\boldsymbol\mu_Z)\operatorname{Cov}(Z_i,Z_j)
\]
\[
=1^2\operatorname{Var}(Z_1)+(-\beta)^2\operatorname{Var}(Z_2)+0\cdot\operatorname{Var}(Z_3)+\beta^2\operatorname{Var}(Z_4)+2\cdot1\cdot(-\beta)\operatorname{Cov}(Z_1,Z_2)
\]
\[
=1^2\operatorname{Var}(\hat\mu_y)+(-\beta)^2\operatorname{Var}(\hat\mu_x)+0\cdot\operatorname{Var}(\hat\beta)+\beta^2\operatorname{Var}(X_0)+2\cdot1\cdot(-\beta)\operatorname{Cov}(\hat\mu_y,\hat\mu_x)
\]
\[
=\frac{\sigma_e^2}{m}+\frac{\beta^2\sigma_x^2}{m}+\frac{\beta^2\sigma_x^2}{m}+0+\beta^2\sigma_x^2-\frac{2\beta^2\sigma_x^2}{m}
=\frac{\sigma_e^2}{m}+\beta^2\sigma_x^2\tag{A-28}
\]

Hence, the variance of \((Y_0-\hat Y_0)\) is

\[
\operatorname{Var}(Y_0-\hat Y_0)=\operatorname{Var}(Y_0)+\operatorname{Var}(\hat Y_0)-2\operatorname{Cov}(\hat Y_0,Y_0)
=\sigma_y^2+\frac{\sigma_e^2}{m}+\beta^2\sigma_x^2-2\beta^2\sigma_x^2
=\sigma_y^2+\frac{\sigma_e^2}{m}-\beta^2\sigma_x^2
=\sigma_e^2\left[1+\frac1m\right]\tag{A-29}
\]

The 95% prediction interval for \(Y_0\) is

\[
\hat Y_0\pm z_{0.025}\sqrt{S_e^2\left(1+\frac1m\right)}\tag{A-30}
\]
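Without the extra \(x\)-observations the unconditional factor reduces to \(1+1/m\); a one-line sketch (function name ours):

```python
import math

def unconditional_pi_halfwidth_biv(s2e, m, z=1.959963984540054):
    """Half-width of (A-30): the variance factor is 1 + 1/m, free of x0."""
    return z * math.sqrt(s2e * (1 + 1/m))
```

Comparing with (3-102), the missing-data setting adds the term \((n-m)p/(nm(m-p-2))\), which this no-extra-information interval does not have.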


Appendix B

Fisher Information Matrix for Bivariate Normal Distribution


Since we have five parameters, the Fisher information matrix \(\boldsymbol I(\theta)\) is a \(5\times5\) matrix whose \((j,k)\) entry is given by

\[
I_{\theta_j\theta_k}=-E\left[\frac{\partial^2 l(X,Y,\theta)}{\partial\theta_j\,\partial\theta_k}\right]
\]

where \(l(X,Y,\theta)=l(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)\) is the log of the joint likelihood function (2-8).

\[
I_{\mu_x\mu_x}=-E\left[\frac{\partial^2 l(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)}{\partial\mu_x^2}\right]
=-E\left\{\frac{\partial}{\partial\mu_x}\left[-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{i=1}^{n}(X_i-\mu_x)\right]\right\}
\]
\[
=-E\left\{-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}\beta+\frac1{\sigma_x^2}\sum_{i=1}^{n}(-1)\right\}
=\frac{m\beta^2}{\sigma_e^2}+\frac{n}{\sigma_x^2}
\]

\[
I_{\mu_x\mu_y}=-E\left[\frac{\partial^2 l}{\partial\mu_x\,\partial\mu_y}\right]
=-E\left\{\frac{\partial}{\partial\mu_y}\left[-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{i=1}^{n}(X_i-\mu_x)\right]\right\}
\]
\[
=-E\left\{-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}(-1)\right\}=-\frac{m\beta}{\sigma_e^2}
\]

\[
I_{\mu_x\beta}=-E\left[\frac{\partial^2 l}{\partial\mu_x\,\partial\beta}\right]
=-E\left\{\frac{\partial}{\partial\beta}\left[-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{i=1}^{n}(X_i-\mu_x)\right]\right\}
\]
\[
=-E\left\{-\frac1{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-2\beta(X_j-\mu_x)]\right\}
=\frac1{\sigma_e^2}\sum_{j=1}^{m}E[Y_j-\mu_y-2\beta(X_j-\mu_x)]=0
\]

\[
I_{\mu_x\sigma_x^2}=-E\left[\frac{\partial^2 l}{\partial\mu_x\,\partial\sigma_x^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_x^2}\left[-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{i=1}^{n}(X_i-\mu_x)\right]\right\}
\]
\[
=-E\left\{-\frac1{\sigma_x^4}\sum_{i=1}^{n}(X_i-\mu_x)\right\}
=\frac1{\sigma_x^4}\sum_{i=1}^{n}[E(X_i)-\mu_x]=0
\]

\[
I_{\mu_x\sigma_e^2}=-E\left[\frac{\partial^2 l}{\partial\mu_x\,\partial\sigma_e^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_e^2}\left[-\frac{\beta}{\sigma_e^2}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]+\frac1{\sigma_x^2}\sum_{i=1}^{n}(X_i-\mu_x)\right]\right\}
\]
\[
=-E\left\{\frac{\beta}{\sigma_e^4}\sum_{j=1}^{m}[Y_j-\mu_y-\beta(X_j-\mu_x)]\right\}
=-\frac{\beta}{\sigma_e^4}\sum_{j=1}^{m}[E(Y_j)-\mu_y-\beta(E(X_j)-\mu_x)]=0
\]

\[
I_{\mu_y\mu_y}=-E\left[\frac{\partial^2 l}{\partial\mu_y^2}\right]
=-E\left\{\frac{\partial}{\partial\mu_y}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)\right]\right\}
=-E\left\{-\frac{m}{\sigma_e^2}\right\}=\frac{m}{\sigma_e^2}
\]

\[
I_{\mu_y\beta}=-E\left[\frac{\partial^2 l}{\partial\mu_y\,\partial\beta}\right]
=-E\left\{\frac{\partial}{\partial\beta}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)\right]\right\}
=-E\left\{-\frac1{\sigma_e^2}\sum_{j=1}^{m}(X_j-\mu_x)\right\}
=\frac1{\sigma_e^2}\sum_{j=1}^{m}E(X_j-\mu_x)=0
\]

\[
I_{\mu_y\sigma_x^2}=-E\left[\frac{\partial^2 l}{\partial\mu_y\,\partial\sigma_x^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_x^2}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)\right]\right\}=0
\]

\[
I_{\mu_y\sigma_e^2}=-E\left[\frac{\partial^2 l}{\partial\mu_y\,\partial\sigma_e^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_e^2}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)\right]\right\}
\]
\[
=-E\left\{-\frac1{\sigma_e^4}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)\right\}
=\frac1{\sigma_e^4}\sum_{j=1}^{m}E\left[Y_j-\mu_y-\beta(X_j-\mu_x)\right]=0
\]

\[
I_{\beta\beta}=-E\left[\frac{\partial^2 l(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)}{\partial\beta^2}\right]
=-E\left\{\frac{\partial}{\partial\beta}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)(X_j-\mu_x)\right]\right\}
\]
\[
=\frac1{\sigma_e^2}E\left\{\sum_{j=1}^{m}(X_j-\mu_x)^2\right\}
=\frac{\sigma_x^2}{\sigma_e^2}E\left\{\sum_{j=1}^{m}\frac{(X_j-\mu_x)^2}{\sigma_x^2}\right\}
=\frac{m\sigma_x^2}{\sigma_e^2}
\]

since \(\sum_{j=1}^{m}(X_j-\mu_x)^2/\sigma_x^2\sim\chi^2(m)\).

\[
I_{\beta\sigma_x^2}=-E\left\{\frac{\partial}{\partial\sigma_x^2}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)(X_j-\mu_x)\right]\right\}=0
\]

\[
I_{\beta\sigma_e^2}=-E\left\{\frac{\partial}{\partial\sigma_e^2}\left[\frac1{\sigma_e^2}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)(X_j-\mu_x)\right]\right\}
\]
\[
=-E\left\{-\frac1{\sigma_e^4}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)(X_j-\mu_x)\right\}
=\frac1{\sigma_e^4}E\left\{\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)(X_j-\mu_x)\right\}=0
\]

\[
I_{\sigma_x^2\sigma_x^2}=-E\left[\frac{\partial^2 l}{\partial(\sigma_x^2)^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_x^2}\left[-\frac{n}{2\sigma_x^2}+\frac1{2\sigma_x^4}\sum_{i=1}^{n}(X_i-\mu_x)^2\right]\right\}
=-E\left\{\frac{n}{2\sigma_x^4}-\frac1{\sigma_x^6}\sum_{i=1}^{n}(X_i-\mu_x)^2\right\}
\]
\[
=-\frac{n}{2\sigma_x^4}+\frac1{\sigma_x^4}E\left\{\sum_{i=1}^{n}\frac{(X_i-\mu_x)^2}{\sigma_x^2}\right\}
=-\frac{n}{2\sigma_x^4}+\frac{n}{\sigma_x^4}=\frac{n}{2\sigma_x^4}
\]

since \(\sum_{i=1}^{n}(X_i-\mu_x)^2/\sigma_x^2\sim\chi^2(n)\).

\[
I_{\sigma_x^2\sigma_e^2}=-E\left\{\frac{\partial}{\partial\sigma_e^2}\left[-\frac{n}{2\sigma_x^2}+\frac1{2\sigma_x^4}\sum_{i=1}^{n}(X_i-\mu_x)^2\right]\right\}=0
\]

\[
I_{\sigma_e^2\sigma_e^2}=-E\left[\frac{\partial^2 l}{\partial(\sigma_e^2)^2}\right]
=-E\left\{\frac{\partial}{\partial\sigma_e^2}\left[-\frac{m}{2\sigma_e^2}+\frac1{2\sigma_e^4}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)^2\right]\right\}
\]
\[
=-E\left\{\frac{m}{2\sigma_e^4}-\frac1{\sigma_e^6}\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)^2\right\}
=-\frac{m}{2\sigma_e^4}+\frac1{\sigma_e^6}E\left\{\sum_{j=1}^{m}\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)^2\right\}
\]
\[
=-\frac{m}{2\sigma_e^4}+\frac1{\sigma_e^6}\sum_{j=1}^{m}\sigma_e^2
=-\frac{m}{2\sigma_e^4}+\frac{m\sigma_e^2}{\sigma_e^6}=\frac{m}{2\sigma_e^4}
\]

where

\[
E\left[\left(Y_j-\mu_y-\beta(x_j-\mu_x)\right)^2\,\middle|\,x\right]
=E\left[(Y_j-\mu_y)^2-2\beta(x_j-\mu_x)(Y_j-\mu_y)+\beta^2(x_j-\mu_x)^2\,\middle|\,x\right]
\]
\[
=E[(Y_j-\mu_y)^2\mid x]-2\beta(x_j-\mu_x)E[(Y_j-\mu_y)\mid x]+\beta^2(x_j-\mu_x)^2
\]
\[
=\sigma_e^2+\beta^2(x_j-\mu_x)^2-2\beta^2(x_j-\mu_x)^2+\beta^2(x_j-\mu_x)^2=\sigma_e^2
\]

Hence

\[
E\left[\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)^2\right]=E\left\{E\left[\left(Y_j-\mu_y-\beta(X_j-\mu_x)\right)^2\,\middle|\,X\right]\right\}=E(\sigma_e^2)=\sigma_e^2
\]

The Fisher Information Matrix is

퐼휇 휇 퐼휇 휇 퐼휇 훽 퐼휇 휎2 퐼 2 푦 푦 푦 푥 푦 푦 푥 휇푦휎푒 퐼휇 휇 퐼휇 휇 퐼휇 훽 퐼휇 휎2 퐼 2 푥 푦 푥 푥 푥 푥 푥 휇푥휎푒 2 2 퐼 퐼 퐼 퐼 2 퐼 2 푰(휇푦, 휇푥, 훽, 휎푥 , 휎푒 ) = 훽휇푦 훽휇푥 훽훽 훽휎푥 훽휎 푒 퐼휎2휇 퐼휎2휇 퐼휎2훽 퐼휎2휎2 퐼 2 2 푥 푦 푥 푥 푥 푥 푥 휎푥휎푒 퐼 2 퐼 2 퐼 2 퐼 2 2 퐼 2 2 { 휎푒휇푦 휎푒휇푥 휎푒훽 휎푒휎푥 휎푒휎푒 }

106

푚 푚훽 2 − 2 0 0 0 휎푒 휎푒 푚훽 푚훽2 푛 − 2 2 + 2 0 0 0 휎 휎푒 휎푥 2 = 푚휎푥 0 0 2 0 0 휎푒 푛 0 0 0 4 0 2휎푥 푚 0 0 0 0 4 { 2휎푒 }

The inverse of \(\boldsymbol{I}(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)\) is

\[
\boldsymbol{I}^{-1}(\mu_y,\mu_x,\beta,\sigma_x^2,\sigma_e^2)=
\begin{pmatrix}
\dfrac{\sigma_e^2}{m}+\dfrac{\beta^2\sigma_x^2}{n} & \dfrac{\beta\sigma_x^2}{n} & 0 & 0 & 0\\[2mm]
\dfrac{\beta\sigma_x^2}{n} & \dfrac{\sigma_x^2}{n} & 0 & 0 & 0\\[2mm]
0 & 0 & \dfrac{\sigma_e^2}{m\sigma_x^2} & 0 & 0\\[2mm]
0 & 0 & 0 & \dfrac{2\sigma_x^4}{n} & 0\\[2mm]
0 & 0 & 0 & 0 & \dfrac{2\sigma_e^4}{m}
\end{pmatrix}
\]
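Because the matrix is block diagonal, the inverse is easy to verify numerically. The following standalone Python sketch (with arbitrary illustrative parameter values, not values from the dissertation) multiplies the information matrix by the claimed inverse and confirms that the product is the identity:

```python
# Numerical check that I * I^{-1} = identity for the 5x5 Fisher information
# matrix above. Here sx2 and se2 denote sigma_x^2 and sigma_e^2; all values
# are hypothetical and chosen only for the check.
m, n, beta, sx2, se2 = 20.0, 40.0, 0.5, 2.0, 1.5

info = [
    [m / se2, -m * beta / se2, 0.0, 0.0, 0.0],
    [-m * beta / se2, m * beta ** 2 / se2 + n / sx2, 0.0, 0.0, 0.0],
    [0.0, 0.0, m * sx2 / se2, 0.0, 0.0],
    [0.0, 0.0, 0.0, n / (2 * sx2 ** 2), 0.0],
    [0.0, 0.0, 0.0, 0.0, m / (2 * se2 ** 2)],
]
info_inv = [
    [se2 / m + beta ** 2 * sx2 / n, beta * sx2 / n, 0.0, 0.0, 0.0],
    [beta * sx2 / n, sx2 / n, 0.0, 0.0, 0.0],
    [0.0, 0.0, se2 / (m * sx2), 0.0, 0.0],
    [0.0, 0.0, 0.0, 2 * sx2 ** 2 / n, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2 * se2 ** 2 / m],
]

# plain 5x5 matrix multiplication, no external libraries needed
prod = [[sum(info[i][k] * info_inv[k][j] for k in range(5)) for j in range(5)]
        for i in range(5)]
max_err = max(abs(prod[i][j] - (1.0 if i == j else 0.0))
              for i in range(5) for j in range(5))
print(max_err)  # essentially zero
```

The check works for any admissible parameter values because the top-left 2×2 block and the three diagonal entries invert independently.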


Appendix C

Some derivations used in the dissertation


In this appendix, we derive some formulas used in the dissertation.

\[
\sum_{j=1}^{m}(y_j-\bar{y}_m)(x_j-\bar{x}_n)
= \sum_{j=1}^{m}(y_j x_j-\bar{y}_m x_j-y_j\bar{x}_n+\bar{y}_m\bar{x}_n)
\]
\[
= \sum_{j=1}^{m}y_j x_j-\bar{y}_m\sum_{j=1}^{m}x_j-\bar{x}_n\sum_{j=1}^{m}y_j+m\bar{y}_m\bar{x}_n
= \sum_{j=1}^{m}y_j x_j-\bar{y}_m\sum_{j=1}^{m}x_j-m\bar{x}_n\bar{y}_m+m\bar{y}_m\bar{x}_n
\]
\[
= \sum_{j=1}^{m}y_j x_j-\bar{y}_m\sum_{j=1}^{m}x_j
= \sum_{j=1}^{m}x_j(y_j-\bar{y}_m)
= \sum_{j=1}^{m}y_j(x_j-\bar{x}_m)
\qquad\text{(C-1)}
\]

\[
\sum_{j=1}^{m}(x_j-\bar{x}_m)(x_j-\bar{x}_n)
= \sum_{j=1}^{m}(x_j^2-\bar{x}_m x_j-x_j\bar{x}_n+\bar{x}_m\bar{x}_n)
\]
\[
= \sum_{j=1}^{m}x_j^2-\bar{x}_m\sum_{j=1}^{m}x_j-\bar{x}_n\sum_{j=1}^{m}x_j+m\bar{x}_m\bar{x}_n
= \sum_{j=1}^{m}x_j^2-\bar{x}_m\sum_{j=1}^{m}x_j-m\bar{x}_n\bar{x}_m+m\bar{x}_m\bar{x}_n
\]
\[
= \sum_{j=1}^{m}x_j^2-\bar{x}_m\sum_{j=1}^{m}x_j
= \sum_{j=1}^{m}x_j^2-\bar{x}_m\sum_{j=1}^{m}x_j-m\bar{x}_m\bar{x}_m+m\bar{x}_m\bar{x}_m
\]
\[
= \sum_{j=1}^{m}x_j^2-\bar{x}_m\sum_{j=1}^{m}x_j-\bar{x}_m\sum_{j=1}^{m}x_j+\sum_{j=1}^{m}\bar{x}_m\bar{x}_m
= \sum_{j=1}^{m}(x_j-\bar{x}_m)^2
\qquad\text{(C-2)}
\]
\[
\sum_{j=1}^{m}x_j^2-m\bar{x}_m\bar{x}_m
= \sum_{j=1}^{m}x_j^2-\sum_{j=1}^{m}x_j\bar{x}_m
= \sum_{j=1}^{m}x_j(x_j-\bar{x}_m)
\qquad\text{(C-3)}
\]
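Identities (C-1) through (C-3) hold for any data, so they can be verified numerically. A standalone Python spot-check (sample sizes and random data chosen arbitrarily):

```python
import random

# Numerical spot-check of identities (C-1)-(C-3): the cross-sums involving
# xbar_n reduce to sums centered at xbar_m. Sizes n, m are arbitrary.
random.seed(2)
n, m = 12, 7
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [random.uniform(0.0, 10.0) for _ in range(m)]
xbar_m = sum(x[:m]) / m          # mean of the first m x-values
xbar_n = sum(x) / n              # mean of all n x-values
ybar_m = sum(y) / m

lhs1 = sum((y[j] - ybar_m) * (x[j] - xbar_n) for j in range(m))
rhs1 = sum(y[j] * (x[j] - xbar_m) for j in range(m))            # (C-1)

lhs2 = sum((x[j] - xbar_m) * (x[j] - xbar_n) for j in range(m))
rhs2 = sum((x[j] - xbar_m) ** 2 for j in range(m))              # (C-2)

lhs3 = sum(x[j] ** 2 for j in range(m)) - m * xbar_m ** 2
rhs3 = sum(x[j] * (x[j] - xbar_m) for j in range(m))            # (C-3)

print(abs(lhs1 - rhs1), abs(lhs2 - rhs2), abs(lhs3 - rhs3))  # all ~0
```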


\[
\mathrm{Cov}\bigl[(Y_j,\hat{\beta})\,\big|\,x\bigr]
= \mathrm{Cov}\left(Y_j\,\Big|\,x,\ \frac{\sum_{j=1}^{m}Y_j(x_j-\bar{x}_m)}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}\,\Big|\,x\right)
\]
\[
= \frac{1}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}\,
\mathrm{Cov}\bigl[Y_j|x,\ Y_1(x_1-\bar{x}_m)|x+Y_2(x_2-\bar{x}_m)|x+\cdots+Y_j(x_j-\bar{x}_m)|x+\cdots+Y_m(x_m-\bar{x}_m)|x\bigr]
\]
\[
= \frac{1}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}\,\mathrm{Cov}\bigl[Y_j|x,\ Y_j(x_j-\bar{x}_m)|x\bigr]
\qquad\text{since } Y_j|x \text{ and } Y_k|x \text{ are independent for } j\neq k
\]
\[
= \frac{x_j-\bar{x}_m}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}\,\mathrm{Var}(Y_j|x)
= \frac{(x_j-\bar{x}_m)\sigma_e^2}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}
\qquad\text{(C-4)}
\]

\[
\mathrm{Cov}\bigl[(\bar{Y},\hat{\beta})\,\big|\,x\bigr]
= \frac{1}{m}\,\mathrm{Cov}\bigl[Y_1|x+Y_2|x+\cdots+Y_m|x,\ \hat{\beta}|x\bigr]
= \frac{1}{m}\sum_{j=1}^{m}\mathrm{Cov}\bigl[(Y_j,\hat{\beta})\,\big|\,x\bigr]
\]
\[
= \frac{1}{m}\sum_{j=1}^{m}\frac{(x_j-\bar{x}_m)\sigma_e^2}{\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}
= \frac{\sigma_e^2\sum_{j=1}^{m}(x_j-\bar{x}_m)}{m\sum_{j=1}^{m}(x_j-\bar{x}_m)^2}
= 0
\qquad\text{(C-5)}
\]

Hence, \(\bar{Y}|x\) and \(\hat{\beta}|x\) are not correlated.
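Result (C-5) can also be seen in simulation: holding the design points fixed and regenerating the responses, the sample covariance between \(\bar{Y}\) and \(\hat{\beta}\) is near zero. A standalone Python sketch (all parameter values and design points are hypothetical):

```python
import random

# Simulation check of (C-5): with x fixed, Cov(Ybar, beta_hat | x) = 0 because
# sum_j (x_j - xbar) = 0. Parameter values are illustrative only.
random.seed(3)
mu_y, mu_x, beta, sigma_e = 3.0, 4.5, 0.7, 1.0
x = [float(i) for i in range(10)]           # fixed design points
m = len(x)
xbar = sum(x) / m
sxx = sum((xj - xbar) ** 2 for xj in x)

ybars, bhats = [], []
for _ in range(20000):
    y = [mu_y + beta * (xj - mu_x) + random.gauss(0.0, sigma_e) for xj in x]
    ybars.append(sum(y) / m)
    # slope estimator sum_j Y_j (x_j - xbar) / Sxx, as in (C-1)
    bhats.append(sum(yj * (xj - xbar) for yj, xj in zip(y, x)) / sxx)

mb = sum(ybars) / len(ybars)
mh = sum(bhats) / len(bhats)
cov_yb = sum((a - mb) * (b - mh) for a, b in zip(ybars, bhats)) / (len(ybars) - 1)
print(cov_yb)  # close to 0
```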


Appendix D

R codes


In this appendix, we list the R code for the example estimators and the variance comparison.

D.1 R Code for the Estimators

install.packages("bayesSurv")   #install this package to calculate the sample covariance
install.packages("MVN")         #multivariate normality tests
install.packages("usdm")        #multicollinearity test
library(MVN)

varNames <- c("Y", "GPA", "GVerbal", "GQuantitative", "GAnalytic", "TOEFL")
varFormats <- c("numeric",   # Master GPA
                "numeric",   # Undergraduate GPA
                "numeric",   # GRE Verbal
                "numeric",   # GRE Quantitative
                "numeric",   # GRE Analytic
                "numeric")   # TOEFL
directory <- "D:/PhD Research/Book data"
filename <- "GPA with Block Missing realdata.txt"
fullFile <- paste(directory, filename, sep = "/")
mydata <- read.csv(fullFile, stringsAsFactors = FALSE, nrow = -1,
                   col.names = varNames, colClasses = varFormats, sep = "\t")
mydatas <- head(subset.matrix(mydata, select = GVerbal:TOEFL,
                              (!is.na(mydata[,4]))), 40)   #obtain the sample data
n <- 40
m <- 20
mydatas_m <- subset(mydatas, (!is.na(mydatas[,4])))   #ignore TOEFL missing data

# Calculate 5 estimates - multivariate case
one_n <- as.matrix(rep(1, 40))   #vector of ones, 40 by 1
one_m <- as.matrix(rep(1, 20))
multi_xndata <- as.matrix(subset(mydatas, select = GVerbal:GAnalytic))
m_xn <- as.matrix(colMeans(multi_xndata))
diff_xn <- multi_xndata - one_n %*% t(m_xn)
ss0 <- diff_xn[1,] %*% t(diff_xn[1,])
ssn <- 0 * ss0
for (i in 1:40) {
  temp <- diff_xn[i,] %*% t(diff_xn[i,])
  ssn <- temp + ssn
}
sigma_xxn <- ssn / n   #sigma_xx_n


library(bayesSurv)
s_xx_n0 <- (n - 1) * sampleCovMat(multi_xndata) / n   #for check purposes
ym <- mydatas_m[,4]
y_bar_m <- sum(ym) / m
multi_xmdata <- as.matrix(subset(mydatas_m, select = GVerbal:GAnalytic))
m_xm <- as.matrix(colMeans(multi_xmdata))
diff_xm <- multi_xmdata - one_m %*% t(m_xm)
ssm <- 0 * ss0
for (i in 1:20) {
  temp <- diff_xm[i,] %*% t(diff_xm[i,])
  ssm <- temp + ssm
}
sigma_xxm <- ssm / m   #Sigma_xx_m
s_xx_m0 <- (m - 1) * sampleCovMat(multi_xmdata) / m   #check
d <- m_xm - m_xn   #difference between mean xm and mean xn
diff_ym <- ym - y_bar_m
sxy0 <- diff_ym[1] * diff_xm[1,]
sxy <- 0 * sxy0
for (i in 1:20) {
  temp <- diff_ym[i] * diff_xm[i,]
  sxy <- temp + sxy
}
sxy_m <- as.matrix(sxy)
beta_hat <- solve(ssm) %*% sxy_m
mu_yhat <- y_bar_m - t(beta_hat) %*% d   #mu_y_hat
bm <- 0
for (i in 1:20) {
  temp <- (diff_ym[i] - t(beta_hat) %*% diff_xm[i,])^2
  bm <- temp + bm
}
sigma_e_hat <- bm / m   #Sigma_e^2

#check beta against lm()
fit <- lm(TOEFL ~ GVerbal + GQuantitative + GAnalytic,
          data = data.frame(mydatas_m))   #lm() requires a data frame
fit
beta <- coefficients(fit)
beta

##############################################################
cov(mydatas_m[,1], mydatas_m[,4])
cor(mydatas_m[,1], mydatas_m[,4])   #rho = 0.3252701 -- selected since rho^2 > 1/18
cor(mydatas_m[,2], mydatas_m[,4])   #rho = -0.07477673
cor(mydatas_m[,3], mydatas_m[,4])   #rho = 0.166066

#Calculate 5 estimates for bivariate cases
#1 - Verbal score as x
bdata <- subset(mydatas, select = c(GVerbal, TOEFL))
n <- 40
m <- 20
xn <- bdata[,1]
x_bar_n <- sum(xn) / n


bdata_m <- subset(bdata, (!is.na(bdata[,2])))
xm <- bdata_m[,1]
ym <- bdata_m[,2]
x_bar_m <- sum(xm) / m
y_bar_m <- sum(ym) / m
beta_num <- sum((ym - mean(ym)) * (xm - mean(xm)))
beta_den <- sum((xm - mean(xm))^2)
beta <- beta_num / beta_den
muy_hat <- y_bar_m - beta * (x_bar_m - x_bar_n)
sigma_xx_hat_n <- sum((xn - mean(xn))^2) / n
sigma_xx_hat_m <- sum((xm - mean(xm))^2) / m
sigma_e_hat <- sum((ym - y_bar_m - beta * (xm - x_bar_m))^2) / m
biout <- data.frame(x_bar_n, muy_hat, beta, sigma_xx_hat_n, sigma_e_hat)
biout_no <- data.frame(x_bar_m, y_bar_m, beta, sigma_xx_hat_m, sigma_e_hat)

Similar code is used with the Quantitative score and the Analytic score as x.

#Bivariate normality tests
mydatas1 <- subset.matrix(mydatas_m, select = c(GVerbal, TOEFL))
res1 <- mardiaTest(mydatas1)   #Mardia's multivariate normality test
mvnPlot(res1, type = "persp", default = TRUE)     #perspective plot
mvnPlot(res1, type = "contour", default = TRUE)   #contour plot
mydatas2 <- subset.matrix(mydatas_m, select = c(GQuantitative, TOEFL))
res2 <- hzTest(mydatas2)       #Henze-Zirkler's multivariate normality test
mvnPlot(res2, type = "persp", default = TRUE)
mydatas3 <- subset.matrix(mydatas_m, select = c(GAnalytic, TOEFL))
res3 <- hzTest(mydatas3)       #Henze-Zirkler's multivariate normality test
mvnPlot(res3, type = "persp", default = TRUE)

##Multivariate normality tests
hzTest(mydatas, cov = TRUE, qqplot = TRUE)            #Henze-Zirkler's multivariate normality test
mardiaTest(mydatas, cov = TRUE, qqplot = TRUE)        #Mardia's multivariate normality test
hzTest(multi_xndata, cov = TRUE, qqplot = FALSE)      #Henze-Zirkler's multivariate normality test
mardiaTest(multi_xndata, cov = TRUE, qqplot = TRUE)   #Mardia's multivariate normality test

#Collinearity test -- if VIF > 4, assume multicollinearity and remove the variable
library(usdm)
xn_data <- data.frame(multi_xndata)   #vif() requires a data frame
vif(xn_data)

D.2 R Code of Simulation for Variance Comparison

library(MASS)

#1 - Bivariate
mu <- c(420, 540)   #use GRE Verbal and TOEFL means
Sigma <- matrix(c(3450, 600, 600, 970), 2, 2)   #Verbal and TOEFL covariance
n0 <- 2000

#Sigma <- matrix(c(3450, 1200, 1200, 970), 2, 2)   #high-correlation example
set.seed(2017612)
fobs <- mvrnorm(n0, mu = mu, Sigma = Sigma)   #x0 and y0
cov(fobs)
x0 <- fobs[,1]
y0 <- fobs[,2]


fob_head <- head(fobs, 1)
x00 <- fob_head[,1]
y00 <- fob_head[,2]

#Obtain estimates
n <- 40
m <- 20   #m = 20: miss 50%; m = 28: miss 30%; m = 36: miss 10%
sig_xx_n <- 0
sig_xx_m <- 0
xbar_m <- 0
xbar_n <- 0
ybar_m <- 0
mu_y <- 0
beta <- 0
sig_e <- 0
set.seed(20176132)
for (i in 1:10000) {
  simdata <- mvrnorm(n, mu = mu, Sigma = Sigma)
  x <- simdata[,1]
  y <- simdata[,2]

  xbar_n[i] <- mean(x)
  sig_xx_n[i] <- sum((x - mean(x))^2) / n

  sub_data <- simdata[1:m, 1:2]
  xm <- sub_data[,1]
  ym <- sub_data[,2]
  sig_xx_m[i] <- sum((xm - mean(xm))^2) / m

  xbar_m[i] <- mean(xm)
  ybar_m[i] <- mean(ym)

  beta_num <- sum((ym - mean(ym)) * (xm - mean(xm)))
  beta_den <- sum((xm - mean(xm))^2)
  beta[i] <- beta_num / beta_den
  mu_y[i] <- ybar_m[i] - beta[i] * (xbar_m[i] - xbar_n[i])

  sig_e[i] <- sum((ym - ybar_m[i] - beta[i] * (xm - xbar_m[i]))^2) / m
}
beta_xbar_nCov <- cov(xbar_n, beta)   #check covariances
beta_xbar_mCov <- cov(xbar_m, beta)
mu_y_mu_x_cov <- cov(mu_y, xbar_n)
beta_mu_yCov <- cov(beta, mu_y)

#Comparison of variances of mu_y_hat
c3 <- 1 + ((n - m) / (n * (m - 3)))
var_muy_hat <- (c3 * sig_e) / (m - 2) + ((beta^2) * sig_xx_n) / (n - 1)
var_muy_hat_m <- sig_e / (m - 2) + ((beta^2) * sig_xx_m) / (m - 1)
m_var_muy_hat <- mean(var_muy_hat)
var_muy_hat_sd <- (var(var_muy_hat))^0.5
m_var_muy_hat_m <- mean(var_muy_hat_m)
var_muy_hat_m_sd <- (var(var_muy_hat_m))^0.5
len_out_var <- data.frame(n, m, m_var_muy_hat, var_muy_hat_sd,
                          m_var_muy_hat_m, var_muy_hat_m_sd)


References

1. Allison, P. D. (2002). Missing Data. SAGE University Papers.

2. Anderson, T. W. (1957). Maximum Likelihood Estimates for a Multivariate Normal Distribution when Some Observations Are Missing. Journal of the American Statistical Association, Vol. 52, No. 278, pp. 200-203.

3. Anderson, T. W. (2015). An Introduction to Multivariate Statistical Analysis, Third Edition (reprint). Wiley.

4. Chung, Hie-Choon and Han, Chien-Pai (2000). Discriminant Analysis When a Block of Observations Is Missing. Annals of the Institute of Statistical Mathematics, Vol. 52, No. 3, pp. 544-556.

5. Edgett, G. L. (1956). Multiple Regression with Missing Observations Among the Independent Variables. Journal of the American Statistical Association, Vol. 51, No. 273, pp. 122-131.

6. Han, Chien-Pai and Li, Yan (2011). Regression Analysis with Block Missing Values and Variables Selection. Pakistan Journal of Statistics and Operation Research, 7, 391-400.

7. Hogg, R. V., McKean, J. W. and Craig, A. T. (2013). Introduction to Mathematical Statistics, 7th Edition. Pearson Education, Inc.

8. Howell, D. C. (2008). The Analysis of Missing Data. In Outhwaite, W. and Turner, S. (Eds.), Handbook of Social Science Methodology. London: Sage.

9. Johnson, Richard A. and Wichern, Dean W. (1998). Applied Multivariate Statistical Analysis, Fourth Edition. Prentice-Hall, Inc., NJ.

10. Kendall, M. G. and Stuart, A. (1945). The Advanced Theory of Statistics, Vol. 1, pp. 382-393. London.

11. Korkmaz, S., Goksuluk, D. and Zararsiz, G. (2016). Package 'MVN'. URL: http://www.biosoft.hacettepe.edu.tr/MVN/

12. Kutner, Michael H., Nachtsheim, Christopher J., Neter, John and Li, William (2005). Applied Linear Statistical Models, Fifth Edition. McGraw-Hill Irwin.

13. Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, Second Edition. New York: Wiley.

14. Little, R. J. and Zhang, Nanhua (2011). Subsample Ignorable Likelihood for Regression Analysis with Missing Data. Journal of the Royal Statistical Society, Series C (Applied Statistics), 60, Part 4, 591-605.

15. Loan, Charles Van (2009). The Kronecker Product: A Product of the Times. Presented at the SIAM Conference on Applied Linear Algebra, Monterey, California, October 26, 2009. URL: https://www.cs.cornell.edu/cv/ResearchPDF/KPHist.pdf

16. Nadarajah, S. and Gupta, A. K. (2005). A Skewed Truncated Pearson Type VII Distribution. Journal of the Japan Statistical Society, Vol. 35, No. 1, pp. 61-71.

17. Nydick, Steven W. (2012). The Wishart and Inverse Wishart Distributions. May 25, 2012. URL: https://pdfs.semanticscholar.org/ac51/ee74af59c432d493da98bd950cc6f856a0ca.pdf

18. Papanicolaou, A. (2009). Taylor Approximation and the Delta Method. April 28, 2009.

19. Petersen, K. B. and Pedersen, M. S. (2012). The Matrix Cookbook. URL: http://matrixcookbook.com/

20. Sinsomboonthong, J. (2011). Jackknife Maximum Likelihood Estimates for a Bivariate Normal Distribution with Missing Data. Thailand Statistician, 9(2), 151-169.

21. Rawlings, John O., Pantula, Sastry G. and Dickey, David A. (1998). Applied Regression Analysis: A Research Tool, Second Edition. Springer, New York.

22. Rubin, D. B. (1976). Inference and Missing Data. Biometrika, Vol. 63, No. 3, pp. 581-592.

23. Sun, J., et al. (2010). Robust Mixture Clustering Using Pearson Type VII Distribution. Pattern Recognition Letters, doi:10.1016/j.patrec.2010.07.015.


Biographical Information

Yi Liu was born in Sichuan, China in 1963. She obtained her B.S. degree in Physics and M.S. degree in Theoretical Physics from Beijing Normal University in 1984 and 1987, respectively. She worked for the China Aerospace Engineering Consultation Center from 1987 to 2008.

She enrolled in the Department of Mathematics at the University of Texas at Arlington in 2012, and obtained her M.S. and Ph.D. degrees in Statistics from the University of Texas at Arlington in 2014 and 2017, respectively. She worked for Thomas J. Stephens & Associates as a biostatistician for clinical research from 2014 to 2015. She has worked for Sabre in data analytics since September 2015. Her current interests include big data.
