Stat331: Applied Linear Models

Helena S. Ven

05 Sept. 2019
Instructor: Leilei Zeng ([email protected])
Office hours: TTh, 4:00 – 5:00, at M3.4223

Topics: Modeling the relationship between a response variable and several explanatory variables via regression models.
1. Parameter estimation and inference
2. Confidence intervals
3. Prediction
4. Model assumption checking, methods dealing with violations
5. Variable selection
6. Applied issues: outliers, influential points, multi-collinearity (highly correlated explanatory variables)

Grading: 20% Assignments (4), 20% Midterms (2), 50% Final. Midterms are on 22 Oct., 12 Nov.
Textbook: Abraham B., Ledolter J. Introduction to Regression Modeling

Index

1 Simple Linear Regression
  1.1 Regression Model
  1.2 Simple Linear Regression
  1.3 Method of Least Squares
  1.4 Method of Maximum Likelihood
  1.5 Properties of Estimators

2 Confidence Intervals and Hypothesis Testing
  2.1 Pivotal Quantities
  2.2 Estimation of the Mean Response
  2.3 Prediction of a Single Response
  2.4 Analysis of Variance and F-test
  2.5 Coëfficient of Determination

3 Multiple Linear Regression
  3.1 Random Vectors
    3.1.1 Multilinear Regression
    3.1.2 Parameter Estimation
  3.2 Estimation of Variance
  3.3 Inference and Prediction
  3.4 Maximum Likelihood Estimation
  3.5 Geometric Interpretation of Least Squares
  3.6 ANOVA For Multilinear Regression
  3.7 Hypothesis Testing
    3.7.1 Testing on Subsets of Coëfficients
  3.8 Testing General Linear Hypothesis
  3.9 Interactions in Regression and Categorical Variables

4 Model Checking and Residual Analysis
  4.1 Errors and Residuals
    4.1.1 Checking Higher Ordered Term
    4.1.2 Checking Constant Variance
  4.2 Q-Q Plot and Transformation
  4.3 Data Transformation
    4.3.1 Box-Cox Power Transformation
  4.4 Weighted Least-Squares Regression
  4.5 Outliers and Influential Cases
    4.5.1 Influential Case

5 Model/Variable Selection
  5.1 Automated Methods
  5.2 All subsets regressions

Caput 1

Simple Linear Regression

Regression is a statistical technique for investigating and modeling the relationship between variables. The response variable (dependent variable, outcome) is the variable of interest, and we evaluate how it changes depending on the explanatory variable (independent variable, covariate, predictor), which affects the response. By convention, the explanatory variables are represented by x and the response variable by y.

1.1 Regression Model

This course only studies linear models. A general linear model on d explanatory variables can be written as

    y = β0 + βᵀx + ε = β0 + β1 x1 + ··· + βd xd + ε

where β0 + βᵀx is the deterministic part and ε is the random error.

β0, β are the regression parameters or coëfficients. Since the right-hand side is a linear function, y must be a continuous variable. There are ways to convert this to a binary variable, e.g. via the logistic function. The objective of linear regression is to determine the unknown parameters β0, β. Fortunately, all of the optimisation problems in this course are solvable analytically; we do not need e.g. gradient descent to find the unknown parameters.

1.2 Simple Linear Regression

The simplest case of linear regression is one with only one explanatory variable:

y = β0 + β1x + 

The 2D plot of (x, y) pairs is a scatter plot. Suppose from a random sample of n subjects we obtain pairs (xi, yi). Substituting xi, yi into the linear regression equation gives

yi = β0 + β1xi + i

Assumptions:
1. The expectation of a random error is 0: E[εi] = 0
2. The random error has equal variance: var εi = σ²
3. Independence: {ε1, . . . , εn} are independent.
4. Distributional assumption: εi ∼ N(0, σ²)

Hence the εi are IID normal. The normality assumption on εi implies the normality of the response Yi, because

    E[Yi] = E[β0 + β1xi + εi] = β0 + β1xi + E[εi] = β0 + β1xi

and also Yi ∼ N(β0 + β1xi, σ²).

What is the interpretation of β0 and β1?

1. β0 is the average outcome when xi = 0.

2. β1 is more important. β1 is the expected/average change in y when x moves by 1.

N.B. Unfortunately, in some cases the explanatory variable x is not allowed to take values close to 0, or moving x by 1 causes x to fall outside the range of the sample. Scaling is needed to produce a sensible interpretation.

1.3 Method of Least Squares

A common method of determining β0 and β1 is the least squares estimate, i.e. choose β0, β1 such that the sum of squared errors

    Sum of Squares of Errors := Σ_{i=1}^n εi²

is minimised.

The least squares estimate amounts to minimising

    arg min_{β0, β1} Σ_{i=1}^n (yi − (β0 + β1xi))²

Fortunately there is an analytic solution of this optimisation problem.

1.3.1 Proposition. The β0, β1 which attain the above minimum are

    β̂0 := ȳ − β̂1 x̄
    β̂1 := (Σ_i xi yi − n x̄ ȳ) / (Σ_i xi² − n x̄²)

Proof. Differentiating w.r.t. β0,

    ∂/∂β0 Σ_{i=1}^n (yi − (β0 + β1xi))² = 2 Σ_{i=1}^n (yi − (β0 + β1xi)) (−1)

Similarly for β1,

    ∂/∂β1 Σ_{i=1}^n (yi − (β0 + β1xi))² = 2 Σ_{i=1}^n (yi − (β0 + β1xi)) (−xi)

Setting these derivatives to zero at βj = β̂j,

    0 = Σ_{i=1}^n (yi − (β̂0 + β̂1xi))
    0 = Σ_{i=1}^n (yi − (β̂0 + β̂1xi)) xi

Expanding,

    Σ_i yi − n β̂0 − β̂1 Σ_i xi = 0
    Σ_i xi yi − β̂0 Σ_i xi − β̂1 Σ_i xi² = 0

Solving for β̂0 in the first equation,

    β̂0 = (1/n) Σ_i yi − β̂1 (1/n) Σ_i xi = ȳ − β̂1 x̄

Substituting this into the second equation,

    Σ_i xi yi − (ȳ − β̂1 x̄) n x̄ − β̂1 Σ_i xi² = 0

Hence

    β̂1 = (Σ_i xi yi − n x̄ ȳ) / (Σ_i xi² − n x̄²)

We can verify that the objective function is convex, so this is indeed the global minimum. ∎
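The closed-form estimates above can be checked numerically. A minimal sketch in Python, with made-up data; numpy's `polyfit` minimises the same sum of squared errors, so the two fits should agree:

```python
import numpy as np

# illustrative data (not from the course)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

xbar, ybar = x.mean(), y.mean()
# closed-form least-squares estimates from Proposition 1.3.1
b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
b0 = ybar - b1 * xbar

# reference fit: polyfit returns [slope, intercept] for degree 1
b1_ref, b0_ref = np.polyfit(x, y, 1)
```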

Definition

The sums of squares are

    Sxx := Σ_i (xi − x̄)²,    Sxy := Σ_i (xi − x̄)(yi − ȳ)

Observe that Sxx can be expanded:

    Sxx = Σ_{i=1}^n (xi − x̄)xi − x̄ Σ_{i=1}^n (xi − x̄) = Σ_{i=1}^n xi² − n x̄²

where the second sum vanishes. Likewise,

    Sxy = Σ_{i=1}^n (xi yi − xi ȳ − x̄ yi + x̄ ȳ)
        = Σ_{i=1}^n xi yi − n x̄ ȳ − n x̄ ȳ + n x̄ ȳ
        = Σ_{i=1}^n xi yi − n x̄ ȳ

Hence

    β̂1 = Sxy / Sxx

For each subject i,

    ŷi := β̂0 + β̂1 xi

is the estimated mean or fitted value for xi. The difference

    ri := Observed − Fitted = yi − ŷi

is the residual. From the minimisation conditions of β̂0, β̂1 we know that

    Σ_i ri = 0,    Σ_i ri xi = 0,    Σ_i ri ŷi = 0

The last condition follows from

    Σ_i ri ŷi = Σ_i ri (β̂0 + β̂1xi) = β̂0 Σ_i ri + β̂1 Σ_i ri xi = 0

What is the variance of the ri? Recall that the sample variance is an unbiased estimator of the population variance:

    s² := (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²

One degree of freedom is lost because x̄ is used as an estimator of the population mean. Another way to view this: if we know x1, . . . , xn−1 and x̄, then xn is automatically known.

We must recognise that each Yi has a different distribution, namely Yi ∼ N(β0 + β1xi, σ²). The mean square error (MSE) is an estimator of σ²:

    S² := (1/(n − 2)) Σ_{i=1}^n (Yi − (β̃0 + β̃1xi))² = (1/(n − 2)) Σ_{i=1}^n Ri²

2 degrees of freedom are lost since we are estimating β0 and β1 using β̃0, β̃1.

1.3.2 Proposition. The MSE S² is an unbiased estimator of σ².

1.4 Method of Maximum Likelihood

Suppose there is a random variable Y with density function f(y; θ) or mass function P(Y = y; θ). We infer θ from Y. If we observe y, the likelihood function is

    L(θ) = P(Y = y; θ) if Y is discrete,    L(θ) = f(y; θ) if Y is continuous

In essence we want to maximise P(θ|y). The maximum likelihood estimator is

    θ̂ := arg max_θ L(θ)

It is often easier to work with the logarithmic likelihood, since L(θ) involves chains of multiplications:

    ℓ(θ) := log L(θ)

The score function is the derivative of ℓ:

    s(θ) := ∂ℓ(θ)/∂θ

In simple linear regression, we assume each sample Yi is independent with Yi ∼ N(β0 + β1xi, σ²). The joint likelihood of Y1, . . . , Yn is

    L(β0, β1, σ²) = Π_{i=1}^n f(yi; β0 + β1xi, σ²)
                  = Π_{i=1}^n (1/√(2πσ²)) exp(−(1/(2σ²))(yi − β0 − β1xi)²)

The logarithmic likelihood is

    ℓ(β0, β1, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1xi)²

The scores of β0, β1 are

    ∂ℓ/∂β0 = (1/σ²) Σ_{i=1}^n (yi − β0 − β1xi)
    ∂ℓ/∂β1 = (1/σ²) Σ_{i=1}^n (yi − β0 − β1xi) xi

Setting these to zero yields the same equations as least squares, so the maximum-likelihood estimators β̂0, β̂1 coincide with the least-squares estimators. For σ²,

    ∂ℓ/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (yi − β0 − β1xi)²

Setting this to 0 at σ² = σ̂² gives

    σ̂² = (1/n) Σ_{i=1}^n (yi − β̂0 − β̂1xi)²

This is slightly different from the unbiased estimator, which has a divisor of n − 2.
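The divisor-n versus divisor-(n − 2) distinction can be seen directly; a sketch with toy data (the two estimates differ by exactly the factor (n − 2)/n):

```python
import numpy as np

# illustrative data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.9, 5.3, 6.8, 9.4, 10.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)          # least-squares = ML estimates of the line
resid = y - (b0 + b1 * x)

sigma2_mle = np.sum(resid**2) / n     # ML estimator: divisor n (biased)
s2 = np.sum(resid**2) / (n - 2)       # MSE: divisor n - 2 (unbiased)
```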

1.5 Properties of Estimators

The estimators above only give a point estimate. However, we may want insight about the distribution of the estimated quantities, e.g. a 95% confidence interval. The estimators (LSE or MLE) are linear functions of the yi's:

    β̃1 = Sxy / Sxx = Σ_i (xi − x̄)(yi − ȳ) / Σ_i (xi − x̄)²
       = [Σ_i (xi − x̄) yi − ȳ Σ_i (xi − x̄)] / Σ_i (xi − x̄)²
       = Σ_i (xi − x̄) yi / Σ_i (xi − x̄)²

Since each Yi has a normal distribution and the Yi are independent, β̃1 has a normal distribution. Similarly we know that

    β̃0 = Ȳ − β̃1 x̄

has a normal distribution.

Note. The xi's are always fixed (i.e. not random). Recall that if X, Y are independent, then cov(X, Y) = 0. The converse holds when X, Y are jointly normal.

1.5.1 Theorem. The point estimators have the distributions

    β̃1 ∼ N(β1, σ² / Σ_i (xi − x̄)²)
    β̃0 ∼ N(β0, (1/n + x̄² / Σ_i (xi − x̄)²) σ²)

Proof. We know that β̃1, β̃0 must be normally distributed, so it only remains to determine their expectations and variances. Since E[Yi] = β0 + β1xi,

    E[β̃1] = Σ_i (xi − x̄) E[Yi] / Σ_i (xi − x̄)²
          = Σ_i (xi − x̄)(β0 + β1xi) / Σ_i (xi − x̄)²
          = [β0 Σ_i (xi − x̄) + β1 Σ_i (xi − x̄) xi] / Σ_i (xi − x̄)²
          = β1 Σ_i (xi − x̄)(xi − x̄) / Σ_i (xi − x̄)²
          = β1

using Σ_i (xi − x̄) = 0 twice. Under independence of the Yi's, the variance of the sum is the sum of the variances:

    var β̃1 = Σ_{i=1}^n (xi − x̄)² var Yi / (Σ_{i=1}^n (xi − x̄)²)²
           = σ² Σ_{i=1}^n (xi − x̄)² / (Σ_{i=1}^n (xi − x̄)²)²
           = σ² / Σ_{i=1}^n (xi − x̄)²

For β̃0,

    E[β̃0] = E[Ȳ − β̃1 x̄]
          = (1/n) Σ_i E[Yi] − x̄ E[β̃1]
          = (1/n) Σ_i (β0 + β1xi) − x̄ β1
          = β0 + β1x̄ − x̄β1
          = β0

The variance of β̃0 is

    var β̃0 = var(Ȳ − β̃1 x̄) = var Ȳ − 2x̄ cov(Ȳ, β̃1) + x̄² var β̃1

The covariance term is 0. To see this,

    cov(Ȳ, β̃1) = cov((1/n) Σ_i Yi, Σ_i (xi − x̄)Yi / Σ_i (xi − x̄)²)
               = (1/(n Σ_i (xi − x̄)²)) Σ_{i,j} cov(Yi, (xj − x̄)Yj)
               = (1/(n Σ_i (xi − x̄)²)) Σ_i (xi − x̄) var Yi        [cov(Yi, Yj) = 0 for i ≠ j]
               = (σ² / (n Σ_i (xi − x̄)²)) Σ_i (xi − x̄)
               = 0

Hence

    var β̃0 = σ²/n + x̄² σ² / Σ_{i=1}^n (xi − x̄)²

as required. ∎

1.5.2 Proposition.

    cov(β̃0, β̃1) = − σ² x̄ / Σ_i (xi − x̄)²

Proof. (Incomplete)

Caput 2

Confidence Intervals and Hypothesis Testing

2.1 Pivotal Quantities

In all of the distributions above, the parameter σ² is unknown. The idea is to use the estimator

    σ̃² = S² = (1/(n − 2)) Σ_i Ri²

The standard deviation of β̃1 is

    √(var β̃1) = √(σ² / Σ_i (xi − x̄)²)

The standard error is an estimator of β̃1's standard deviation:

    SE β̃1 := √(S² / Σ_i (xi − x̄)²)

Given the distribution

    (β̃1 − β1) / √(σ² / Σ_i (xi − x̄)²) ∼ N(0, 1),

replacing σ² by S² gives the t-distributed pivotal quantity

    (β̃1 − β1) / SE β̃1 = (β̃1 − β1) / √(S² / Σ_i (xi − x̄)²) ∼ t_{n−2}

Similarly,

    (β̃0 − β0) / SE β̃0 ∼ t_{n−2}

(the same S², with n − 2 degrees of freedom, is used to estimate σ²).

Definition

The density function for t_k is

    f(x) := (Γ((k+1)/2) / (√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},    (x ∈ ℝ)

When k → ∞, t_k converges in distribution to N(0, 1). Using this distribution we can construct confidence intervals and conduct significance tests on β1. If

    P(−t_{α/2} ≤ T ≤ +t_{α/2}) = 1 − α,    where T ∼ t_{n−2},

we can build the confidence interval

    P(β̃1 − t_{α/2} SE β̃1 ≤ β1 ≤ β̃1 + t_{α/2} SE β̃1) = 1 − α

The confidence interval is

    β̂1 ± t_{α/2} SE β̂1

To test whether the slope β1 has a specific value, say β∗, we use a t-test. The null hypothesis is

    H0 : β1 = β∗

and the alternative hypothesis is

    Ha : β1 ≠ β∗

We can test the null hypothesis with the t statistic:

    T = (β̃1 − β∗) / SE β̃1

Set the significance level α := 0.05; if |t| > t_{α/2, n−2}, the null hypothesis is rejected. The p-value is, given H0, the probability of observing values at least as extreme as the observed statistic t:

    p = P(|T| ≥ |t|)

A common usage of hypothesis testing is to establish a linear relationship between two variables. This procedure can be modified to give a one-sided test, where we test H0 : β1 ≤ β∗ against Ha : β1 > β∗. This should only be carried out if we have theoretical grounds to believe β1 < β∗ is not possible.

The variance of β̃1 can be reduced by spreading out the xi's, thus giving a sharper estimate.
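A sketch of the slope confidence interval and t-test, with illustrative data and α = 0.05; `scipy.stats.t` supplies the quantiles:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative data
y = np.array([1.1, 2.3, 2.8, 4.2, 4.9, 6.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid**2) / (n - 2)                # MSE
sxx = np.sum((x - x.mean())**2)
se_b1 = np.sqrt(s2 / sxx)                      # standard error of the slope

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)  # 95% CI for beta1

t_stat = b1 / se_b1                            # test of H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
```

The same standard error and p-value are what `scipy.stats.linregress` reports for the slope.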

2.2 Estimation of the Mean Response

We can also consider the estimate of

    E[y|x0] = μ_{x0} = β0 + β1x0

We estimate this with

    μ̃_{x0} = β̃0 + β̃1x0

The variance is

    var μ̃_{x0} = var(β̃0 + β̃1x0)
               = var(Ȳ − β̃1x̄ + β̃1x0)
               = var(Ȳ + β̃1(x0 − x̄))
               = var Ȳ + (x0 − x̄)² var β̃1        [cov(Ȳ, β̃1) = 0]
               = σ²/n + (x0 − x̄)² σ² / Sxx
               = (1/n + (x0 − x̄)²/Sxx) σ²

As always this gives rise to the pivotal quantities:

    (μ̃_{x0} − μ_{x0}) / √(σ² (1/n + (x0 − x̄)²/Sxx)) ∼ N(0, 1)
    (μ̃_{x0} − μ_{x0}) / √(S² (1/n + (x0 − x̄)²/Sxx)) ∼ t_{n−2}

10 2.3 Prediction of a Single Response

Suppose we have a sample at x_p and we wish to predict the corresponding Y_p using

    Ỹ_p := β̃0 + β̃1 x_p

This is unbiased, since

    E[Ỹ_p] = E[β̃0] + E[β̃1] x_p = β0 + β1 x_p = E[Y_p]

Moreover, the new error ε_p cannot be correlated with any other εi, so cov(ε_p, β̃i) = 0. Hence

    var(Y_p − Ỹ_p) = var(ε_p + β0 + β1x_p − β̃0 − β̃1x_p)
                   = var(ε_p) + var(β̃0 + β̃1x_p)
                   = σ² + var(μ̃_{x_p})
                   = (1 + 1/n + (x_p − x̄)²/Sxx) σ²

The SE of Y_p − Ỹ_p is the same expression with σ² replaced by S².
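A sketch of the prediction interval that follows from this variance, with illustrative data, x_p, and α = 0.05:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid**2) / (n - 2)           # MSE
sxx = np.sum((x - x.mean())**2)

xp = 3.5
yp_hat = b0 + b1 * xp
# var(Y_p - Y~_p) = sigma^2 (1 + 1/n + (x_p - xbar)^2 / Sxx), with S^2 for sigma^2
se_pred = np.sqrt(s2 * (1 + 1 / n + (xp - x.mean())**2 / sxx))
tcrit = stats.t.ppf(0.975, df=n - 2)
pi = (yp_hat - tcrit * se_pred, yp_hat + tcrit * se_pred)
```

The interval is wider than the confidence interval for the mean response at the same x_p, because of the extra "1 +" in the variance.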

2.4 Analysis of Variance and F-test

The analysis of variance (ANOVA) approach is based on the partitioning of total variability in responses.

Definition

The total sum of squares (SST) is

    SST := Σ_{i=1}^n (Yi − Ȳ)² = Syy

The Error/Residual Sum of Squares (SSE) is

    SSE := Σ_{i=1}^n (Yi − Ỹi)²

The total sum of squares can be partitioned by

    SST = Σ_{i=1}^n ((Yi − Ỹi) + (Ỹi − Ȳ))²
        = Σ_{i=1}^n (Yi − Ỹi)² + 2 Σ_{i=1}^n Ri (Ỹi − Ȳ) + Σ_{i=1}^n (Ỹi − Ȳ)²
        = Σ_{i=1}^n (Yi − Ỹi)² + Σ_{i=1}^n (Ỹi − Ȳ)²        [the cross term is 0]
        = SSE + SSR

• SST has n − 1 degrees of freedom; one is lost to Ȳ.
• SSE has n − 2 degrees of freedom; two are lost to β̃0, β̃1.
• SSR has 1 degree of freedom. Although there are n terms, all fitted values are from the same line, since

    Σ_i (Ỹi − Ȳ) = Σ_i (Ȳ − β̃1x̄ + β̃1xi − Ȳ) = β̃1 Σ_i (xi − x̄) = 0

Also we can re-write SSR as

    SSR = Σ_i (Ỹi − Ȳ)² = β̃1² Σ_i (xi − x̄)² = β̃1² Sxx

2.4.1 Proposition.

    SSR = Sxy² / Sxx = β̃1² Sxx
    SSE = Syy − Sxy² / Sxx

Proof. We have

    SSR = β̃1² Sxx = (Sxy/Sxx)² Sxx = Sxy² / Sxx

and

    SSE = Σ_{i=1}^n (Yi − Ỹi)²
        = Σ_{i=1}^n (Yi − β̃0 − β̃1xi)²                         [β̃0 = Ȳ − β̃1x̄]
        = Σ_{i=1}^n (Yi − Ȳ − β̃1(xi − x̄))²
        = Σ_{i=1}^n (Yi − Ȳ)² − 2β̃1 Σ_{i=1}^n (Yi − Ȳ)(xi − x̄) + β̃1² Σ_{i=1}^n (xi − x̄)²
        = Syy − 2 (Sxy/Sxx) Sxy + (Sxy²/Sxx²) Sxx
        = Syy − Sxy² / Sxx   ∎

We can summarise this situation using an ANOVA table:

    Source      SS    dof      MS (mean square)
    Regression  SSR   1        MSR = SSR/1 = SSR
    Error       SSE   n − 2    MSE = SSE/(n − 2)
    Total       SST   n − 1

In previous sections it was established that E[MSE] = σ².

2.4.2 Proposition.

    E[MSR] = σ² + β1² Sxx

Proof.

    E[MSR] = E[SSR] = E[β̃1² Sxx]
           = Sxx (var β̃1 + E[β̃1]²)
           = Sxx (σ²/Sxx + β1²)
           = σ² + Sxx β1²   ∎

If β1 ≠ 0, then we can expect the difference between MSR and MSE to be large. An immediate question is: how large? Under H0 : β1 = 0, it can be shown that

    F := MSR / MSE ∼ F_{1, n−2}

which is an F-distribution with degrees of freedom 1 and n − 2.

Proof. First we show that SST/σ² ∼ χ²_{n−1}. Expanding,

    SST = Σ_i ((Yi − β0) − (Ȳ − β0))²
        = Σ_i (Yi − β0)² − 2 Σ_i (Yi − β0)(Ȳ − β0) + Σ_i (Ȳ − β0)²
        = Σ_i (Yi − β0)² − 2n(Ȳ − β0)² + n(Ȳ − β0)²
        = Σ_i (Yi − β0)² − n(Ȳ − β0)²

Dividing both sides by σ² and rearranging,

    SST/σ² + n(Ȳ − β0)²/σ² = Σ_i (Yi − β0)²/σ²

Under H0, Yi ∼ N(β0, σ²) IID, and the sum of squares of independent standard normals is χ². Since

    (Yi − β0)/σ ∼ N(0, 1),    (Yi − β0)²/σ² ∼ χ²_1,

we have

    Σ_i (Yi − β0)²/σ² ∼ χ²_n

Similarly, Ȳ ∼ N(β0, σ²/n), so

    (Ȳ − β0)²/(σ²/n) ∼ χ²_1

The two terms on the left-hand side can be shown to be independent (omitted), so by Cochran's theorem,

    SST/σ² ∼ χ²_{n−1}

Under H0 we know β̃1 ∼ N(0, σ²/Sxx), so

    SSR/σ² = β̃1² Sxx/σ² ∼ χ²_1

Since

    SST/σ² = SSE/σ² + SSR/σ²

with SST/σ² ∼ χ²_{n−1} and SSR/σ² ∼ χ²_1, Cochran's theorem again gives

    SSE/σ² ∼ χ²_{n−2}

with SSE/σ² and SSR/σ² independent. The F distribution is defined as the ratio of two independent χ² variables, each divided by its degrees of freedom, so

    F = ((SSR/σ²)/1) / ((SSE/σ²)/(n − 2)) = MSR/MSE ∼ F_{1, n−2}   ∎

The F-test can be used to test the hypothesis H0 : β1 = 0: reject H0 if F > F_{α, 1, n−2}. The F-test for H0 : β1 = 0 is equivalent to the t-test for β1.

2.4.3 Proposition. If Z ∼ t_p, then Z² ∼ F_{1, p}.

The definitions of SST, SSE, SSR are general and apply to multiple linear regression.
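The equivalence F = t² for the slope test can be checked numerically; a sketch with toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # illustrative data
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
sse = np.sum((y - fitted)**2)
ssr = np.sum((fitted - y.mean())**2)

F = (ssr / 1) / (sse / (n - 2))                                # MSR / MSE
t = b1 / np.sqrt((sse / (n - 2)) / np.sum((x - x.mean())**2))  # slope t statistic
```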

2.5 Coëfficient of Determination

The F-test can be used to test the significance of a linear model, i.e. its goodness-of-fit.

Definition

The coëfficient of determination is

    R² := SSR / SST,    0 ≤ R² ≤ 1

R² represents the proportion of the total variation in the responses that can be explained by the linear model. The larger R² is, the better the linear model fits the data. When R² = 1, all data lie on a straight line. R² = 0 when β̂1 = 0 while the data still have large variation.
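In simple linear regression, R² coincides with the squared sample correlation between x and y; a sketch with illustrative data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative data
y = np.array([1.0, 2.8, 5.1, 7.2, 8.9])

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
sst = np.sum((y - y.mean())**2)
ssr = np.sum((fitted - y.mean())**2)
r2 = ssr / sst                             # coefficient of determination
```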

Caput 3

Multiple Linear Regression

3.1 Random Vectors

For any set of random variables V1,...,Vn,

V := [V1, . . . , Vn]ᵀ is a random vector.

Definition

If V is a random vector, the covariance matrix of V is

    var V := E[(V − E[V])(V − E[V])ᵀ] = [cov(Vi, Vj)]_{i,j}

The covariance matrix is symmetric. If var V is diagonal, then the elements are uncorrelated (but not necessarily independent). Let V be a random vector and Σ its covariance matrix.

3.1.1 Proposition. Let A, b be a matrix and a column vector of constants. Then

    E[AV + b] = A E[V] + b    and    var(AV + b) = A (var V) Aᵀ

Definition

A vector V has a multivariate normal distribution N(μ, Σ) if its density function has the form

    f(v) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(v − μ)ᵀ Σ⁻¹ (v − μ))

where μ := E[V], Σ := var V.

3.1.2 Proposition. If V ∼ N(μ, Σ):

• Its elements Vi ∼ N(μi, Σ_{i,i}).
• AV + b ∼ N(Aμ + b, AΣAᵀ).
• If we partition V = [V1; V2] (with μ, Σ partitioned conformably), then V1 ∼ N(μ1, Σ_{1,1}) and V2 ∼ N(μ2, Σ_{2,2}).
• If Ui ∼ N(μi, σi²) are independent, then U ∼ N(μ, Σ) with Σ = diag(σ1², . . . , σn²).
• If Σ is diagonal, then the Vi's are independent.
• Let U := AV, W := BV. Then U, W are independent if and only if AΣBᵀ = 0.

3.1.1 Multilinear Regression

In multilinear regression, the samples are generated by

    Yi = β0 + Σ_{j=1}^p βj x_{i,j} + εi

where εi ∼ N(0, σ²) is the random error term. In matrix form,

    Y = Xβ + ε

where

    X = [ 1  x_{1,1}  ···  x_{1,p} ]
        [ 1  x_{2,1}  ···  x_{2,p} ]
        [ ⋮     ⋮      ⋱      ⋮    ]
        [ 1  x_{n,1}  ···  x_{n,p} ],    β = [β0, β1, . . . , βp]ᵀ,    ε = [ε1, . . . , εn]ᵀ

• X is the design matrix
• β is the parameter vector
• ε is the error vector
• Y is the response vector

The covariance matrix of ε is the diagonal matrix

    var ε = σ²I

so ε ∼ N(0, σ²I), and Y ∼ N(Xβ, σ²I). Now we are ready to derive our estimators.

3.1.2 Parameter Estimation

Using the least squares method, we want to find β̃ such that

    β̃ := arg min_β ‖Y − Xβ‖² = arg min_β Σ_{i=1}^n (Yi − (β0 + β1x_{i,1} + ··· + βp x_{i,p}))²

where ‖Y − Xβ‖² = (Y − Xβ)ᵀ(Y − Xβ). Differentiating w.r.t. β,

    ∂/∂β (Y − Xβ)ᵀ(Y − Xβ) = ∂/∂β (YᵀY − YᵀXβ − βᵀXᵀY + βᵀXᵀXβ)
                            = −XᵀY − XᵀY + (XᵀX + XᵀX)β

where we have used the derivative of the quadratic form

    ∂/∂β (βᵀAβ) = (A + Aᵀ)β

Setting the derivative to 0 gives the normal equations XᵀXβ̃ = XᵀY, so

    β̃ = (XᵀX)⁻¹XᵀY

is the least squares estimate of β. This raises the question of when XᵀX is invertible: XᵀX is invertible when X has full column rank, which requires n ≥ p + 1. (If n = p + 1 the model is a perfect fit.)

The fitted value vector is

    Ỹ := Xβ̃ = X(XᵀX)⁻¹Xᵀ Y = HY

where H := X(XᵀX)⁻¹Xᵀ is the hat matrix, which has some notable properties:

3.1.3 Proposition.
• H is symmetric.
• H² = H (idempotence)

• H is a projection matrix.

Proof.

    HH = X(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = H   ∎

A symmetric idempotent matrix is a projection matrix.
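The symmetry and idempotence of H can be checked numerically; a sketch with a random design matrix for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3
# intercept column plus p random covariates (illustrative)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
```

As a projection onto the (p + 1)-dimensional column space of X, H also has trace p + 1.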

3.1.4 Properties of LSE.
1. Unbiasedness: E[β̃] = β
2. var β̃ = σ²(XᵀX)⁻¹
3. β̃ ∼ N(β, σ²(XᵀX)⁻¹)

Proof.
1. E[β̃] = E[(XᵀX)⁻¹XᵀY] = (XᵀX)⁻¹XᵀXβ = β
2. var β̃ = (XᵀX)⁻¹Xᵀ(var Y)X(XᵀX)⁻¹ = (XᵀX)⁻¹Xᵀ(σ²I)X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹
3. β̃ is a linear combination of Y, so it must be normal. ∎

The residuals are

    R = Y − Ỹ = (I − H)Y

Notice that

    XᵀR = [Σ_{i=1}^n Ri,  Σ_{i=1}^n Ri x_{i,1},  . . . ,  Σ_{i=1}^n Ri x_{i,p}]ᵀ

and

    XᵀR = Xᵀ(I − H)Y = XᵀY − XᵀX(XᵀX)⁻¹XᵀY = 0

Thus Σ_{i=1}^n Ri x_{i,j} = 0 for any j, and the sum of the residuals is 0.

3.1.5 Proposition. Σ_{i=1}^n Ri Ỹi = RᵀỸ = 0

3.1.6 Proposition. β̃ and R are independent.

Proof. β̃ and R are linear functions of Y, which is normal; hence β̃, R are jointly normally distributed, and if they are uncorrelated then they are independent. We have

    E[β̃] = β,    E[R] = E[Y] − X E[β̃] = 0

Define Ξ := (XᵀX)⁻¹Xᵀ, so that β̃ = ΞY, ΞX = I, and H = XΞ. Then

    cov(β̃, R) = cov(ΞY, (I − H)Y)
              = Ξ (var Y)(I − H)ᵀ
              = σ² Ξ(I − H)
              = σ² (Ξ − ΞXΞ)
              = σ² (Ξ − Ξ)
              = 0   ∎
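The relations XᵀR = 0 and RᵀỸ = 0 hold for any least-squares fit and can be verified numerically; a sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 2
# design matrix with an intercept column and p random covariates (toy data)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# least-squares fit: beta_hat solves the normal equations
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
r = y - fitted

# X'R = 0, so the residuals sum to 0 and are orthogonal to the fitted values
```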

3.2 Estimation of Variance

3.2.1 Theorem.

    σ̃² := (1/(n − p − 1)) Σ_{i=1}^n Ri² = SSE/(n − p − 1) = MSE

is an unbiased estimator of σ².

Proof. Notice that E[R] = E[Y] − E[Xβ̃] = Xβ − Xβ = 0, so var R = E[RRᵀ]. Then

    E[‖R‖²] = E[tr(RRᵀ)] = tr E[RRᵀ] = tr var R
            = tr var((I − H)Y)
            = tr((I − H)(var Y)(I − H)ᵀ)                  [var Y = σ²I]
            = σ² tr((I − H)²)
            = σ² tr(I_n − H)                              [I − H is idempotent]
            = σ² (n − tr(X(XᵀX)⁻¹Xᵀ))
            = σ² (n − tr((XᵀX)⁻¹XᵀX))                     [tr(AB) = tr(BA)]
            = σ² (n − tr I_{p+1})
            = σ² (n − p − 1)

so

    E[‖R‖² / (n − p − 1)] = σ²   ∎

19 3.3 Inference and Prediction

• β0 can be interpreted as the mean response of Y when all input variables x = 0.
• βj: the change in mean response when xj increases by 1 unit, while holding all other explanatory variables fixed.

The hypothesis H0 : βj = 0 says "explanatory variable xj is not significant". To conduct hypothesis tests, we use the pivotal quantity

    (β̃j − βj) / √(σ̃² ((XᵀX)⁻¹)_{j,j}) = (β̃j − βj) / √(ṽar β̃j) ∼ t_{n−p−1}

The standard error of β̃j is

    SE β̃j = √(ṽar β̃j)

We can also conduct inference on a linear combination of coëfficients. Let c be a vector of constants. The scalar cᵀβ̃ has a normal distribution. In particular,

    cᵀβ̃ ∼ N(cᵀβ, cᵀ(σ²(XᵀX)⁻¹)c)

so we have the pivotal quantity

    (cᵀβ̃ − cᵀβ) / √(cᵀ(σ̃²(XᵀX)⁻¹)c) ∼ t_{n−p−1}

For example, to estimate the mean response at x_h := [1, x1, . . . , xp]ᵀ, set

    μ̃_h := x_hᵀ β̃

Using x_h = (0, 1, −1, 0, . . . , 0)ᵀ, we can test β1 − β2 = 0.

To predict for a new subject with covariates x_h:

• Prediction: ỹ_h = x_hᵀ β̃ = μ̃_h
• Prediction error: y_h − ỹ_h. It has mean zero:

    E[y_h − ỹ_h] = E[x_hᵀβ + ε_h − x_hᵀβ̃] = x_hᵀβ + 0 − x_hᵀβ = 0

• Prediction variance:

    var(y_h − ỹ_h) = var(ε_h) + var(x_hᵀβ̃)
                   = σ² + x_hᵀ(var β̃)x_h
                   = σ² (1 + x_hᵀ(XᵀX)⁻¹x_h)

• Hence, we have the pivotal quantity

    (y_h − ỹ_h) / (σ̃ √(1 + x_hᵀ(XᵀX)⁻¹x_h)) ∼ t_{n−p−1}

20 3.4 Maximum Likelihood Estimation

The likelihood of sampling Y is

    L = f(Y) = (2π)^{−n/2} |σ²I|^{−1/2} exp(−(1/2)(Y − Xβ)ᵀ(σ²I)⁻¹(Y − Xβ))

A parameter β̂ which maximises L is also the maximum a-posteriori estimator of β given a uniform prior. The logarithmic likelihood is

    ℓ = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²))(Y − Xβ)ᵀ(Y − Xβ)

The score is

    ∂ℓ/∂β = −(1/(2σ²)) ∂/∂β (Y − Xβ)ᵀ(Y − Xβ)

Solving ∂ℓ/∂β = 0 gives the same estimator as least squares.

3.5 Geometric Interpretation of Least Squares

Let Y be the response vector and x_j := [x_{1,j}, . . . , x_{n,j}]ᵀ. Then the design matrix is

    X = [1, x_1, . . . , x_p]

Y can be viewed as a point in ℝⁿ; img X (the set of linear combinations of columns of X) is a subspace of ℝⁿ, and Xβ is one such linear combination. The difference Y − Xβ is the error term we wish to minimise. The projection of Y onto img X is Xβ̃, and the perpendicular component

    R := Y − Xβ̃

is orthogonal to every vector in img X, so XᵀR = 0. From this we have

    XᵀY = XᵀXβ̃

from which we can recover the formula β̃ = (XᵀX)⁻¹XᵀY.

[Figure: Y ∈ ℝⁿ decomposed as the projection Ỹ onto the subspace img X plus the orthogonal residual R.]

3.6 ANOVA For Multilinear Regression

Consider a multiple linear model

    yi = β0 + β1 x_{i,1} + ··· + βp x_{i,p} + εi,    (i = 1, . . . , n)

We still have 3 sources of variation:

• Total: SST = Σ_{i=1}^n (Yi − Ȳ)²
• Regression: SSR = Σ_{i=1}^n (Ỹi − Ȳ)²
• Error: SSE = Σ_{i=1}^n (Yi − Ỹi)²

The sums of squares and degrees of freedom add as before; in particular

• SST = SSE + SSR
• dof SST = dof SSE + dof SSR, with

    dof SST = n − 1,    dof SSE = n − p − 1,    dof SSR = p

We can summarise this situation using an ANOVA table:

    Source      SS    dof          MS (mean square)
    Regression  SSR   p            MSR = SSR/p
    Error       SSE   n − p − 1    MSE = SSE/(n − p − 1)
    Total       SST   n − 1

Recall that

    σ̃² = SSE/(n − p − 1) = MSE

Suppose we want to evaluate the "overall" significance of the multiple regression model, i.e. we want to test

H0 : β1 = ··· = βp = 0

Ha : ∃j : βj 6= 0

If H0 is true, then SSR should be a lot smaller than SSE, hence we can conduct an F-test. In the p = 1 case the F-test is equivalent to the t-test on β1, but in higher dimensions the interpretations of the two tests diverge.

3.6.1 Proposition. Under H0,

    F = MSR/MSE = (SSR/p) / (SSE/(n − p − 1)) ∼ F_{p, n−p−1}

Proof. Under H0,
• SST/σ² ∼ χ²_{n−1}
• SSE/σ² ∼ χ²_{n−p−1}
• Via Cochran's theorem, since SST/σ² = SSE/σ² + SSR/σ², we get SSR/σ² ∼ χ²_p
• SSE, SSR are independent.
• Hence by the definition of F_{p, n−p−1}, F ∼ F_{p, n−p−1}. ∎

If we reject H0, we conclude that at least one of the regression coëfficients is non-zero. If we fail to reject H0, we do not have enough evidence to conclude that any of the βj is important. The coëfficient of determination is still

    R² = SSR/SST = 1 − SSE/SST

which still represents the proportion of variation explained. R² always increases as more explanatory variables are added. Because R² increases in the number of covariates, it alone cannot be used for a meaningful comparison of models with very different numbers of explanatory variables; a bigger R² may not always mean a "better" model. The adjusted R², which is included in the output of R's lm command, is

    R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1)

For R²_adj, higher is better. However, it loses the interpretation as the proportion of variation explained by the model.
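The monotonicity of R² and the adjusted-R² correction can be sketched with simulated data and a deliberately irrelevant extra covariate:

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SSE/SST for a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta)**2)
    sst = np.sum((y - y.mean())**2)
    return 1 - sse / sst

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x1])              # p = 1
X_big = np.column_stack([X_small, rng.normal(size=n)])   # adds a pure-noise column

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)      # never smaller than r2_small

def adj_r2(r2, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```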

3.7 Hypothesis Testing

We may wish to test hypotheses of the form β3 = β4 = 0, for the purpose of eliminating explanatory variables. Another type of test is of the form H0 : β1 = β2 = β3, β4 = 0.

    Hypothesis                    Test
    H0 : βj = β∗                  t-test
    H0 : β1 = ··· = βp = 0        F-test
    H0 : βi = βj = 0              F-test

3.7.1 Testing on Subsets of Coëfficients

Consider the design matrix X, partitioned into two parts:

    X = [1, x_1, . . . , x_{p1} | x_{p1+1}, . . . , x_{p1+p2}] = [X1 | X2]

The full model can be written as (partitioning β into β1, β2)

    (MF): Y = Xβ + ε = X1β1 + X2β2 + ε

The reduced model is

    (MR): Y = X1β1 + ε

Under H0 : β2 = 0, the term X2β2 becomes 0.

• β has p + 1 parameters (where p = p1 + p2)
• β1 has p1 + 1 parameters
• β2 has p2 parameters.

When we fit models MF, MR, we obtain two sets of residuals:

    (MF): Y = ỸF + RF
    (MR): Y = ỸR + RR

The residual in the reduced model is

    RR = (ỸF − ỸR) + RF

Hence the response vector can be decomposed into

    Y = ỸR + (ỸF − ỸR) + RF

and we have

• RF ⊥ ỸR
• RF ⊥ ỸF

since LSE projects Y onto the subspace img X, and img X ⊇ img X1, so RF is orthogonal to every vector in img X, and both ỸF and ỸR lie in it. Being orthogonal to ỸR and ỸF, RF is also orthogonal to their difference:

    RF ⊥ (ỸF − ỸR)

Moreover, since RR ⊥ img X1 and ỸR ∈ img X1,

    0 = ỸRᵀ RR = ỸRᵀ (ỸF − ỸR + RF) = ỸRᵀ (ỸF − ỸR) + ỸRᵀ RF

so

    ỸR ⊥ (ỸR − ỸF)

Recall the Pythagorean identity: if a_1, . . . , a_k are orthogonal,

    ‖Σ_{i=1}^k a_i‖² = Σ_{i=1}^k ‖a_i‖²

Since Y is a sum of 3 orthogonal vectors,

    ‖Y‖² = ‖ỸR‖² + ‖ỸF − ỸR‖² + ‖RF‖²

The resulting ANOVA table is (note that this differs from the earlier ANOVA tables):

    Source    SoS              dof
    X1        ‖ỸR‖²            p1 + 1
    X2|X1     ‖ỸF − ỸR‖²       p2
    Error     ‖RF‖²            n − p − 1
    Total     ‖Y‖²             n

In this table,

• ‖Y‖² is the unadjusted SoS (i.e. the second raw moment).
• ‖RF‖² = SSEF is the SSE for the full model.
• X2|X1 denotes the contribution to the total SoS by X2 after adjusting for X1.
• Invoking the Pythagorean identity on RR = (ỸF − ỸR) + RF,

    ‖ỸF − ỸR‖² = ‖RR‖² − ‖RF‖² = SSER − SSEF

Under H0 : β2 = 0, we have the pivotal quantity

    F := (‖ỸF − ỸR‖² / p2) / (‖RF‖² / (n − p − 1)) = ((SSER − SSEF)/p2) / (SSEF/(n − p − 1)) ∼ F_{p2, n−p−1}

where SSER − SSEF is the extra sum of squares. Thus we reject H0 when F > F_{α, p2, n−p−1}.

A null hypothesis of the following form is a special case:

    H0 : β1 = ··· = βp = 0

Under this H0, the reduced model is

    yi = β0 + εi

so β̂0 = ȳ. Moreover,

    SSER = Σ_i (Yi − Ŷi)² = Σ_i (Yi − Ȳ)² = SST

The pivotal quantity is

    F := ((SST − SSEF)/p) / (SSEF/(n − p − 1)) = (SSRF/p) / (SSEF/(n − p − 1)) = MSRF/MSEF ∼ F_{p, n−p−1}

which is equivalent to the F-test of overall significance.
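The extra-sum-of-squares F-test can be sketched directly from the formula above (simulated data; x2, x3 are constructed to be irrelevant, and `scipy.stats.f` supplies the p-value):

```python
import numpy as np
from scipy import stats

def sse(X, y):
    """SSE of the least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta)**2)

rng = np.random.default_rng(3)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)       # x2, x3 truly irrelevant

X_full = np.column_stack([np.ones(n), x1, x2, x3])   # p = 3 covariates
X_red = np.column_stack([np.ones(n), x1])            # reduced model: H0: beta2 = beta3 = 0

p, p2 = 3, 2
sse_f, sse_r = sse(X_full, y), sse(X_red, y)
F = ((sse_r - sse_f) / p2) / (sse_f / (n - p - 1))
p_value = stats.f.sf(F, p2, n - p - 1)               # P(F_{p2, n-p-1} > F)
```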

3.8 Testing General Linear Hypothesis

The additional sum of squares principle can be used to test a general linear hypothesis of the form

    H0 : Cβ = b,    Ha : Cβ ≠ b

For example, we can test H0 : β1 = β2 = β3 ∧ β4 = 0. This can be written as

    H0 : Cβ = 0,    C = [ 0 1 −1  0 0 ]
                        [ 0 1  0 −1 0 ]
                        [ 0 0  0  0 1 ]

To fit the reduced model, in which β1 = β2 = β3 and β4 = 0, we can create a new variable

    zi := x_{i1} + x_{i2} + x_{i3}

and fit the one-dimensional model y = β0 + β1z instead. The degrees of freedom lost equal the number of constraints, rank C:

    F = ((SSER − SSEF)/rank C) / (SSEF/(n − p − 1)) ∼ F_{rank C, n−p−1}

Unfortunately, there is no general formula for the reduced model given C and b.

3.9 Interactions in Regression and Categorical Variables

An interaction effect is represented as a product of two or more variables. For example:

    y = β0 + β1x1 + β2x2 + β3x1x2 + ε

The product x1x2 is the interaction between x1 and x2, and β3 is the interaction effect. If the interaction term is significant, x1, x2 are significant factors even if the main effect of x1 or x2 is not significant. As long as the interaction is significant, do not drop constituents of the interaction term: if β3 ≠ 0, neither x1 nor x2 can be dropped, no matter what inference on β1, β2 suggests. Higher orders of interactions can be added as well, but they are difficult to interpret.

25 Definition

Categorical variables (or factor or qualitative variables) are variables that classify study subjects into groups.

Categorical variables require special attention in regression because, unlike continuous variables, they cannot enter the regression equation in their raw form. They need to be encoded into a set of binary variables, which can then be entered into the regression model. This is dummy coding or one-hot coding, and can be done automatically by statistical software. e.g.

    professional := [1, 0, 0]ᵀ,    whitecollar := [0, 1, 0]ᵀ,    bluecollar := [0, 0, 0]ᵀ
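A sketch of dummy coding for a 3-level variable, with "bluecollar" as the reference level (the level names follow the example above; the data are illustrative):

```python
import numpy as np

jobs = ["professional", "whitecollar", "bluecollar", "whitecollar", "professional"]
levels = ["professional", "whitecollar"]   # the reference level "bluecollar" is omitted

# each non-reference level becomes one 0/1 indicator column
D = np.column_stack([[1 if j == lv else 0 for j in jobs] for lv in levels])
# D would be appended to the design matrix alongside an intercept column
```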

Testing significance of the categorical predictor in this model is equivalent to testing that its dummy coëfficients vanish simultaneously, i.e. β1 = β2 = 0. Note that when x2 ∈ {0, 1}, the interaction term essentially allows two regression models to be bundled into one. This is because:

    y(x1, 0) = β0 + β1x1 + β2·0 + β3x1·0 + ε = β0 + β1x1 + ε

    y(x1, 1) = β0 + β1x1 + β2·1 + β3x1·1 + ε = (β0 + β2) + (β1 + β3)x1 + ε

Thus

• Testing for equal slope: H0 : β3 = 0

• Testing for equal intercept: H0 : β2 = 0

• Testing for equal slope and intercept: H0 : β2 = β3 = 0.

Caput 4

Model Checking and Residual Analysis

A regression model

    yi = β0 + β1x_{i1} + ··· + βp x_{ip} + εi

is specified under several model assumptions. Ordered from most to least important:

1. The relationship is linear: E[εi] = 0
2. var εi = σ² for some fixed σ²
3. The εi are independent.
4. The εi are normally distributed.

4.1 Errors and Residuals

Notice that

    R = (I − H)Y = (I − H)(Xβ + ε) = Xβ − X(XᵀX)⁻¹XᵀXβ + (I − H)ε = (I − H)ε

Hence E[R] = (I − H)E[ε] = 0. Moreover, since I − H is a projection matrix,

    var R = (I − H)(var ε)(I − H)ᵀ = σ²(I − H)² = σ²(I − H)

Since ε is multivariate normal,

    R ∼ N(0, σ²(I − H))

If the hat matrix H is relatively small compared to I, then R ≈ ε, and the residuals approximate the (unobservable) random errors:

    R ≈ N(0, σ²I)

Any patterns found in the residuals may suggest similar patterns in the errors.

Definition

The residual is R := Y − Ŷ. The standardised residual is

Ri(s) := Ri/σ̂

The Studentised residual is

Di := Ri/(σ̂ √(1 − Hi,i))

The basis of Studentisation is that

Ri ∼ N(0, σ²(1 − Hi,i))

which gives the approximate pivotal quantities

Ri/√(σ²(1 − Hi,i)) ∼ N(0, 1),  Ri/√(σ̂²(1 − Hi,i)) ≈ N(0, 1)

Note that Di does not exactly follow a Student distribution. About 95% of Studentised residuals should lie within [−2, +2]. This can be tested using a . Irrespective of whether or not the standard assumptions about the random errors hold, we have previously shown that

Σi Ri xi,j = 0,  Σi Ri Ŷi = 0

This is a pure consequence of the least-squares estimate and does not rely on model assumptions. Hence

R ⊥ xj,  R ⊥ Ŷ
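These orthogonality identities hold exactly for any OLS fit and can be confirmed numerically (a numpy sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
r = y - y_hat

# least-squares identities (no distributional assumptions needed):
# residuals are orthogonal to every column of X and to the fitted values
col_products = X.T @ r          # should be (numerically) zero
fit_product = r @ y_hat         # should be (numerically) zero
```
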

This suggests that we should not see any linear patterns in the plots of (xj, r) and (ŷ, r). If any pattern arises in these graphs, it may be due to the violation of certain standard model assumptions. Suppose we fit a linear model such as y = β0 + β1x + ε. The “true” relationship is unknown and could contain higher-order terms:

y = β0 + β1x + β2x² + ε

This is still a linear regression model, albeit with p = 2.

4.1.1 Checking Higher-Order Terms

If y is a polynomial in x, checking whether the fitted model is adequate is equivalent to checking whether the relationship is linear. Diagnostic plots:

• Plot r vs. x.

• Plot r vs. case order (sample index i).

If the residuals are randomly scattered, the model is adequate. Curvature may suggest the need for x² or higher-order polynomial terms.

Definition

For a covariate xj, the partial residuals are

Ri(j) := Ri + β̂j xi,j,  (i = 1, . . . , n)

The estimated effect of covariate xj is added back into the residuals. The plot of ri(j) w.r.t. xj is a partial residual plot. If the relationship in this plot is linear, xj should enter the model as a linear term; otherwise it should not.
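A numpy sketch of computing partial residuals on synthetic data (not course code). By the orthogonality of r to xj, the least-squares slope of the partial-residual plot reproduces β̂j exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_hat

# partial residuals for covariate x2: add its estimated effect back in
r_partial = r + beta_hat[2] * x2

# since r is orthogonal to both x2 and the intercept column, regressing
# r_partial on x2 recovers exactly the coefficient beta_hat[2]
slope = np.polyfit(x2, r_partial, 1)[0]
```
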

4.1.2 Checking Constant Variance

The plot of ri w.r.t. ŷi can show the scedasticity of the model. If the random scatter pattern appears to be symmetric w.r.t. ŷi, this suggests homoscedasticity. Otherwise the variance of the random errors is non-constant, i.e. heteroscedasticity. A marginal relationship plot is a plot of pairs in the set {x1, . . . , xp, y}. Note. All models are wrong, but some are useful.

4.2 Q-Q Plot and Data Transformation

If the random errors εi are IID normal,

• Residuals r1, . . . , rn should look like a random sample from a normal distribution.

• Ordered residuals r(1) < ··· < r(n) should look approximately like the quantiles of a normal distribution.

The Q-Q plot is a plot of sample quantiles w.r.t. theoretical quantiles. If the residuals form an approximate straight line, normality is satisfied.
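A sketch of the Q-Q idea using scipy's `probplot`, which pairs the ordered sample with theoretical normal quantiles and fits a straight line through them. The simulated residuals here are a stand-in for real regression residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 200)       # stand-in for regression residuals

# ordered sample quantiles vs. theoretical normal quantiles,
# plus a least-squares straight-line fit through the Q-Q points
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# r near 1 indicates the Q-Q points lie close to a straight line,
# supporting the normality assumption
```
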

4.3 Data Transformation

Consider the model yi = μi + εi. Suppose we know that the true variance is

var yi = h(μi)² σ²

where h is a known function and σ is a constant. We want to find a function g which transforms the yi in such a way as to make them homoscedastic. Recall the linear approximation

g(yi) = g(μi) + g′(μi)(yi − μi) + O((yi − μi)²) ≈ g(μi) + g′(μi)(yi − μi)

so

var g(yi) ≈ var(g(μi) + g′(μi)(yi − μi)) = g′(μi)² var yi = g′(μi)² h(μi)² σ²

If this should approximate σ², we need

g′(μi) = 1/h(μi)

Examples:

• If h(μi) = √μi, then g(yi) = √yi has approximately constant variance.

• If h(μi) = μi, then g(yi) = log yi has approximately constant variance.
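The square-root example can be illustrated numerically: when var yi = μi σ², the raw variances differ across groups while the variances of √yi are nearly equal. A simulation sketch (the group means and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
# two groups whose response variance grows with the mean: var y = mu * sigma^2
mu_small, mu_large, sigma = 25.0, 400.0, 1.0
y_small = mu_small + np.sqrt(mu_small) * sigma * rng.normal(size=5000)
y_large = mu_large + np.sqrt(mu_large) * sigma * rng.normal(size=5000)

# raw variances differ by roughly mu_large / mu_small = 16x
ratio_raw = np.var(y_large) / np.var(y_small)

# after g(y) = sqrt(y), both variances are approximately sigma^2 / 4
ratio_sqrt = np.var(np.sqrt(y_large)) / np.var(np.sqrt(y_small))
```
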

4.3.1 Box-Cox Power Transformation

The Box-Cox transformation is a way to transform a non-normal response variable to a normal shape.

1. Transform the population by

y* := (y^λ − 1)/(λ ỹ^(λ−1))  if λ ≠ 0,  y* := ỹ log y  if λ = 0

where ỹ is the geometric mean of the yi's.

2. λ should be chosen so that the fitted model produces the largest log-likelihood value (or smallest SSE value).

3. After λ is set, use the power transformation of the responses in the analysis:

y′ := y^λ  if λ ≠ 0,  y′ := log y  if λ = 0

Note. R's boxcox(fit) function can be used to calculate λ from a fitted model. The issue with power transformations is interpretation: it can be hard to interpret quantities such as volume^0.3. A logarithmic response log(y) = β0 + β1x1 + ε can be interpreted as y = e^β0 e^(β1x1) e^ε. In this respect, differences between log y's can be interpreted as ratios.
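In Python, `scipy.stats.boxcox` plays a role analogous to R's boxcox(fit): it selects λ by maximising the log-likelihood, though for a single sample rather than a fitted regression model. A sketch on simulated right-skewed (log-normal) data, for which the chosen λ should be near 0, i.e. the log transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = np.exp(rng.normal(1.0, 0.4, 300))   # right-skewed (log-normal) response

# boxcox returns the transformed data and the lambda that maximises
# the Box-Cox log-likelihood
y_transformed, lam = stats.boxcox(y)
```
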

4.4 Weighted Least-Squares Regression

One of the common assumptions underlying linear regression methods is that the variance of the random error term is constant over all values of the explanatory variables (homoscedasticity). This assumption does not hold in many applications. The ordinary least squares (OLS) procedure treats all of the data equally. Let w be a weight vector. Weighted least squares (WLS) minimises

Σi wi(yi − xiᵀβ)²

Suppose the model under consideration is

Yi = xiᵀβ + εi

where the εi are independent normal with mean 0 and var εi = ci σ². Define the weight wi := 1/ci for each subject. We can re-write the above model as

√wi Yi = √wi xiᵀβ + √wi εi

Now the transformed random error has var(√wi εi) = σ². In general, define the weight matrix W := diag(w). The weighted sum of squares of errors can be written as

Σi (√wi Yi − √wi xiᵀβ)² = ‖W^(1/2)Y − W^(1/2)Xβ‖²

Thus the WLS estimator of β is

β̃ = arg minβ ‖W^(1/2)Y − W^(1/2)Xβ‖² = ((W^(1/2)X)ᵀ(W^(1/2)X))⁻¹ (W^(1/2)X)ᵀ W^(1/2)Y = (XᵀWX)⁻¹ XᵀWY

Since var Y = var ε = σ²W⁻¹, the variance of β̃ is

var β̃ = σ²(XᵀWX)⁻¹

Notice that:

• The weight wi = 1/ci is inversely proportional to the error variance.

• An observation with a small error variance contains more precise information and hence is given a larger weight.

• The subject-specific weights need to be known in order to obtain estimates.

• In designed experiments, sometimes the weights wi are known, e.g. when the response is the average of ki repeated observations of a variable:

Yi = (1/ki) Σj=1..ki Yi,j

Then var εi = σ²/ki and wi = ki.
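A minimal numpy sketch of the WLS estimator β̃ = (XᵀWX)⁻¹XᵀWY on synthetic data with var εi = ci σ² and known ci, checking that it matches OLS on the square-root-weighted model:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.uniform(1, 5, n)
c = x ** 2                              # error variance proportional to c_i
y = 1 + 2 * x + np.sqrt(c) * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / c)                    # w_i = 1 / c_i

# closed-form WLS estimator: (X^T W X)^{-1} X^T W Y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# equivalent computation: OLS on the square-root-weighted data
sw = np.sqrt(1.0 / c)
beta_transformed = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
```
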

In some cases, the wi's need to be estimated. If the plot of r w.r.t. ŷ (or xj) shows a “fan” shape, then we regress |r| on ŷ (or xj):

|ri| = β0 + β1ŷi + εi

The resulting fitted values, r̂i, are used to obtain the weights

wi := 1/r̂i²

4.5 Outliers and Influential Cases

An outlier is a particular case with an unusual (extreme) value in the response and/or the explanatory variables. A plot of Studentised residuals w.r.t. fitted values can show outliers. A large value of the Studentised residual

di := ri/√(σ̂²(1 − Hi,i))

that is outside of the ±2 band may indicate an outlier in the response. Recall that the hat matrix H is symmetric and idempotent, so HH = H, and

Hi,i = Σj=1..n Hi,j Hj,i = Hi,i² + Σj≠i Hi,j²

From this we have

Hi,i(1 − Hi,i) = Σj≠i Hi,j²

Since the fitted values can be written as Ŷ = HY,

Ŷi = Σj=1..n Hi,j Yj = Hi,i Yi + Σj≠i Hi,j Yj

This shows that as Hi,i → 1, Hi,j → 0 and Ŷi → Yi, so Ŷi is very much affected by Yi.

Definition

The leverage of the ith observation is Hi,i.

Hi,i is large when (xi1, . . . , xip) is far away from the centroid (x̄1, . . . , x̄p). For example, in simple linear regression,

Hi,i = 1/n + (xi − x̄)²/Sxx

The sum of all leverage values is

tr H = tr(X(XᵀX)⁻¹Xᵀ) = tr((XᵀX)⁻¹XᵀX) = tr Ip+1 = p + 1

so the average leverage is

h̄ = tr H / n = (p + 1)/n

Rule of thumb: If Hi,i satisfies

Hi,i > 2h̄ = 2(p + 1)/n

then this case has extreme values of the explanatory variables and is considered a high-leverage case.
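The identities tr H = p + 1 and H² = H, and the 2h̄ rule of thumb, can be checked directly with a numpy sketch on a synthetic design matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# hat matrix H = X (X^T X)^{-1} X^T; its diagonal entries are the leverages
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

trace_H = leverage.sum()                # equals p + 1
h_bar = trace_H / n                     # average leverage
high_leverage = leverage > 2 * h_bar    # rule-of-thumb flag
```
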

4.5.1 Influential Case

Definition

A case is influential if removing its observation from the data results in dramatically different results of the regression model.

Define the response vector and design matrix without the ith observation to be Y(−i), X(−i). Regressing on this data gives an estimate β̃(−i) without the ith observation. Let β̃ be the estimator with all observations.

Definition

The Cook's distance between β̃ and β̃(−i) is

Di := (β̃ − β̃(−i))ᵀ [σ̂²(XᵀX)⁻¹]⁻¹ (β̃ − β̃(−i)) / (p + 1) = (β̃ − β̃(−i))ᵀ (XᵀX) (β̃ − β̃(−i)) / ((p + 1) σ̂²)

4.5.1 Theorem.

Di = Hi,i di² / ((1 − Hi,i)(p + 1))

where di is the Studentised residual.

The simplified formula makes sense since a case is influential only if Hi,i and di are both large.

• If Di > 1 we should be concerned.

• If Di > p/n the case is noteworthy.

• Gaps in Di are interesting.
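Theorem 4.5.1 can be verified numerically: the simplified formula agrees with the direct leave-one-out computation from the definition. A numpy sketch on synthetic data (not course code):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ y)
r = y - X @ beta
H = X @ np.linalg.solve(XtX, X.T)
h = np.diag(H)
sigma2 = (r @ r) / (n - p - 1)          # sigma^2 estimate from the full fit
d = r / np.sqrt(sigma2 * (1 - h))       # studentised residuals

# simplified formula from Theorem 4.5.1
D_formula = h * d**2 / ((1 - h) * (p + 1))

# direct computation from the definition: refit without each observation
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    diff = beta - beta_i
    D_direct[i] = diff @ XtX @ diff / ((p + 1) * sigma2)
```
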

Caput 5

Model/Variable Selection

In many observational studies, there may be a large pool of potential explanatory variables measured along with the response, and the researchers have very little idea of which ones are important factors. We wish to develop a systematic procedure for finding the best model with an appropriate subset of covariates. During such a procedure we may consider many possible models, and we strive to achieve a balance between model fit and model simplicity.

• Too few covariates: an under-specified model tends to produce biased estimates and predictions.

• Too many covariates: an over-specified model tends to have less precise estimates and predictions.

5.1 Automated Methods

Suppose we start with a full set of q explanatory variables, each of which may be included in or left out of a regression analysis.

1: procedure ForwardSelection(y, {x1, . . . , xq}, α) ▷ α is the significance level
2:   E ← ∅ ▷ Selected explanatory variables
3:   X ← {x1, . . . , xq}
4:   while |X| > 0
5:     M ← fit(y, E)
6:     for xi ∈ X
7:       Mi ← fit(y, E ∪ {xi}) ▷ Fit model with xi added
8:       pi ← test(H0 : βi = 0, Mi, M) ▷ Test the significance of xi
9:     end for
10:    i ← arg minj pj
11:    if pi ≥ α
12:      break
13:    else
14:      E ← E ∪ {xi}
15:      X ← X \ {xi}
16:    end if
17:  end while
18:  return E
19: end procedure

Another method is to drop the variable with the highest p-value:

1: procedure BackwardElimination(y, {x1, . . . , xq}, α) ▷ α is the significance level
2:   E ← {x1, . . . , xq}
3:   while |E| > 0
4:     M ← fit(y, E)
5:     for xi ∈ E
6:       pi ← test(H0 : βi = 0, M) ▷ Test the significance of xi
7:     end for
8:     i ← arg maxj pj
9:     if pi < α
10:      break
11:    else
12:      E ← E \ {xi}
13:    end if
14:  end while
15:  return E
16: end procedure

In general the two methods yield different models.
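An executable sketch of the ForwardSelection procedure above, using a partial F-test for the significance of each candidate variable (numpy/scipy; the helper names and the synthetic data are mine, not course code):

```python
import numpy as np
from scipy import stats

def sse(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def partial_f_pvalue(y, X_small, X_big):
    """p-value of the partial F-test comparing nested models."""
    n = len(y)
    df1 = X_big.shape[1] - X_small.shape[1]
    df2 = n - X_big.shape[1]
    F = ((sse(X_small, y) - sse(X_big, y)) / df1) / (sse(X_big, y) / df2)
    return stats.f.sf(F, df1, df2)

def forward_selection(y, X, alpha=0.05):
    """Greedily add the covariate with the smallest p-value until none pass alpha."""
    n, q = X.shape
    selected, remaining = [], list(range(q))
    while remaining:
        base = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        pvals = {j: partial_f_pvalue(y, base, np.column_stack([base, X[:, j]]))
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# synthetic example: only covariates 0 and 2 actually enter the model
rng = np.random.default_rng(9)
n = 200
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=n)
selected = forward_selection(y, X)
```
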

• ForwardSelection never looks back on selected variables, so it may miss many models in the domain.

• BackwardElimination never looks back on dropped variables, so it may miss many models in the domain.

Stepwise selection is a combination of the backward and forward methods. The procedure depends on two significance levels:

• α1: Alpha to enter

• α2: Alpha to drop.

At each stage a variable may be added or removed, and there are several variants on how this is done.

5.2 All Subsets Regression

With a full set of q covariates, there are 2^q possible candidate models. In principle we can fit them all and choose the “best” based on fit criteria. Numerical criteria for model comparison:

• Mean squared error (MSE): an optimum subset model is the one with the smallest MSE.

• Adjusted R²: an optimum subset model is the one with the largest R²adj.

These two criteria are equivalent since

R²adj = 1 − (n − 1)/(n − p − 1) · (1 − SSR/SST) = 1 − (n − 1)/(n − p − 1) · SSE/SST = 1 − (n − 1)/SST · MSE

Another measure is Mallow’s Cp statistic

Cp := SSEp/MSEfull − (n − 2(p + 1))

A candidate model is good if Cp ≤ p + 1, and generally the smaller the better. The Akaike Information Criterion is

AIC := −2(ℓ(β̂) − (p + 1))

A smaller AIC indicates a better model.
