MSH3 Ch. 3 Contents

§3 Model selection
§3.1 Deviance for Likelihood Ratio Tests
§3.2 Wald Tests
§3.3 AIC and BIC
§3.4 Checking for normal model
§3.5 Multicollinearity - Ridge regression
§3.6 Model fitting
  §3.6.1 For continuous covariates
  §3.6.2 For categorical covariates
  §3.6.3 Both continuous and categorical covariates

§3 Model selection

§3.1 Deviance for Likelihood Ratio Tests

The likelihood ratio (LR) test is based on the comparison of maximized likelihoods for nested models: a model of interest $\omega$ against a saturated model $\Omega$ that provides a separate parameter for each observation ($\tilde\mu_i = y_i$). Let $\hat\theta_i$ and $\tilde\theta_i$ denote the parameter estimates under models $\omega$ and $\Omega$ respectively. The LR test criterion to compare the two models has the form
$$-2\ln\lambda = -2\ln\frac{L(\hat\theta)}{L(\tilde\theta)} = 2[\ell_\Omega(\tilde\theta)-\ell_\omega(\hat\theta)] = 2\sum_{i=1}^n \frac{y_i(\tilde\theta_i-\hat\theta_i)-b(\tilde\theta_i)+b(\hat\theta_i)}{a_i(\phi)}$$
assuming that $a_i(\phi)=\phi/w_i$ ($w_i=1$) for known prior weights $w_i$. In GLM, the deviance for model $\omega$ is defined as
$$D(\omega) = 2a(\phi)[\ell_\Omega(\tilde\theta)-\ell_\omega(\hat\theta)] = 2\sum_{i=1}^n [y_i(\tilde\theta_i-\hat\theta_i)-b(\tilde\theta_i)+b(\hat\theta_i)].$$
The LR criterion $-2\ln\lambda$ is the deviance divided by $a(\phi)$ or $\phi$, and is called the scaled deviance. To compare two nested models, the log of the ratio of the likelihood functions under the two models can be written as a difference of deviances, since the maximized log-likelihood under the saturated model cancels out. Thus we have
$$-2\ln\lambda = -2\ln\frac{L_{\omega_1}(\hat\theta_1)}{L_{\omega_2}(\hat\theta_2)} = 2\{[\ell_\Omega(\tilde\theta)-\ell_{\omega_1}(\hat\theta_1)]-[\ell_\Omega(\tilde\theta)-\ell_{\omega_2}(\hat\theta_2)]\} = \frac{D(\omega_1)-D(\omega_2)}{\phi} \overset{n\to\infty}{\sim} \chi^2_{p_2-p_1}.$$
The scale parameter $\phi$ is either known or estimated using the larger model $\omega_2$.
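In R this comparison is carried out as an analysis of deviance. The following is a minimal sketch (not from the notes) using simulated Poisson data; all variable names are illustrative.

# Sketch: scaled-deviance LR test for two nested Poisson GLMs (phi = 1 for Poisson,
# so the scaled deviance equals the deviance).  Simulated data for illustration only.
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rpois(n, exp(0.5 + 0.3 * x1))          # x2 has no real effect

fit1 <- glm(y ~ x1,      family = poisson)   # smaller model, omega_1
fit2 <- glm(y ~ x1 + x2, family = poisson)   # larger model, omega_2

dev.diff <- fit1$deviance - fit2$deviance    # D(omega_1) - D(omega_2)
pchisq(dev.diff, df = 1, lower.tail = FALSE) # compare with chi-square on p2 - p1 = 1 df
anova(fit1, fit2, test = "Chisq")            # anova() reports the same analysis of deviance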

The deviances are:

For the Normal distribution:
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] = \exp\left[\frac{y\mu-\tfrac{1}{2}\mu^2}{\sigma^2}-\frac{y^2}{2\sigma^2}-\frac{1}{2}\log(2\pi\sigma^2)\right].$$
We have $\theta=\mu$ and $b(\theta)=\tfrac{1}{2}\theta^2$. Hence $\tilde\theta=y$, $\hat\theta=\hat\mu$, $b(\tilde\theta)=\tfrac{1}{2}y^2$ and $b(\hat\theta)=\tfrac{1}{2}\hat\mu^2$, and the deviance is the RSS since
$$\text{Deviance} = 2[y(\tilde\theta-\hat\theta)-b(\tilde\theta)+b(\hat\theta)] = 2\left[y(y-\hat\mu)-\tfrac{1}{2}y^2+\tfrac{1}{2}\hat\mu^2\right] = 2\left[\tfrac{1}{2}y^2-y\hat\mu+\tfrac{1}{2}\hat\mu^2\right] = (y-\hat\mu)^2.$$

For the Poisson distribution: $\theta=\ln\mu$ and $b(\theta)=\mu=e^\theta$. Hence $\tilde\theta=\ln y$, $\hat\theta=\ln\mu$, $b(\tilde\theta)=y$ and $b(\hat\theta)=\mu$.
$$\text{Deviance} = 2[y(\tilde\theta-\hat\theta)-b(\tilde\theta)+b(\hat\theta)] = 2[y\ln(y/\mu)-(y-\mu)].$$

For the Binomial distribution: $\theta=\ln\frac{\pi}{1-\pi}$ and $b(\theta)=-n\ln(1-\pi)$. Hence $\tilde\theta=\ln\frac{y/n}{1-y/n}$, $\hat\theta=\ln\frac{\mu/n}{1-\mu/n}$, $b(\tilde\theta)=-n\ln(1-\tfrac{y}{n})$ and $b(\hat\theta)=-n\ln(1-\tfrac{\mu}{n})$.
$$\begin{aligned}\text{Deviance} &= 2[y(\tilde\theta-\hat\theta)-b(\tilde\theta)+b(\hat\theta)]\\ &= 2\left[y\left(\ln\frac{y/n}{1-y/n}-\ln\frac{\mu/n}{1-\mu/n}\right)+n\ln(1-\tfrac{y}{n})-n\ln(1-\tfrac{\mu}{n})\right]\\ &= 2\left[y\ln(y/\mu)-y\ln\frac{n-y}{n-\mu}+n\ln\frac{n-y}{n}-n\ln\frac{n-\mu}{n}\right]\\ &= 2\{y\ln(y/\mu)+(n-y)\ln[(n-y)/(n-\mu)]\}\end{aligned}$$
where $\mu=n\pi$.


For the Gamma distribution: $a(\phi)=1/\alpha$, $\theta=-\beta/\alpha=-\mu^{-1}$ and $b(\theta)=-\ln\beta=-\ln(\alpha/\mu)$, where $\mu=\alpha/\beta$. Hence $\tilde\theta=-\frac{1}{y}$, $\hat\theta=-\frac{1}{\mu}$, $b(\tilde\theta)=-\ln(\alpha/y)$ and $b(\hat\theta)=-\ln(\alpha/\mu)$.
$$\text{Deviance} = 2[y(\tilde\theta-\hat\theta)-b(\tilde\theta)+b(\hat\theta)] = 2\left[-\frac{y}{y}+\frac{y}{\mu}+\ln(\alpha/y)-\ln(\alpha/\mu)\right] = 2\left[\frac{y-\mu}{\mu}-\ln(y/\mu)\right].$$
Note: the canonical link is $\eta=1/\mu$, and other links such as $\eta=\ln\mu$ and $\eta=\mu^\gamma$, $\gamma\neq 0$, give the simpler deviance
$$\text{Deviance} = 2\sum_{i=1}^n \ln(\mu_i/y_i) \qquad\text{since}\qquad \sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i}=0.$$
Proof: Suppose $\eta_i=\ln\mu_i=\beta_1+\sum_{j=2}^p\beta_j x_{ij}$, i.e. $\mu_i=\exp(\eta_i)$. Then
$$\frac{\partial\ell}{\partial\beta_1} = \sum_{i=1}^n\frac{\partial\ell}{\partial\theta_i}\frac{\partial\theta_i}{\partial\mu_i}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_1} = \sum_{i=1}^n\frac{y_i-\mu_i}{a(\phi)}\,\frac{1}{\mu_i^2}\,\mu_i\cdot 1 = \alpha\sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i}=0$$
since $\frac{\partial\mu_i}{\partial\theta_i}=b''(\theta_i)=\mu_i^2$, $\frac{\partial\mu_i}{\partial\eta_i}=\mu_i$ and $\frac{\partial\eta_i}{\partial\beta_1}=1$.

Suppose instead $\eta_i=\mu_i^\gamma$. Then
$$\frac{\partial\ell}{\partial\beta_j} = \sum_{i=1}^n\frac{y_i-\mu_i}{a(\phi)b''(\theta_i)}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j} = \alpha\sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i^2}\,\frac{\mu_i}{\gamma\eta_i}\,x_{ij} = \frac{\alpha}{\gamma}\sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i\eta_i}x_{ij}=0$$
since $\frac{\partial\eta_i}{\partial\mu_i}=\gamma\mu_i^{\gamma-1}=\gamma\eta_i/\mu_i$. Now since $\sum_{j=1}^p x_{ij}\beta_j=\eta_i$,
$$\sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i} = \sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i}\cdot\frac{\sum_{j=1}^p x_{ij}\beta_j}{\eta_i} = \sum_{j=1}^p\beta_j\left(\sum_{i=1}^n\frac{y_i-\mu_i}{\mu_i\eta_i}x_{ij}\right)=0.$$
The deviances and scaled deviances for the distributions are:

Normal:             $\sum_{i=1}^n(y_i-\hat\mu_i)^2$  and  $\frac{1}{\sigma^2}\sum_{i=1}^n(y_i-\hat\mu_i)^2$
Poisson:            $2\sum_{i=1}^n[y_i\ln(y_i/\hat\mu_i)-(y_i-\hat\mu_i)]$
Binomial:           $2\sum_{i=1}^n\left\{y_i\ln\left(\frac{y_i}{\hat\mu_i}\right)+(n_i-y_i)\ln\left(\frac{n_i-y_i}{n_i-\hat\mu_i}\right)\right\}$
Gamma:              $2\sum_{i=1}^n\left[-\ln\left(\frac{y_i}{\hat\mu_i}\right)+\frac{y_i-\hat\mu_i}{\hat\mu_i}\right]$  and  $2\alpha\sum_{i=1}^n\left[-\ln\left(\frac{y_i}{\hat\mu_i}\right)+\frac{y_i-\hat\mu_i}{\hat\mu_i}\right]$
Inverse Gaussian:   $\sum_{i=1}^n\frac{(y_i-\hat\mu_i)^2}{\hat\mu_i^2 y_i}$
Negative Binomial:  $2\sum_{i=1}^n\left\{y_i\ln\left(\frac{y_i}{\mu_i}\right)-(y_i+r)\ln\left(\frac{y_i+r}{\mu_i+r}\right)\right\}$
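As a small numerical check (a sketch with simulated data, not from the notes), the Poisson deviance formula in the table agrees with the deviance reported by glm():

# Verify 2 * sum{ y*log(y/mu) - (y - mu) } against glm()'s deviance.
set.seed(2)
x   <- rnorm(50)
y   <- rpois(50, exp(1 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)
# observations with y = 0 contribute only -(y - mu), since y*log(y/mu) -> 0
dev.byhand <- 2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
c(dev.byhand, fit$deviance)   # the two values should agree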


Suppose we consider two models $\omega_1$ and $\omega_2$ such that $\omega_1\subset\omega_2$, that is, $\omega_1$ is a subset or special case of $\omega_2$ obtained by setting some of the parameters in $\omega_2$ to zero. Let $\beta'=(\beta_1',\beta_2')$ in $\omega_2$, where $\beta_1$ is the parameter vector in $\omega_1$. We test the hypothesis

$$H_0: \beta_2=0 \;\Longleftrightarrow\; C\beta=0 \text{ under model } \omega_1,$$
for some matrix $C$. We can partition the design matrix into two components $X=(X_1, X_2)$ with $p_1$ and $p_2$ predictors respectively. Then
$$Y = [\,X_1\;|\;X_2\,]\begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}+\epsilon = X_1\beta_1+X_2\beta_2+\epsilon,$$
where $X_1$ is full-rank $n\times p_1$, $X_2$ is full-rank $n\times p_2$ ($p_1+p_2=p$), $\beta_1$ is $p_1\times 1$ and $\beta_2$ is $p_2\times 1$. Here we have

1. a smaller model $\omega_1: Y=X_1\beta_1+\epsilon$ ($\beta_2=0$) under $H_0$ with the first $p_1$ predictors in $X_1$, and

2. a larger model $\omega_2: Y=X_1\beta_1+X_2\beta_2+\epsilon$ ($\beta_2\neq 0$) under $H_1$ with all $p=p_1+p_2$ predictors in $X$. The projection matrices and RSS are

$$H=X(X'X)^{-1}X' \quad\text{and}\quad H_1=X_1(X_1'X_1)^{-1}X_1',$$
$$\text{RSS}(\omega_2)=Y'(I-H)Y \quad\text{and}\quad \text{RSS}(\omega_1)=Y'(I-H_1)Y$$
under models $\omega_2$ and $\omega_1$ respectively. Note that $H$ is symmetric and $H^2=H$. The $R^2$ and adjusted $R^2$ are
$$R^2=\frac{Y'(H-\mathbf{1}\mathbf{1}'/n)Y}{Y'(I-\mathbf{1}\mathbf{1}'/n)Y} \qquad\text{and}\qquad R^2_a=1-(1-R^2)\frac{n-1}{n-p}.$$
The ANOVA table is

The Hierarchical ANOVA Table

Source of variation   Sum of squares                                    Degrees of freedom
SS(β1)                RSS(φ) − RSS(ω1) = Y'(H1 − 11'/n)Y                p1 − 1
SS(β2|β1)             RSS(ω1) − RSS(ω2) = Y'(H − H1)Y                   p2
Residual              RSS(ω2) = Y'(I − H)Y                              n − p
Total                 RSS(φ) = Y'(I − 11'/n)Y                           n − 1

where φ denotes the null model. In particular, if the columns of X1 are orthogonal to all columns of X2,

$$\hat\beta_1=(X_1'X_1)^{-1}X_1'Y, \qquad \hat\beta_2=(X_2'X_2)^{-1}X_2'Y \qquad\text{and}\qquad SS(\beta_2|\beta_1)=SS(\beta_2).$$

To test $H_0$, the likelihoods under models $\omega_j$, $j=1,2$, are compared in the ratio
$$\lambda=\frac{L(\hat\beta_{\omega_1};y)}{L(\hat\beta_{\omega_2};y)},$$
where $\hat\beta_{\omega_j}$ is the ML estimate of $\beta$ under model $\omega_j$ and $\hat\beta_{\omega_1}=\hat\beta_1$. The LR $\lambda$ is bounded between 0 and 1 (the likelihood of the smaller model $\omega_1$ is smaller). Values close to 0 indicate that the smaller model is not acceptable because it would make the observed data very unlikely. Values close to 1 indicate that the smaller model is almost as good as the larger model, making the data nearly as likely. With normal data, the maximized log-likelihoods are
$$\max_{\beta_{\omega_j}}\ln L(\beta_{\omega_j}) = -\frac{n}{2}\ln(2\pi\sigma^2)-\frac{\text{RSS}(\omega_j)}{2\sigma^2}.$$
Under certain regularity conditions,

$$-2\ln\lambda = 2\ln L(\hat\beta_{\omega_2};y)-2\ln L(\hat\beta_{\omega_1};y)=\frac{\text{RSS}(\omega_1)-\text{RSS}(\omega_2)}{\sigma^2}.$$

Hence for normal data, $D(\omega)=\text{RSS}(\omega)$ and $\phi=\sigma^2$. In the case when $\sigma^2$ is unknown, we estimate it using the RSS of the larger model $\omega_2$,
$$\hat\sigma^2=\frac{\text{RSS}(\omega_2)}{n-p}.$$
So the criterion becomes
$$-2\ln\lambda=\frac{\text{RSS}(\omega_1)-\text{RSS}(\omega_2)}{\text{RSS}(\omega_2)/(n-p)}\overset{n\to\infty}{\sim}\chi^2_{p-p_1}.$$
Hence

$$F_0=\frac{-2\ln\lambda}{p-p_1}=\frac{[\text{RSS}(\omega_1)-\text{RSS}(\omega_2)]/(p-p_1)}{\text{RSS}(\omega_2)/(n-p)}\overset{n\to\infty}{\sim} F_{p-p_1,\,n-p}.$$

Note: If $X\sim F_{\nu_1,\nu_2}$, then $\nu_1 X\to\chi^2_{\nu_1}$ as $\nu_2\to\infty$. Hence, if $F_0\sim F_{p_2,\,n-p}$, then $p_2F_0\to\chi^2_{p_2}$ as $n\to\infty$, where $p_2=p-p_1$. That is, as $n\to\infty$ the degrees of freedom in the denominator approach $\infty$, the $F_{p-p_1,\,n-p}$ distribution converges to that of $\chi^2_{p-p_1}/(p-p_1)$, and the asymptotic (chi-square) and exact ($F$) criteria become equivalent.
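A quick numerical check of this limit (a sketch, not from the notes): multiplying the F quantile by p2 approaches the chi-square quantile as the denominator degrees of freedom grow.

p2 <- 2; p <- 4
sapply(c(10, 50, 500, 5000), function(n) p2 * qf(0.95, p2, n - p))  # approaches the value below
qchisq(0.95, p2)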

We reject $H_0$ if $F_0>F_{0.95;\,p-p_1,\,n-p}$.

§3.2 Wald Tests

Under certain regularity conditions, the ML estimator $\hat\theta$, in large samples, has approximately a multivariate normal distribution with mean equal to the true parameter value and variance-covariance matrix given by the inverse of the information matrix, so that

$$\hat\theta\sim N_p(\theta,\; I^{-1}(\theta)) \qquad\text{where}\qquad I(\theta)=\text{var}[u(\theta)]=-E_y\left[\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'}\right].$$
The regularity conditions include: 1. the true parameter value $\theta$ must be interior to the parameter space, 2. the log-likelihood must be three times differentiable, and 3. the third derivatives must be bounded. This result provides a basis for constructing tests of hypotheses and confidence regions. Under the hypothesis

$H_0:\theta=\theta_0$, the Wald statistic is given by the quadratic form

$$W_0=(\hat\theta-\theta_0)'[\text{var}(\hat\theta)]^{-1}(\hat\theta-\theta_0)\overset{n\to\infty}{\sim}\chi^2_p,$$
which in large samples has approximately a chi-squared distribution with $p$ degrees of freedom, where $p=\dim(\theta)$. The variance-covariance matrix $\text{var}(\hat\theta)$ can be replaced by the estimated variance-covariance matrix of the ML estimate,

$$\widehat{\text{var}}(\hat\theta)=[I_e(\hat\theta)]^{-1}\ \text{ or }\ [I_o(\hat\theta)]^{-1},$$
the inverse of the expected or the observed information matrix evaluated at $\hat\theta$.

Example: (geometric distribution) Consider again our sample of $n=20$ observations from a geometric distribution with sample mean $\bar y=3$. The ML estimate was $\hat\pi=0.25$ and $I_o(\hat\pi)=426.67$. To test $H_0:\pi=0.15$, the test statistic is

$$\widehat{\text{var}}(\hat\pi)=I_o(\hat\pi)^{-1}=1/426.67,$$
$$W_0=(\hat\pi-\pi_0)'[\widehat{\text{var}}(\hat\pi)]^{-1}(\hat\pi-\pi_0)=(0.25-0.15)^2\times 426.67=4.27$$
with one degree of freedom. The associated p-value is 0.039, so we would reject $H_0$ at the 5% significance level.

In particular, if $Y\sim N_n(X\beta,\sigma^2 I)$ where $\beta'=(\beta_1',\beta_2')$, the Wald statistic for testing $H_0:\beta_2=0$ is
$$W_0=\hat\beta_2'[\text{var}(\hat\beta_2)]^{-1}\hat\beta_2\overset{n\to\infty}{\sim}\chi^2_{p_2}.$$
When the subset has only one element, we usually take the square root of the Wald statistic and treat the ratio
$$z=\frac{\hat\theta_2}{\sqrt{\text{var}(\hat\theta_2)}} \qquad\text{or}\qquad \frac{\hat\beta_2}{\sqrt{\text{var}(\hat\beta_2)}} \qquad (1)$$
as a t statistic. Remark:

1. The approximate large-sample chi-squared distribution for $W_0$ follows from the following theorem:

Theorem: If $Y\sim N_n(\mu,\Sigma)$ with $\Sigma$ positive definite and $X$ positive semi-definite with $\text{rank}(X)=p_2$, then the random variable $Y'XY\sim\chi^2_{p_2}(\tfrac{1}{2}\mu'X\mu)$, a non-central chi-square distribution, iff $(X\Sigma)^2=X\Sigma$.

For testing $H_0:\theta=\theta_0$, since $\hat\theta\sim N_p(\theta,\text{var}(\hat\theta))$, setting $Y=\hat\theta$, $\mu=\theta$, $\Sigma=\text{var}(\hat\theta)$ and $X=[\text{var}(\hat\theta)]^{-1}$, we have
$$\hat\theta'[\text{var}(\hat\theta)]^{-1}\hat\theta\overset{n\to\infty}{\sim}\chi^2_p\!\left(\tfrac{1}{2}\theta'[\text{var}(\hat\theta)]^{-1}\theta\right)\overset{H_0}{=}\chi^2_p\!\left(\tfrac{1}{2}\theta_0'[\text{var}(\hat\theta)]^{-1}\theta_0\right)$$
iff $(X\Sigma)^2=([\text{var}(\hat\theta)]^{-1}[\text{var}(\hat\theta)])^2=I^2=I=X\Sigma$. Subtracting the center, $W_0=(\hat\theta-\theta_0)'[\text{var}(\hat\theta)]^{-1}(\hat\theta-\theta_0)\overset{n\to\infty}{\sim}\chi^2_p$.

Similarly, for testing $H_0:\beta_2=0$, since $\hat\beta_2\sim N_{p_2}(\beta_2,\text{var}(\hat\beta_2))$, we have
$$W_0=\hat\beta_2'[\text{var}(\hat\beta_2)]^{-1}\hat\beta_2\overset{H_0}{\sim}\chi^2_{p_2}\!\left(\tfrac{1}{2}\,0'[\text{var}(\hat\beta_2)]^{-1}0\right)=\chi^2_{p_2}.$$

2. To test $H_0: c'\beta=0$, where $c$ ($p\times 1$) $\in R(X)$, we use

$$\frac{c'(\hat\beta-\beta)}{\hat\sigma\sqrt{\lambda'c}}\sim t_{n-p}, \ \text{ for some } \lambda \qquad (2)$$
where $\text{rank}(X)=p$, as the test statistic and for constructing a confidence interval (CI) for $c'\beta$.

Proof: The space generated by the rows of $X$ and the rows of $X'X$ are the same, that is, $R(X)=R(X'X)$. If $c$ ($p\times 1$) $\in R(X)=R(X'X)$ is a linear combination of the rows of $X$ or $X'X$, then $c$ can be written as $c=X'X\lambda$ for some $\lambda$ ($p\times 1$). So

$$c'\hat\beta\sim N_1(c'\beta,\ \sigma^2\lambda'c) \quad\text{since}$$

$$\text{var}(c'\hat\beta)=c'\text{var}(\hat\beta)c=\sigma^2c'(X'X)^{-1}c=\sigma^2\lambda'X'X(X'X)^{-1}c=\sigma^2\lambda'c.$$

3. In general, to test $H_0: c_1'\beta=c_2'\beta=\cdots=c_{p_2}'\beta=0$, or $H_0: c'\beta=0$, where $c_i\in R(X')$, $c_1,\dots,c_{p_2}$ are linearly independent and $c=(c_1,\dots,c_{p_2})$ is a $p\times p_2$ matrix ($p=p_1+p_2$), we use the $F$ test:
$$F_0=\frac{[\text{RSS}(\omega_1)-\text{RSS}(\omega_2)]/p_2}{\text{RSS}(\omega_2)/(n-p)}=\frac{[SS(\beta)-SS(\beta_1)]/p_2}{\text{RSS}(\omega_2)/(n-p)}\sim F_{p_2,\,n-p}$$
since
$$\text{RSS}(\omega_2)\sim\sigma^2\chi^2_{n-p} \ \text{ under the larger model } \omega_2 \text{ without } c'\beta=0,$$
$$\text{RSS}(\omega_1)\sim\sigma^2\chi^2_{n-p_1} \ \text{ under the smaller model } \omega_1 \text{ with } c'\beta=0,$$
$$\text{RSS}(\omega_1)-\text{RSS}(\omega_2)\sim\sigma^2\chi^2_{p_2}, \ \text{ and so } F_0\sim F_{p_2,\,n-p} \quad\left(F=\frac{\chi^2_{p_2}/p_2}{\chi^2_{n-p}/(n-p)}\right).$$

When $n\to\infty$, $\hat\sigma^2=\text{RSS}(\omega_2)/(n-p)\to\sigma^2$ and $F_0\to\chi^2_{p_2}/p_2$. Thus, in large samples,
$$p_2F_0=\frac{SS(\hat\beta_2)}{\text{RSS}/(n-p)}=\frac{SS(\hat\beta_2)}{\hat\sigma^2}=W_0\sim\chi^2_{p_2}$$
and hence $W_0\sim\chi^2_{p_2}$ and $F_0=W_0/p_2\sim F_{p_2,\,n-p}$ are equivalent.

Example: (linear regression) Given the model $\omega_2$,

$$\omega_2: Y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\epsilon_i, \qquad (p=4)$$
or in matrix form $Y=X\beta+\epsilon$:
$$\begin{pmatrix}1\\4\\8\\9\\3\\8\\9\end{pmatrix}=\begin{pmatrix}1&-1&-1&1\\1&1&-1&1\\1&-1&1&1\\1&1&1&1\\1&0&0&0\\1&0&1&0\\1&0&2&0\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\\\beta_2\\\beta_3\end{pmatrix}+\begin{pmatrix}\epsilon_1\\\epsilon_2\\\epsilon_3\\\epsilon_4\\\epsilon_5\\\epsilon_6\\\epsilon_7\end{pmatrix},$$

where the columns of $X$ correspond to $1$, $X_1$, $X_2$, $X_1^2$. Test the hypothesis $H_0: C\beta=0$, or in matrix form
$$\begin{pmatrix}0&0&0&1\\0&1&-1&0\\0&1&-1&1\\0&2&-2&3\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_1\\\beta_2\\\beta_3\end{pmatrix}=\begin{pmatrix}0\\0\\0\\0\end{pmatrix}$$

(r3 = r1 + r2 & r4 = 3r1 + 2r2) which implies

H0 : β3 = 0; β1 = β2; β1 + β3 = β2; 2β1 + 3β3 = 2β2.

or H0 : β3 = 0; β1 = β2 = β.

Solution: Under the full model $\omega_2$, we have
$$(X'X)^{-1}=\begin{pmatrix}7&0&3&4\\0&4&0&0\\3&0&9&0\\4&0&0&4\end{pmatrix}^{-1}=\begin{pmatrix}\tfrac{1}{2}&0&-\tfrac{1}{6}&-\tfrac{1}{2}\\0&\tfrac{1}{4}&0&0\\-\tfrac{1}{6}&0&\tfrac{1}{6}&\tfrac{1}{6}\\-\tfrac{1}{2}&0&\tfrac{1}{6}&\tfrac{3}{4}\end{pmatrix},$$

   11  42 3     0  4  0 −1 0  1  X Y =   , βb = (X X) X Y =   ,  38   3  11 22 6

$$SS(\beta)=\hat\beta'X'Y=312.33, \qquad Y'Y=316 \qquad\text{and}\qquad \text{RSS}(\omega_2)=Y'Y-\hat\beta'X'Y=316-312.33=3.67.$$

Under $H_0$ the smaller model is $\omega_1: Y_i=\beta_0+\beta(x_{i1}+x_{i2})+\epsilon_i$, that is,
$$E(Y)=\beta_0+\beta(X_1+X_2)=\alpha_0+\alpha_1 Z \qquad (p_1=2)$$
where $\alpha_0=\beta_0$, $\alpha_1=\beta$, $Z=X_1+X_2$, and $C$ is reduced to
$$\begin{pmatrix}0&0&0&1\\0&1&-1&0\end{pmatrix}, \qquad p_2=\text{rank}(C)=2.$$


  1 −2  1 0     1 0        0 42 Z = [1 x1 + x2] =  1 2  , Z Y = ,   42  1 0     1 1  1 2

$$(Z'Z)^{-1}=\begin{pmatrix}7&3\\3&13\end{pmatrix}^{-1}=\frac{1}{82}\begin{pmatrix}13&-3\\-3&7\end{pmatrix},$$

$$\hat\alpha=(Z'Z)^{-1}Z'Y=\frac{21}{41}\begin{pmatrix}10\\4\end{pmatrix},$$
$$SS(\alpha)=\hat\alpha'Z'Y=\frac{21}{41}\,(10\ \ 4)\begin{pmatrix}42\\42\end{pmatrix}=301.17,$$
$$\text{RSS}(\omega_1)=Y'Y-\hat\alpha'Z'Y=316-301.17=14.83.$$

Hence, RSS(ω1) − RSS(ω2) = 14.83 − 3.67 = 11.16 or SS(β|α) = SS(β) − SS(α) = 312.33 − 301.17 = 11.16 at p2 = p − p1 = 4 − 2 = 2 df and n − p = 7 − 4 = 3.

$$F_0=\frac{[\text{RSS}(\omega_1)-\text{RSS}(\omega_2)]/p_2}{\text{RSS}(\omega_2)/(n-p)}=\frac{11.16/2}{3.67/(7-4)}=4.56.$$

Since F2,3,0.95 = 9.55, we do not reject H0 which implies that E(Y ) = β0 + β(X1 + X2).

> y=c(1,4,8,9,3,8,9) > x1=c(-1,1,-1,1,0,0,0) > x2=c(-1,-1,1,1,0,1,2) > x3=x1*x1 > X=matrix(0,7,4) > X[,1]=c(rep(1,7)) > X[,2]=x1


> X[,3]=x2 > X[,4]=x3 > X [,1] [,2] [,3] [,4] [1,] 1 -1 -1 1 [2,] 1 1 -1 1 [3,] 1 -1 1 1 [4,] 1 1 1 1 [5,] 1 0 0 0 [6,] 1 0 1 0 [7,] 1 0 2 0 > XX=t(X)%*%X > XX [,1] [,2] [,3] [,4] [1,] 7 0 3 4 [2,] 0 4 0 0 [3,] 3 0 9 0 [4,] 4 0 0 4 > XY=t(X)%*%y > XY [,1] [1,] 42 [2,] 4 [3,] 38 [4,] 22 > XXI=solve(XX) > beta=c(0,0,0,0) > beta=XXI%*%XY > beta [,1] [1,] 3.666667 [2,] 1.000000 [3,] 3.000000 [4,] 1.833333 > SSB=t(beta)%*%XY > SST=t(y)%*%y


> SSR=SST-SSB > c(SSB,SST,SSR) [1] 312.333333 316.000000 3.666667 > Z=matrix(0,7,2) > Z[,1]=c(rep(1,7)) > Z[,2]=x1+x2 > ZZ=t(Z)%*%Z > ZZ [,1] [,2] [1,] 7 3 [2,] 3 13 > ZY=t(Z)%*%y > ZY [,1] [1,] 42 [2,] 42 > ZZI=solve(ZZ) > beta1=c(0,0) > beta1=ZZI%*%ZY > beta1 [,1] [1,] 5.121951 [2,] 2.048780 > SSB1=t(beta1)%*%ZY > SSR1=SST-SSB1 > DIFF=SSB-SSB1 > c(SSB1,SSR1,DIFF) [1] 301.17073 14.82927 11.16260 > F0=(DIFF/2)/(SSR/3) > F0 [,1] [1,] 4.566519
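The same F test can be obtained directly with anova() on the two fitted lm models; a short sketch reusing y, x1, x2 and x3 defined in the code above (the output should reproduce F0 ≈ 4.57):

fit.small <- lm(y ~ I(x1 + x2))     # omega_1: common coefficient for X1 and X2, beta_3 = 0
fit.full  <- lm(y ~ x1 + x2 + x3)   # omega_2: full model
anova(fit.small, fit.full)          # F test on 2 and 3 df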

4. To test $H_0: c'\beta=a$ where $a\neq 0$, we need to find a vector $z$ such that $c'z=a$. Consider the reparameterization $\beta^*=\beta-z$; then under $H_0$, $c'\beta^*=c'\beta-c'z=a-a=0$.

The null hypothesis can then be written as $H_0: c'\beta^*=0$. Let $y^*=y-Xz$ so that $y^*-X\beta^*=y-Xz-X(\beta-z)=y-X\beta$, implying $(y^*-X\beta^*)'(y^*-X\beta^*)=(y-X\beta)'(y-X\beta)$. Hence RSS$^*$ is just RSS.

§3.3 AIC and BIC

Akaike's information criterion (AIC) was first proposed in Akaike (1974) as a measure of the goodness of fit of a statistical model. It is grounded in the concept of entropy. The AIC is an operational way of trading off the complexity of a model against the goodness of model fit.

Definition: the AIC, in general, is
$$\text{AIC}=-2\ln L+2p$$
where $p$ is the number of parameters in the model, $L$ is the maximized likelihood function and $2p$ is the penalty for the parameters. This penalty discourages overfitting. The preferred model is the one with the lowest AIC value. The AIC methodology attempts to find the model that best explains the data with a minimum set of free parameters.

AICc is AIC with a second-order correction for small sample sizes:
$$\text{AIC}_c=\text{AIC}+\frac{2p(p+1)}{n-p-1}.$$
Since it converges to AIC as $n$ gets large, it should be used regardless of sample size (Burnham and Anderson 2004).

The AIC penalizes free parameters less strongly than does the Bayesian information criterion (BIC):
$$\text{BIC}=-2\ln L+(\ln n)p.$$
The models being compared need not be nested, unlike the case when models are being compared using $F$ or LR tests.

Example: AICs in the geometric distribution with $n=20$, $p=1$ and

$\ln L=-44.99$.
$$\text{AIC}=-2\ln L+2p=-2\times(-44.99)+2=91.96$$
$$\text{AIC}_c=\text{AIC}+\frac{2p(p+1)}{n-p-1}=91.96+\frac{2\cdot 1\cdot(1+1)}{20-1-1}=92.18$$
$$\text{BIC}=-2\ln L+(\ln n)p=-2\times(-44.99)+\ln(20)=92.96$$

Example: Mixture and censored models:

AICmix = −2 ln L + 2p = −2 × (−71.91) + 2(3) = 149.82

AICcen = −2 ln L + 2p = −2 × (−53.44) + 2(2) = 110.88
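These quantities are easy to reproduce in R; the following is a minimal sketch using the log-likelihood value quoted above for the geometric example (small differences from the figures above are due to rounding of ln L):

logL <- -44.99; n <- 20; p <- 1       # values taken from the geometric example above
AIC.val  <- -2 * logL + 2 * p
AICc.val <- AIC.val + 2 * p * (p + 1) / (n - p - 1)
BIC.val  <- -2 * logL + log(n) * p
c(AIC = AIC.val, AICc = AICc.val, BIC = BIC.val)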

§3.4 Checking for normal model

Check for the model assumptions of normality, constant variance and independence by

1. Normality: the boxplot of $r_i=y_i-x_i'\hat\beta$, $i=1,\dots,n$, is symmetric and the QQ plot of $r_i$ against $\Phi^{-1}\!\left(\frac{i}{n+1}\right)$ follows a straight line;

2. Constant variance: the residual plot of $r_i$ against $x_i$ or $y_i$ shows no systematic pattern;
3. Independence: residual plots show no systematic patterns.

To check for outliers, we look at the residuals:

$$R=Y-X\hat\beta=Y-X(X'X)^{-1}X'Y=[I-X(X'X)^{-1}X']Y=(I-H)Y$$
where $H=X(X'X)^{-1}X'$ is the projection matrix, and

$$R\sim N_n(0,\ \sigma^2(I-H))$$
since $E(R)=E[(I-H)Y]=(I-H)X\beta=X\beta-X(X'X)^{-1}X'X\beta=0$ and
$$\text{Var}(R)=(I-H)\text{Var}(Y)(I-H)=\sigma^2(I-H)(I-H)=\sigma^2(I-H)$$
since $(I-H)$ is also a projection matrix such that $(I-H)^2=I-2H+H^2=I-H$.

Hence $\text{Var}(r_i)=\sigma^2(1-h_{ii})$, $i=1,\dots,n$, where $h_{ii}=x_i'(X'X)^{-1}x_i$ is called the leverage of the $i$-th point. In the simple regression model,
$$h_{ii}=\frac{1}{n}+\frac{(x_i-\bar x)^2}{\sum_j(x_j-\bar x)^2} \qquad\text{(proof as exercise)}$$

and $\hat Y_j=h_{1j}Y_1+h_{2j}Y_2+\cdots+h_{nj}Y_n=\sum_{i=1}^n h_{ij}Y_i$ since $\hat Y=HY$. Note that $\text{trace}(H)=p$, the number of parameters in the model. Leverage, the potential to influence, depends only on the $X$ matrix and is bounded by two limits: $1/n$ and 1. The closer the leverage is to unity, the more leverage the value has. A point has high leverage (Belsley et al. 1980) if
$$h_{ii}>\frac{2p}{n} \qquad (\bar h=\tfrac{p}{n}).$$
For smaller samples, Velleman and Welsch (1981) suggested $3p/n$ as the criterion. If $x_i$ is far from $\bar x$, $h_{ii}$ will be large and the point has a strong effect on its fitted value. Hence points with large leverage are influential points as they draw the fitted line towards them. Note that
$$\text{RSS}=R'R=Y'(I-H)(I-H)Y=Y'(I-H)Y=Y'R=R'Y.$$
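The simple-regression leverage formula above can be checked numerically against R's hatvalues(); a small sketch with simulated data (illustrative only):

set.seed(3)
x <- rnorm(10); y <- 2 + 3 * x + rnorm(10)
fit <- lm(y ~ x)
h.byhand <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
cbind(hatvalues(fit), h.byhand)   # the two columns should agree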

[Scatterplot: outlier and influential points, marking point A (an outlier) and point B (an influential point).]

Point A is an outlier (w.r.t. $Y$) as it does not fit the pattern of the majority of the data. However, it is not influential as its effect is averaged out by neighboring points. Point B is not an outlier but it is influential as it is alone in a neighborhood of $x$. Neither will contaminate the regression.

Contamination = Outlying × Influential

Cases:

(a) Outlying but not influential. (b) Influential but not outlying. (c) Outlying and influential. Hence the regression is contaminated.

If an outlier is dropped, the fitted values and $\hat\beta$ may change substantially. One index for detecting single outliers is Cook's distance:

$$\begin{aligned}D_i &= (\hat Y_{(-i)}-\hat Y)'(\hat Y_{(-i)}-\hat Y)/(p\hat\sigma^2)\\ &= (\hat\beta_{(-i)}-\hat\beta)'X'X(\hat\beta_{(-i)}-\hat\beta)/(p\hat\sigma^2)\\ &= \frac{r_i^2}{(1-h_{ii})^2}\,\frac{h_{ii}}{p\hat\sigma^2} = \left[\frac{r_i}{\hat\sigma(1-h_{ii})^{1/2}}\right]^2\frac{h_{ii}}{1-h_{ii}}\,\frac{1}{p}\end{aligned} \qquad (3)$$
where $h_{ii}<1$ is assumed, $\dim(\beta)=p$, $\hat Y=X\hat\beta$, and $\hat Y_{(-i)}=X\hat\beta_{(-i)}$ is the vector of fitted values calculated ignoring the $i$-th data point $(x_i,y_i)$. The two factors in (3) are the square of the studentized residual and the ratio $\text{Var}(\hat Y_i)/\text{Var}(R_i)=h_{ii}/(1-h_{ii})$ respectively, which together determine the overall impact. The ratio measures the relative sensitivity of $\hat\beta$ to each data point. Studentizing the residual allows for the fact that the $\text{var}(r_i)=\sigma^2(1-h_{ii})$ differ across observations, even though the error variances $\sigma^2$ are all equal. The proof of (3) is given in the Appendix.


It is clear that Di is large if |ri| or hii is large given 0 < hii < 1. A point is classified as an outlier if

$$D_i > F_{p,\,n-p}(0.5),$$
the median of the $F_{p,\,n-p}$ distribution. Cook's distance is only useful in identifying single outliers. It can miss small groups of outlying values.

Example: (linear regression)
> fit= lm(y ~ x1 + x2 + x3)   #follow from previous R program
> cooks.distance(fit)
        1         2         3         4         5         6         7
0.3068182 0.3068182 0.3068182 0.3068182 0.1818182 0.2727273 0.1818182
> n=dim(X)[1]
> p=dim(X)[2]
> H=X%*%XXI%*%t(X)
> R=y-X%*%beta
> R
           [,1]
[1,] -0.5000000
[2,]  0.5000000
[3,]  0.5000000
[4,] -0.5000000
[5,] -0.6666667
[6,]  1.3333333
[7,] -0.6666667
> SSR=t(R)%*%R
> sigma2=SSR/3   #n=7, p=4
> sigma2
         [,1]
[1,] 1.222222
> D=c(rep(0,7))
> for (i in 1:7){ D[i]=R[i]^2*H[i,i]/((1-H[i,i])^2*4*sigma2) }
> D
[1] 0.3068182 0.3068182 0.3068182 0.3068182 0.1818182 0.2727273 0.1818182
> h=c(rep(0,7))


> for (i in 1:7){ h[i]=H[i,i] } > h [1] 0.6666667 0.6666667 0.6666667 0.6666667 0.5000000 0.3333333 0.5000000 > hlimit=2*p/n > Dlimit=qf(0.5,p,n-p) > c(hlimit,Dlimit) [1] 1.142857 1.063226 Hence there is no influential point and outlier.

§3.5 Multicollinearity - Ridge regression

If $Y=X\beta+\epsilon$, $\epsilon\sim N_n(0,\sigma^2 I)$, then the estimators $\hat\beta_1,\dots,\hat\beta_p$ are independent iff $X'X$ is diagonal. However, if $X'X$ is close to singular then at least one of the eigenvalues of $X'X$ is close to 0. Since $\text{Var}(\hat\beta)=\sigma^2(X'X)^{-1}$, the variances associated with the $\hat\beta_j$ can be quite large, giving unstable parameter estimates; that is, the estimates $\hat\beta=(X'X)^{-1}X'Y$ may not make sense (too large or of the wrong sign) in terms of the physical situation. This happens when the columns of $X$ are nearly linearly dependent, so that at least one of the $X$-variables is redundant. We can proceed by

1. variable selection (dropping one or more variables from the model),
2. collecting additional data,
3. reparametrizing the model,
4. using principal components regression, or
5. using ridge regression to estimate the parameters.

Ridge regression was introduced by Hoerl, A.E. and Kennard, R. (Technometrics 1970, 12, 55-67). The ridge estimator is found by solving
$$(X'X+kI)\beta^*(k)=X'Y$$
for some constant $k>0$. This effectively adds $k$ to all the eigenvalues of $X'X$: if $\lambda$ and $u$ are an eigenvalue and eigenvector of $X'X$ then $(X'X+kI)u=\lambda u+ku=(\lambda+k)u$.

If $X'X$ has eigenvalues $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_p$ with corresponding orthogonal normalized eigenvectors $u_1,u_2,\dots,u_p$, then $(X'X+kI)$ has eigenvalues $\lambda_1+k,\dots,\lambda_p+k$ and the same set of eigenvectors $u_1,u_2,\dots,u_p$. Hence its inverse has eigenvalues $\frac{1}{\lambda_1+k},\dots,\frac{1}{\lambda_p+k}$ and the same set of eigenvectors, that is,
$$(X'X+kI)^{-1}=\sum_{i=1}^p(\lambda_i+k)^{-1}u_iu_i'.$$
This implies
$$E(\beta^*(k))=(X'X+kI)^{-1}X'X\beta=\left[\sum_{i=1}^p(\lambda_i+k)^{-1}u_iu_i'\right]\left[\sum_{j=1}^p\lambda_ju_ju_j'\right]\beta=\sum_{i=1}^p\frac{\lambda_i}{\lambda_i+k}\,u_iu_i'\,\beta$$
as $u_i'u_j=\delta_{ij}$, where $\delta_{ij}=1$ if $i=j$ and 0 otherwise, since the $u_i$ are orthogonal. Hence $\beta^*(k)$ is biased. Also
$$\text{var}(\beta^*(k))=\sigma^2(X'X+kI)^{-1}X'X(X'X+kI)^{-1}=\sigma^2\sum_{i=1}^p\frac{\lambda_i}{(\lambda_i+k)^2}u_iu_i'\to 0 \ \text{ as } k\to\infty.$$
As $k$ increases, $\beta^*(k)$ becomes smaller in absolute value and tends to zero as $k$ tends to infinity. Although $\beta^*(k)$ is biased, it can provide smaller overall mean square error, as the reduction in variance compensates for the bias. Note that
$$\beta^*(k)=(X'X+kI)^{-1}(X'Y),$$
$$E(\beta^*(k))=(X'X+kI)^{-1}(X'X)\beta=[I+k(X'X)^{-1}]^{-1}\beta=\left\{I-k[X'X+kI]^{-1}\right\}\beta=\beta-k[X'X+kI]^{-1}\beta. \qquad (4)$$

Hence the bias is $-k[X'X+kI]^{-1}\beta$. The proof of (4) is in the Appendix.
$$\begin{aligned}\text{MSE}(\beta^*) &= E[(\beta^*-\beta)'(\beta^*-\beta)] = E\{[\beta^*-E(\beta^*)]'[\beta^*-E(\beta^*)]\}+[E(\beta^*)-\beta]'[E(\beta^*)-\beta]\\ &= \text{Var}(\beta^*)+k^2\sum_{i=1}^p\frac{\gamma_i^2}{(\lambda_i+k)^2} \qquad\text{where } \gamma_i=u_i'\beta\\ &= \sigma^2\sum_{i=1}^p\frac{\lambda_i}{(\lambda_i+k)^2}+k^2\sum_{i=1}^p\frac{\gamma_i^2}{(\lambda_i+k)^2}\\ &= \sigma^2\sum_{i=1}^p\lambda_i^{-1} \qquad\text{on setting } k=0.\end{aligned}$$
Theorem: There always exists a $k>0$ such that

$$\text{MSE}(\beta^*(k))<\text{MSE}(\hat\beta).$$
Proof: Since $\text{MSE}(\beta^*(0))=\text{MSE}(\hat\beta)$, the result follows if $\text{MSE}(\beta^*(k))$ is decreasing in $k$ for $k>0$ near 0, that is, $\text{MSE}'(\beta^*(k))<0$ for $k>0$ near 0.

$$\begin{aligned}\text{MSE}'(\beta^*(k)) &= \sigma^2\sum_{i=1}^p\frac{-2\lambda_i}{(\lambda_i+k)^3}+\sum_{i=1}^p\frac{\gamma_i^2[2k(\lambda_i+k)^2-2k^2(\lambda_i+k)]}{(\lambda_i+k)^4}\\ &= \sigma^2\sum_{i=1}^p\frac{-2\lambda_i}{(\lambda_i+k)^3}+\sum_{i=1}^p\frac{2k\lambda_i\gamma_i^2}{(\lambda_i+k)^3}\\ &= \sum_{i=1}^p\frac{2\lambda_i(k\gamma_i^2-\sigma^2)}{(\lambda_i+k)^3}\\ &< 0 \ \text{ if } k<\frac{\sigma^2}{\gamma_i^2}\ \forall i \;\Rightarrow\; k<\frac{\sigma^2}{\max_{i=1,\dots,p}\gamma_i^2} \qquad (5)\end{aligned}$$
How do we choose $k$?

The average of the $\gamma_i^2$ is
$$\frac{1}{p}\sum_{i=1}^p\gamma_i^2=\frac{1}{p}\sum_{i=1}^p\beta'u_iu_i'\beta=\beta'\beta/p$$
since the matrix $\sum_{i=1}^pu_iu_i'$ changes the 'direction' but not the magnitude $\beta'\beta$, the squared length $\|\beta\|^2$. So from (5), a reasonable first guess for $k$ is
$$\hat k=\frac{\hat\sigma^2}{\frac{1}{p}\sum_i\gamma_i^2}=\frac{p\hat\sigma^2}{\hat\beta'\hat\beta}$$
where $\hat\sigma^2$ is the residual mean square; $\hat k$ is optimal in some circumstances. An alternative way is to plot $\beta^*(k)$ against $k$ and choose $k$ so that the estimates are fairly stable and have the right signs. These plots are called ridge traces. Once $k$ is large enough, the system behaves like an orthogonal system.

> p=dim(X)[2] > cor(X) [,1] [,2] [,3] [,4] [1,] 1 NA NA NA [2,] NA 1 0.0000000 0.0000000 [3,] NA 0 1.0000000 -0.4714045 [4,] NA 0 -0.4714045 1.0000000 Warning message: the standard is zero in: cor(x,y,na.method,method=="kendall") > out=matrix(0,51,5) > for (i in 1:51){ + k=i-1 + XXk=t(X)%*%X+diag(rep(k,p)) + betah=solve(XXk)%*%XY + out[i,]=c(k,betah[1],betah[2],betah[3],betah[4]) + } >


> out [,1] [,2] [,3] [,4] [,5] [1,] 0 3.6666667 1.00000000 3.0000000 1.8333333 [2,] 1 3.3333333 0.80000000 2.8000000 1.7333333 [3,] 2 3.0769231 0.66666667 2.6153846 1.6153846 ... [49,] 48 0.7024499 0.07692308 0.6296956 0.3690423 [50,] 49 0.6908908 0.07547170 0.6194367 0.3629516 [51,] 50 0.6797061 0.07407407 0.6095065 0.3570588 > par(mfrow=c(2,2)) > plot(out[,1],out[,2],xlab="k",ylab="beta*",ylim=c(0,4), xlim=c(0,51),pch=20,col="green") > points(out[,1],out[,3],pch=20,col="blue") > points(out[,1],out[,4],pch=20,col="red") > title("Ridge Traces") > khat=p*sigma2/(t(beta)%*%beta) > khat [,1] [1,] 0.1823834 > ridge1 <- ridge(X,y,k=khat) > ridge1$coefficients [,1] [1,] 3.5950583 [2,] 0.9563925 [3,] 2.9638084 [4,] 1.8218719


[Figure: ridge traces, plotting the components of β*(k) against k.]


Shrinkage towards 0: $\beta^{*\prime}\beta^*\leq\hat\beta'\hat\beta$, since
$$\beta^*=(X'X+kI)^{-1}X'Y=(X'X+kI)^{-1}X'X\hat\beta$$
$$\begin{aligned}\beta^{*\prime}\beta^* &= \hat\beta'X'X(X'X+kI)^{-2}X'X\hat\beta = \hat\beta'\left[\sum_{i=1}^p\left(\frac{\lambda_i}{\lambda_i+k}\right)^2u_iu_i'\right]\hat\beta \qquad\text{since the } u_i \text{ are orthogonal}\\ &\leq \left(\frac{\lambda_1}{\lambda_1+k}\right)^2\hat\beta'\left[\sum_{i=1}^pu_iu_i'\right]\hat\beta \qquad\text{as } \lambda_1\geq\lambda_2\geq\cdots\geq\lambda_p\\ &\leq \left(\frac{\lambda_1}{\lambda_1+k}\right)^2\hat\beta'\hat\beta \;\leq\; \hat\beta'\hat\beta.\end{aligned}$$
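A quick numerical illustration of this shrinkage, reusing X, XY and the OLS estimate beta computed in the R code for the linear regression example earlier (k = 5 is an arbitrary illustrative value):

k <- 5
betak <- solve(t(X) %*% X + k * diag(ncol(X))) %*% XY   # ridge estimate beta*(k)
c(ridge = sum(betak^2), OLS = sum(beta^2))              # the ridge squared length should be smaller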

§3.6 Model fitting

Example: (Program effort) The data, for 20 countries in Latin America and the Caribbean, include:

1. Outcomes: the percent decline in crude birth rate (CBR) between 1965 and 1975.
2. Predictor: an index of social setting that includes 7 social indicators: literacy, school enrollment, life expectancy, infant mortality, percent of males aged 15-64 in the non-agricultural labor force, gross national product per capita and percent of population living in urban areas; higher scores represent higher socio-economic levels.
3. Predictor: an index of family planning effort that includes 15 program indicators, including the existence of an official family planning policy, the availability of contraceptive methods, and the structure of the family planning program; 0 denotes the absence of a program, 1-9 indicates weak programs, 10-19 represents moderate efforts and 20 or more denotes fairly strong programs.

The index of family planning effort combines these program indicators.

Table: The Program effort data

Country         Social   Fam.-plan   CBR       Country            Social   Fam.-plan   CBR
                setting  effort      decline                      setting  effort      decline
Bolivia         46       0           1         Haiti              35       3           0
Brazil          74       0           10        Honduras           51       7           7
Chile           89       16          29        Jamaica            87       23          21
Colombia        77       16          25        Mexico             83       4           9
CostaRica       84       21          29        Nicaragua          68       0           7
Cuba            89       15          40        Panama             84       19          22
Dominican Rep   68       14          21        Paraguay           74       3           6
Ecuador         70       6           0         Peru               73       0           2
El Salvador     60       13          13        Trinidad-Tobago    84       15          29
Guatemala       55       9           4         Venezuela          91       7           11


Types of mean function µi:

§3.6.1 For continuous covariates 1. Multiple regression model:

µi = α + β1xi1 + β2xi2 + ··· + βpxip It adjusts for the effects of some covariates. 2. Higher order polynomial terms:

$$\mu_i=\alpha+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_px_i^p$$
or centering the variables, using $(x_i-\bar x)^2$, to reduce the correlation between $X$ and $X^2$.

3. Interaction (cross-product) term:

µi = α + β1x1i + β2x2i + β12x1ix2i After rearranging,

µi = (α + β1x1i) + (β2 + β12x1i)x2i.

So both the intercept and the slope (the effect of X2 on Y ) are linear functions of X1. The slope starts from a baseline effect of β2 when X1 is zero, and has an additional effect of β12 units for each unit increase in X1.

Example: (Program effort)

1. Simple and partial correlations:

To find the correlation between $Y$ and $X_2$ adjusting for $X_1$, calculate the residuals when $Y$ is regressed on $X_1$ and when $X_2$ is regressed on $X_1$. The partial correlation coefficient is Pearson's $r$ between the two sets of residuals:
$$r_{y2.1}=\frac{r_{y2}-r_{y1}r_{12}}{\sqrt{1-r_{y1}^2}\,\sqrt{1-r_{12}^2}}$$

Table: Simple and partial correlations of CBR decline data Predictor Simple Partial Social setting 0.673 0.519 Family planning effort 0.801 0.722 The effect of effort is more pronounced and more resilient to adjust- ment than the effect of setting. > cor(CBR,social) [1] 0.6732032 > cor(CBR,family) [1] 0.80083 > pcor <- function(v1, v2, v3) #corr between v1 & v2 adjust for v3 + { + c12 <- cor(v1,v2) + c23 <- cor(v2,v3) + c13 <- cor(v1,v3) + partial <- (c12-(c13*c23))/(sqrt(1-(c13^2)) * sqrt(1-(c23^2))) + return(partial) + } > pcor(CBR,social,family) [1] 0.5195105 > pcor(CBR,family,social) [1] 0.7218626


§3.6.2 For categorical covariates

1. One-way ANOVA model:
$$\mu_{ij}=\mu+\alpha_i, \qquad \sum_i\alpha_i=0 \ \text{ or } \ \alpha_1=0,$$

for the $j$-th outcome in level $i$ of a factor, where $\alpha_i$ denotes the effect of level $i$.

Constraints: if we set $\alpha_1=0$, the first level is chosen as the reference. Then $\hat\mu=\bar y_{1\cdot}$ and $\hat\alpha_j=\bar y_{j\cdot}-\bar y_{1\cdot}$. If $\sum_j\alpha_j=0$, $\mu$ is the overall expected response and $\alpha_j$ is the difference of level $j$ from the overall mean. Then $\hat\mu=\bar y_{\cdot\cdot}$ and $\hat\alpha_j=\bar y_{j\cdot}-\bar y_{\cdot\cdot}$.

2. Two-way ANOVA model:

µij = µ + αi + βj

where µ is a baseline value, αi is the effect of the level i of the row factor and βj is the effect of the level j of the column factor.

Constraints: if we set $\alpha_1=\beta_1=0$ then $\hat\mu=\bar y_{11\cdot}$, $\hat\alpha_i=\bar y_{i1\cdot}-\bar y_{11\cdot}$ and $\hat\beta_j=\bar y_{1j\cdot}-\bar y_{11\cdot}$. Any cell $(i,j)$ other than $(1,1)$ can be the reference cell.

Table: The Two-Factor Additive Model

Row   Column 1      Column 2          Column 3
1     µ             µ + β2            µ + β3
2     µ + α2        µ + α2 + β2       µ + α2 + β3
3     µ + α3        µ + α3 + β2       µ + α3 + β3

In matrix form, the $n\times[1+(R-1)+(C-1)]$ design matrix $X$ ($R=C=3$), where the entries 1 and 0 stand for vectors of ones and zeros of dimension $n_{ij}$ ($\sum_{ij}n_{ij}=n$), is
$$\mu=X\beta \;\Rightarrow\; \begin{pmatrix}\mu_{11}\\\mu_{12}\\\mu_{13}\\\mu_{21}\\\mu_{22}\\\mu_{23}\\\mu_{31}\\\mu_{32}\\\mu_{33}\end{pmatrix}=\begin{pmatrix}1&0&0&0&0\\1&0&0&1&0\\1&0&0&0&1\\1&1&0&0&0\\1&1&0&1&0\\1&1&0&0&1\\1&0&1&0&0\\1&0&1&1&0\\1&0&1&0&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_2\\\alpha_3\\\beta_2\\\beta_3\end{pmatrix}$$
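The same coding is produced by R's model.matrix() with the default treatment contrasts; a small sketch with two hypothetical three-level factors (one observation per cell):

row.f <- factor(rep(1:3, each  = 3))   # row factor, levels 1-3
col.f <- factor(rep(1:3, times = 3))   # column factor, levels 1-3
model.matrix(~ row.f + col.f)          # columns: intercept (mu), row.f2, row.f3, col.f2, col.f3
# setting options(contrasts = c("contr.sum", "contr.poly")) instead uses the
# sum-to-zero constraints discussed above.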

Example: (Program effort)

Social setting classes: low (< 70), medium (70-79) & high (≥ 80) Family planning effort classes: weak (0-4), moderate (5-14) & strong (≥ 15).

Table: CBR decline by social setting and family planning effort

Table: CBR decline by social setting and family planning effort

Setting   Effort: Weak   Moderate       Strong                    n_i.
Low       1, 0, 7        21, 13, 4, 7   -                         7
Medium    10, 6, 2       0              25                        5
High      9              11             29, 29, 40, 21, 22, 29    8
n_.j      7              6              7                         20

We fit the two-factor ANOVA model:

2 µij = µ + αi + βj,Yijk ∼ N (µij, σ )

> dat=read.csv("data/CBR.csv") > attach(dat) > summary(lm(CBR~socialc+familyc))


Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.603 4.270 2.717 0.01590 * socialclow -2.388 4.457 -0.536 0.59999 socialcmedium -4.068 4.216 -0.965 0.34989 familycstrong 16.835 4.588 3.669 0.00228 ** familycweak -3.836 3.575 -1.073 0.30014 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 6.188 on 15 degrees of freedom
Multiple R-Squared: 0.7833, Adjusted R-squared: 0.7255
F-statistic: 13.55 on 4 and 15 DF, p-value: 7.19e-05

The parameters are $\mu_{32}=\mu+\alpha_3+\beta_2=11.603$, $\alpha_1^*=\alpha_1-\alpha_3=-2.388$, $\alpha_2^*=\alpha_2-\alpha_3=-4.068$, $\beta_3^*=\beta_3-\beta_2=16.835$ and $\beta_1^*=\beta_1-\beta_2=-3.836$.

(a) Fitted means after adjusting for the other factors

Table: Fitted means based on the two-factor additive model

Setting   Effort: Weak             Moderate           Strong                   All
Low       5.38  (µ32+α1*+β1*)      9.22  (µ32+α1*)    26.05  (µ32+α1*+β3*)     13.77
Medium    3.70  (µ32+α2*+β1*)      7.54  (µ32+α2*)    24.37  (µ32+α2*+β3*)     12.08
High      7.77  (µ32+β1*)          11.60 (µ32)        28.44  (µ32+β3*)         16.15
All       5.91                     9.75               26.59                    14.30

The column and row means are weighted averages respectively:
$$\hat\mu_{\cdot j}=\sum_i n_{i\cdot}\hat\mu_{ij}/n \qquad\text{and}\qquad \hat\mu_{i\cdot}=\sum_j n_{\cdot j}\hat\mu_{ij}/n$$

E.g. $\hat\mu_{\cdot 1}=[5.38(7)+3.70(5)+7.77(8)]/20=5.91$.

Table: CBR decline by family planning effort before and after adjustment for social setting

Effort      CBR Decline
            Unadjusted   Adjusted
Weak        5.00         5.91
Moderate    9.33         9.75
Strong      27.86        26.59

> summary(lm(CBR~familyc)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.333 2.446 3.816 0.00138 ** familycstrong 18.524 3.333 5.557 3.47e-05 *** familycweak -4.333 3.333 -1.300 0.21093 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.991 on 17 degrees of freedom Multiple R-Squared: 0.7698, Adjusted R-squared: 0.7427 F-statistic: 28.42 on 2 and 17 DF, p-value: 3.790e-06

3. ANOVA model with interactions:

µij = µ + αi + βj + γij.

The term γij is an interaction effect. With R rows and C columns for factors, we introduce 1 + (R − 1) + (C − 1) + (R − 1)(C − 1) parameters.

Constraints: we set all parameters in the first row and the first column to zero, i.e. $\alpha_1=\beta_1=\gamma_{1j}=\gamma_{i1}=0$. The structure of the means in a three-by-three layout is

Table: Two-factor model with interactions

Row   Column 1      Column 2              Column 3
1     µ             µ + β2                µ + β3
2     µ + α2        µ + α2 + β2 + γ22     µ + α2 + β3 + γ23
3     µ + α3        µ + α3 + β2 + γ32     µ + α3 + β3 + γ33

The interaction term γij is the additional effect of level i of the row factor, compared to level 1, when the column factor is at level j rather than 1. Then

$$\hat\mu=\bar y_{11}, \quad \hat\alpha_2=\bar y_{21}-\bar y_{11}, \quad \hat\alpha_3=\bar y_{31}-\bar y_{11}, \quad \hat\beta_2=\bar y_{12}-\bar y_{11}, \quad \hat\beta_3=\bar y_{13}-\bar y_{11},$$
$$\hat\gamma_{22}=\bar y_{22}-\bar y_{12}-\bar y_{21}+\bar y_{11}, \qquad \hat\gamma_{23}=\bar y_{23}-\bar y_{13}-\bar y_{21}+\bar y_{11},$$
$$\hat\gamma_{32}=\bar y_{32}-\bar y_{12}-\bar y_{31}+\bar y_{11} \qquad\text{and}\qquad \hat\gamma_{33}=\bar y_{33}-\bar y_{13}-\bar y_{31}+\bar y_{11}.$$
The design matrix $X$ has a set of additional $(R-1)(C-1)$ dummy variables to represent the interactions. In matrix form, the $n\times RC$ design matrix $X$ ($R=C=3$), where the entries 1 and 0 stand for vectors of ones and zeros


of dimension $n_{ij}$, is
$$\mu=X\beta \;\Rightarrow\; \begin{pmatrix}\mu_{11}\\\mu_{12}\\\mu_{13}\\\mu_{21}\\\mu_{22}\\\mu_{23}\\\mu_{31}\\\mu_{32}\\\mu_{33}\end{pmatrix}=\begin{pmatrix}1&0&0&0&0&0&0&0&0\\1&0&0&1&0&0&0&0&0\\1&0&0&0&1&0&0&0&0\\1&1&0&0&0&0&0&0&0\\1&1&0&1&0&1&0&0&0\\1&1&0&0&1&0&1&0&0\\1&0&1&0&0&0&0&0&0\\1&0&1&1&0&0&0&1&0\\1&0&1&0&1&0&0&0&1\end{pmatrix}\begin{pmatrix}\mu\\\alpha_2\\\alpha_3\\\beta_2\\\beta_3\\\gamma_{22}\\\gamma_{23}\\\gamma_{32}\\\gamma_{33}\end{pmatrix}$$

If, on the other hand, we set $\sum_i\alpha_i=0$, $\sum_j\beta_j=0$, $\sum_i\gamma_{ij}=0$ and $\sum_j\gamma_{ij}=0$, then

$$\hat\mu=\bar y_{\cdots}, \qquad \hat\alpha_i=\bar y_{i\cdot\cdot}-\bar y_{\cdots}, \qquad \hat\beta_j=\bar y_{\cdot j\cdot}-\bar y_{\cdots}, \qquad \hat\gamma_{ij}=\bar y_{ij\cdot}-\bar y_{i\cdot\cdot}-\bar y_{\cdot j\cdot}+\bar y_{\cdots}.$$

Example: (Program effort)

> summary(lm(CBR~socialc*familyc)) Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|)

(intercept) 11.000 6.196 1.775 0.1012 socialclow 0.250 6.928 0.036 0.9718 socialcmedium -11.000 8.763 -1.255 0.2333 familycstrong 17.333 6.693 2.590 0.0237 * familycweak -2.000 8.763 -0.228 0.8233 socialclow:familycstrong NA NA NA NA socialcmedium:familycstrong 7.667 11.027 0.695 0.5001 socialclow:familycweak -6.583 9.959 -0.661 0.5211 socialcmedium:familycweak 8.000 11.313 0.707 0.4930 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 6.196 on 12 degrees of freedom Multiple R-Squared: 0.8261, Adjusted R-squared: 0.7247 F-statistic: 8.146 on 7 and 12 DF, p-value: 0.0009208

> summary(aov(CBR~socialc*familyc)) Df Sum Sq Mean Sq F value Pr(>F) socialc 2 1193.79 596.89 15.5458 0.0004664 *** familyc 2 882.02 441.01 11.4859 0.0016322 ** socialc:familyc 3 113.64 37.88 0.9866 0.4317952 Residuals 12 460.75 38.40 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

For the empty cell (Low, Strong), the corresponding interaction parameter cannot be estimated and is set to zero. The final model has 8 parameters: the constant, 2 setting effects, 2 effort effects, and 3 (not 4) interaction terms.

To test for the interaction, F0 = 0.987 on 3 and 12 d.f. and is clearly not significant.


§3.6.3 Both continuous and categorical covariates

Some predictors are continuous and some are discrete factors.

1. ANCOVA model without interaction:

µij = µ + αi + βxij

where j = 1, . . . , ni and i = 1, . . . , k. We add the constraint α1 = 0, so again µ becomes the intercept for the reference cell, and αi becomes the difference in intercepts between levels i and 1.

This model defines a series of straight-line regressions, one for each level of the discrete factor and each with different intercepts µ + αi, but a common slope β, so they are parallel.

In matrix form, the $n\times 4$ design matrix $X$ ($k=3$ levels), where 1, 0 and $x_i$ stand for vectors of ones, zeros and the $x_{ij}$ respectively of dimension $n_i$ ($\sum_i n_i=n$), is
$$\mu=X\beta \;\Rightarrow\; \begin{pmatrix}\mu_1\\\mu_2\\\mu_3\end{pmatrix}=\begin{pmatrix}1&0&0&x_1\\1&1&0&x_2\\1&0&1&x_3\end{pmatrix}\begin{pmatrix}\mu\\\alpha_2\\\alpha_3\\\beta\end{pmatrix}$$
The fundamental difference between treating a predictor as continuous or as categorical lies in the assumption of linearity. If the assumption of linearity fails, we should use transformations or higher-order polynomial terms, resulting in models which are often harder to interpret.

On the other hand, by grouping the predictor into categories, it is easier to interpret but the information in the data is not fully utilized. In the example, social setting explained 45% of the variation in CBR declines when treated as a variate and as a factor, suggesting linearity

of the data. However, family planning effort explained 64% and 77% respectively when treated as a variate and as a factor, suggesting non-linearity: CBR decline changes little from weak to moderate programs, but rises steeply for strong programs.

> summary(lm(CBR~social)) ... Residual standard error: 8.973 on 18 degrees of freedom Multiple R-squared: 0.4532, Adjusted R-squared: 0.4228 F-statistic: 14.92 on 1 and 18 DF, p-value: 0.001141

> summary(lm(CBR~socialc)) ... Residual standard error: 9.256 on 17 degrees of freedom Multiple R-squared: 0.4505, Adjusted R-squared: 0.3858 F-statistic: 6.967 on 2 and 17 DF, p-value: 0.006167

> summary(lm(CBR~family)) ... Residual standard error: 7.267 on 18 degrees of freedom Multiple R-squared: 0.6413, Adjusted R-squared: 0.6214 F-statistic: 32.19 on 1 and 18 DF, p-value: 2.216e-05

> summary(lm(CBR~familyc)) ... Residual standard error: 5.991 on 17 degrees of freedom Multiple R-squared: 0.7698, Adjusted R-squared: 0.7427 F-statistic: 28.42 on 2 and 17 DF, p-value: 3.79e-06

Example: (Program effort)

Table: CBR declines and social setting scores by family planning effort

                      Family Planning Effort
Weak               Moderate           Strong
Setting   CBR      Setting   CBR      Setting   CBR
46        1        68        21       89        29
74        10       70        0        77        25
35        0        60        13       84        29
83        9        55        4        89        40
68        7        51        7        87        21
74        6        91        11       84        22
73        2                           84        29

(a) Model fit: > summary(lm(CBR~social+familyc)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.8101 7.3322 -0.247 0.808146 social 0.1693 0.1056 1.604 0.128343 familycstrong 15.3037 3.7685 4.061 0.000908 *** familycweak -4.1439 3.1912 -1.299 0.212502 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.732 on 16 degrees of freedom Multiple R-Squared: 0.8016, Adjusted R-squared: 0.7644 F-statistic: 21.55 on 3 and 16 DF, p-value: 7.262e-06

There is a further 0.17 percentage-point decline in CBR for each unit increase in social setting, at any level of family planning effort. Countries with moderate and strong programs show additional CBR declines of about 4 and 19 (15.3 + 4.1) percentage points respectively, compared with countries with weak programs at the same social setting.


[Figure: CBR decline plotted as a function of social setting for weak (w), moderate (m) and strong (s) programs, with parallel fitted lines.]

> par=lm(CBR~social+familyc)$coeff
> names(par) = NULL
> par
[1] -1.8101209  0.1692677 15.3036938 -4.1439148
> mu=par[1]
> beta=par[2]
> alpha1=par[4]
> alpha2=0
> alpha3=par[3]
> CBR1=CBR[familyc=="w"]
> CBR2=CBR[familyc=="m"]
> CBR3=CBR[familyc=="s"]
> social1=social[familyc=="w"]
> social2=social[familyc=="m"]
> social3=social[familyc=="s"]
> par(mfrow=c(2,2))
> plot(social1, CBR1, xlab="social", ylab="CBR decline", ylim=c(-2,42), xlim=c(30,102), pch="w", col="green")
> points(social2, CBR2, pch="m", col="blue")
> points(social3, CBR3, pch="s", col="red")
> abline(mu+alpha1, beta, col="green")
> abline(mu+alpha2, beta, col="blue")
> abline(mu+alpha3, beta, col="red")


µij = µ + αi + βxij + γixij = (µ + αi) + (β + γi)xij

where each level has its own intercept µ + αi and slope β + γi. The reference cell constraints are to set α1 = γ1 = 0. As a result, µ is the constant, β is the slope for the reference cell and αi and γi are the differences in intercept and slope, respectively, between level i and 1 of the discrete factor when the covariate xij is zero.

Example: (Program effort)

(a) Model fit > summary(lm(CBR~social*familyc)) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.96478 12.48141 0.238 0.816 social 0.09674 0.18596 0.520 0.611 familycstrong -29.43978 51.44526 -0.572 0.576 familycweak -9.84465 15.57524 -0.632 0.538 social:familycstrong 0.54354 0.61627 0.882 0.393 social:familycweak 0.08684 0.23258 0.373 0.714

Residual standard error: 5.959 on 14 degrees of freedom Multiple R-Squared: 0.8124, Adjusted R-squared: 0.7454 F-statistic: 12.13 on 5 and 14 DF, p-value: 0.0001112

The effect of social setting (the slope) is nearly the same for countries with weak and moderate programs (a difference of only 0.09), but is more pronounced in countries with strong programs (0.54 − 0.09 = 0.45 steeper than for weak programs). The slopes are 0.19 (0.10 + 0.09), 0.10 and 0.64 (0.10 + 0.54) respectively for weak, moderate and strong family planning effort. The intercepts are −6.88 (2.96 − 9.84), 2.96 and −26.48 (2.96 − 29.44). Moreover, the effect of strong programs compared to weak ones is more pronounced at higher levels of social setting. E.g. strong programs show 13.57 [(−26.48 + 6.88) + 72.1(0.64 − 0.19)] percentage points more CBR decline than weak programs at an average level of social setting (72.1).

(b) ANCOVA table and F test: The t ratios suggest that none of these interactions is significant. To test the hypothesis of parallelism (or no interaction) we need

to consider the joint significance of the two coefficients representing the differences in slopes, i.e. we need to test $H_0:\gamma_2=\gamma_3=0$. This is easily done by comparing this model, which has a RSS of 497.1 on 14 d.f., with the parallel-lines model, which had a RSS of 525.7 on 16 d.f. Hence
$$F_0=\frac{28.6/2}{497.1/14}=0.402736<3.7389=F_{0.95,\,2,\,14}.$$
Hence the interaction is not significant.
> summary(aov(CBR~social+familyc))
                Df  Sum Sq Mean Sq F value    Pr(>F)
social           1 1201.08 1201.08  36.556 1.698e-05 ***
familyc          2  923.43  461.71  14.053 0.0002999 ***
Residuals       16  525.69   32.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> summary(aov(CBR~social*familyc))
                Df  Sum Sq Mean Sq F value    Pr(>F)
social           1 1201.08 1201.08 33.8263 4.476e-05 ***
familyc          2  923.43  461.71 13.0034 0.0006426 ***
social:familyc   2   28.59   14.30  0.4026 0.6760530
Residuals       14  497.10   35.51
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure: CBR decline against social setting for weak (w), moderate (m) and strong (s) programs, with fitted lines from the parallel-lines model and from the separate-slopes (interaction) model.]


> par=lm(CBR~social*familyc)$coeff > names(par) = NULL > par [1] 2.96477845 0.09673754 -29.43977845 -9.84464654 0.54354024 [6] 0.08683658 > mu=par[1] > beta=par[2] > alpha1=par[4] > alpha2=0 > alpha3=par[3] > gamma1=par[6] > gamma2=0 > gamma3=par[5] > plot(social1, CBR1, xlab="social", ylab="CBR decline", ylim=c(-2,42), xlim=c(30,102), pch="w", col="green") > points(social2, CBR2, pch="m", col="blue") > points(social3, CBR3, pch="s", col="red") > abline(mu+alpha1, beta+gamma1, col="green") > abline(mu+alpha2, beta+gamma2, col="blue") > abline(mu+alpha3, beta+gamma3, col="red")

Summary of mean functions

Type of term           Algebraic term   Model formula
Continuous covariate   βx               X
Factor                 α_i              A
Mixed                  α_i x            A.X
Compound               (αγ)_ij          A.B
Compound mixed         (αγ)_ij x        A.B.X

The following are expansions of different model formulae:

A * B * C   = A + B + C + A.B + A.C + B.C + A.B.C
A * B + C   = A + B + C + A.B
A * B.C     = A + B.C + A.B.C
A * (B + C) = A * B + A * C = A + B + A.B + C + A.C
A / B       = A + A.B
(A * B) / C = A * B + A.B.C = A + B + A.B + A.B.C
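A few of these expansions can be checked with R's formula parser (a sketch, not from the notes; R writes the interaction operator as ":" where the notes use "."):

attr(terms(~ A * B * C), "term.labels")   # A, B, C, A:B, A:C, B:C, A:B:C
attr(terms(~ A * B + C), "term.labels")   # A, B, C, A:B
attr(terms(~ A / B),     "term.labels")   # A, A:B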

Appendix

A. Let $\hat\beta_{(-i)}=(X_{(-i)}'X_{(-i)})^{-1}(X_{(-i)}'Y_{(-i)})$ and $\hat\beta=(X'X)^{-1}(X'Y)$, where $X_{(-i)}$ and $Y_{(-i)}$ are the matrix and vector obtained by removing the $i$-th row $x_i'$ from $X$ and the $i$-th entry $y_i$ from $Y$. We assume that $X_{(-i)}$ has full rank $p$. Then

$$\begin{aligned}\hat\beta &= (X'X)^{-1}(X'Y) = (X_{(-i)}'X_{(-i)}+x_ix_i')^{-1}(X_{(-i)}'Y_{(-i)}+x_iy_i)\\ &= \left[(X_{(-i)}'X_{(-i)})^{-1}-(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'(X_{(-i)}'X_{(-i)})^{-1}/(1+c)\right](X_{(-i)}'Y_{(-i)}+x_iy_i)\\ &= (X_{(-i)}'X_{(-i)})^{-1}X_{(-i)}'Y_{(-i)}+(X_{(-i)}'X_{(-i)})^{-1}x_iy_i\\ &\qquad -(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'(X_{(-i)}'X_{(-i)})^{-1}X_{(-i)}'Y_{(-i)}/(1+c)\\ &\qquad -(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'(X_{(-i)}'X_{(-i)})^{-1}x_iy_i/(1+c)\\ &= \hat\beta_{(-i)}-(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'\hat\beta_{(-i)}/(1+c)+(X_{(-i)}'X_{(-i)})^{-1}x_iy_i-(X_{(-i)}'X_{(-i)})^{-1}x_icy_i/(1+c)\\ &= [I-(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'/(1+c)]\hat\beta_{(-i)}+(X_{(-i)}'X_{(-i)})^{-1}x_iy_i/(1+c)\end{aligned}$$
where $c=x_i'(X_{(-i)}'X_{(-i)})^{-1}x_i$, using

$$[A+UV']^{-1}=A^{-1}-\frac{A^{-1}UV'A^{-1}}{1+V'A^{-1}U}. \qquad (6)$$
Then

$$\begin{aligned}x_i'\hat\beta &= x_i'[I-(X_{(-i)}'X_{(-i)})^{-1}x_ix_i'/(1+c)]\hat\beta_{(-i)}+x_i'(X_{(-i)}'X_{(-i)})^{-1}x_iy_i/(1+c)\\ &= [x_i'-cx_i'/(1+c)]\hat\beta_{(-i)}+c(1+c)^{-1}y_i\\ &= (1+c)^{-1}x_i'\hat\beta_{(-i)}+c(1+c)^{-1}y_i\\[4pt] y_i-x_i'\hat\beta &= y_i-(1+c)^{-1}x_i'\hat\beta_{(-i)}-c(1+c)^{-1}y_i = (1+c)^{-1}y_i-(1+c)^{-1}x_i'\hat\beta_{(-i)} = (1+c)^{-1}(y_i-x_i'\hat\beta_{(-i)})\\[4pt] x_i'(\hat\beta-\hat\beta_{(-i)}) &= (y_i-x_i'\hat\beta_{(-i)})-(y_i-x_i'\hat\beta) = (1+c)(y_i-x_i'\hat\beta)-(y_i-x_i'\hat\beta)\\ &= c(y_i-x_i'\hat\beta) = x_i'(X_{(-i)}'X_{(-i)})^{-1}x_i(y_i-x_i'\hat\beta)\end{aligned}$$

Hence

$$\begin{aligned}\hat\beta-\hat\beta_{(-i)} &= (X_{(-i)}'X_{(-i)})^{-1}x_i(y_i-x_i'\hat\beta) = (X'X-x_ix_i')^{-1}x_i(y_i-x_i'\hat\beta)\\ &= \left[(X'X)^{-1}+(X'X)^{-1}x_ix_i'(X'X)^{-1}/(1-x_i'(X'X)^{-1}x_i)\right]x_i(y_i-x_i'\hat\beta)\\ &= \left[(X'X)^{-1}x_i+(X'X)^{-1}x_ih_{ii}/(1-h_{ii})\right](y_i-x_i'\hat\beta)\\ &= (X'X)^{-1}x_i(1-h_{ii})^{-1}(y_i-x_i'\hat\beta)\end{aligned}$$

using (6). Hence the Cook’s distance is

$$\begin{aligned}D_i &= (\hat\beta_{(-i)}-\hat\beta)'X'X(\hat\beta_{(-i)}-\hat\beta)/(p\hat\sigma^2)\\ &= (y_i-x_i'\hat\beta)\,\frac{x_i'(X'X)^{-1}(X'X)(X'X)^{-1}x_i}{(1-h_{ii})^2(p\hat\sigma^2)}\,(y_i-x_i'\hat\beta)\\ &= (y_i-x_i'\hat\beta)^2\frac{h_{ii}}{(1-h_{ii})^2(p\hat\sigma^2)} = \frac{r_i^2}{(1-h_{ii})^2}\,\frac{h_{ii}}{p\hat\sigma^2}\\ &= \left[\frac{r_i}{\hat\sigma(1-h_{ii})^{1/2}}\right]^2\frac{h_{ii}}{1-h_{ii}}\,\frac{1}{p}\end{aligned}$$

B. To show $[I+k(X'X)^{-1}]^{-1}=I-k[X'X+kI]^{-1}$, we show
$$\begin{aligned}[I+k(X'X)^{-1}]\left\{I-k[X'X+kI]^{-1}\right\} &= I+k(X'X)^{-1}-k[X'X+kI]^{-1}-k(X'X)^{-1}k[X'X+kI]^{-1}\\ &= I+k(X'X)^{-1}-k[X'X+kI]^{-1}[I+k(X'X)^{-1}]\\ &= I+k(X'X)^{-1}-k[X'X+kI]^{-1}[X'X+k(X'X)^{-1}(X'X)](X'X)^{-1}\\ &= I+k(X'X)^{-1}-k(X'X)^{-1}\\ &= I\end{aligned}$$
