
8 The Likelihood Ratio Test

8.1 The likelihood ratio

We often want to test hypotheses in situations where the adopted probability model involves several unknown parameters. Thus we may denote an element of the parameter space by

\[
\theta = (\theta_1, \theta_2, \ldots, \theta_k).
\]
Some of these parameters may be nuisance parameters (e.g. testing hypotheses on the unknown mean of a normal distribution with unknown variance, where the variance is regarded as a nuisance parameter). We use the likelihood ratio, $\lambda(x)$, defined as

\[
\lambda(x) = \frac{\sup\{L(\theta; x) : \theta \in \Theta_0\}}{\sup\{L(\theta; x) : \theta \in \Theta\}}, \qquad x \in \mathbb{R}^n.
\]

The informal argument for this is as follows.

For a realisation x, determine its best chance of occurrence under H0 and also its best chance overall. The ratio of these two chances can never exceed unity, but, if small, would constitute evidence for rejection of the null hypothesis.

A likelihood ratio test for testing H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 is a test with critical region of the form

\[
C_1 = \{x : \lambda(x) \le k\},
\]
where $k$ is a real number between 0 and 1. Clearly the test will be at significance level $\alpha$ if $k$ can be chosen to satisfy

\[
\sup_{\theta \in \Theta_0} P(\lambda(X) \le k; \theta) = \alpha.
\]

If H0 is a simple hypothesis with Θ0 = {θ0}, we have the simpler form

\[
P(\lambda(X) \le k; \theta_0) = \alpha.
\]

To determine $k$, we must look at the c.d.f. of the statistic $\lambda(X)$, where the random vector $X$ has joint p.d.f. $f_X(x; \theta_0)$.

Example Exponential distribution

Let $X = (X_1, \ldots, X_n)$ be a random sample from an exponential distribution with parameter $\theta$.

Test $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$.

Here $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = (\theta_0, \infty)$, so $\Theta = [\theta_0, \infty)$. The likelihood is

\[
L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta) = \theta^n e^{-\theta \sum_i x_i}.
\]
The numerator of the likelihood ratio is

\[
L(\theta_0; x) = \theta_0^n e^{-n\theta_0 \bar{x}}.
\]

We need to find the supremum as $\theta$ ranges over the interval $[\theta_0, \infty)$. Now

\[
l(\theta; x) = n \log \theta - n\theta\bar{x}
\]
so that
\[
\frac{\partial l(\theta; x)}{\partial \theta} = \frac{n}{\theta} - n\bar{x},
\]
which is zero only when $\theta = 1/\bar{x}$. Since $L(\theta; x)$ is an increasing function for $\theta < 1/\bar{x}$ and decreasing for $\theta > 1/\bar{x}$,
\[
\sup\{L(\theta; x) : \theta \in \Theta\} =
\begin{cases}
\bar{x}^{-n} e^{-n}, & \text{if } 1/\bar{x} \ge \theta_0, \\
\theta_0^n e^{-n\theta_0 \bar{x}}, & \text{if } 1/\bar{x} < \theta_0.
\end{cases}
\]


[Figure: sketches of $L(\theta; x)$ against $\theta$, marking $\sup\{L(\theta; x) : \theta \in \Theta\}$ in the two cases $1/\bar{x} \ge \theta_0$ and $1/\bar{x} < \theta_0$.]
Hence
\[
\lambda(x) =
\begin{cases}
\dfrac{\theta_0^n e^{-n\theta_0\bar{x}}}{\bar{x}^{-n} e^{-n}}, & 1/\bar{x} \ge \theta_0, \\[1.5ex]
1, & 1/\bar{x} < \theta_0,
\end{cases}
\;=\;
\begin{cases}
\theta_0^n \bar{x}^n e^{-n\theta_0\bar{x}} e^{n}, & 1/\bar{x} \ge \theta_0, \\
1, & 1/\bar{x} < \theta_0.
\end{cases}
\]
Since
\[
\frac{d}{d\bar{x}}\left(\bar{x}^n e^{-n\theta_0\bar{x}}\right) = n\bar{x}^{n-1} e^{-n\theta_0\bar{x}}(1 - \theta_0\bar{x})
\]
is positive for values of $\bar{x}$ between 0 and $1/\theta_0$ (where $\theta_0 > 0$), it follows that $\lambda(x)$ is a non-decreasing function of $\bar{x}$. Therefore the critical region of the likelihood ratio test is of the form
\[
C_1 = \left\{x : \sum_{i=1}^{n} x_i \le c\right\}.
\]
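As a quick numerical illustration, here is a minimal sketch of the computation of $\lambda(x)$ in Python/NumPy (the function name and the simulated data are illustrative, not part of the notes):

```python
import numpy as np

def exp_lr(x, theta0):
    """Likelihood ratio lambda(x) for H0: theta = theta0 vs H1: theta > theta0,
    for an exponential(theta) random sample x (density theta * exp(-theta * x))."""
    n, xbar = len(x), np.mean(x)
    if 1 / xbar >= theta0:
        # unrestricted supremum attained at the m.l.e. theta-hat = 1/xbar
        return (theta0 * xbar) ** n * np.exp(n - n * theta0 * xbar)
    return 1.0  # supremum over [theta0, inf) attained at theta0 itself

# Illustration with simulated data (theta0 = 1 is an arbitrary choice)
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=20)  # scale = 1/theta
print(exp_lr(x, theta0=1.0))
```

Small values of this ratio correspond to small $\sum_i x_i$, in line with the critical region above.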

Example The one-sample t-test

The null hypothesis is $H_0 : \theta = \theta_0$ for the mean $\theta$ of a normal distribution with unknown variance $\sigma^2$.

We have
\[
\Theta = \{(\theta, \sigma^2) : \theta \in \mathbb{R},\ \sigma^2 \in \mathbb{R}^+\}, \qquad
\Theta_0 = \{(\theta, \sigma^2) : \theta = \theta_0,\ \sigma^2 \in \mathbb{R}^+\},
\]
and
\[
f(x; \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2\right), \qquad x \in \mathbb{R}.
\]
The likelihood function is
\[
L(\theta, \sigma^2; x) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \theta)^2\right).
\]
Since
\[
l(\theta_0, \sigma^2; x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \theta_0)^2
\]
and
\[
\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \theta_0)^2,
\]
which is zero when
\[
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \theta_0)^2,
\]
we conclude that

\[
\sup L(\theta_0, \sigma^2; x) = \left(\frac{2\pi}{n} \sum_{i=1}^{n} (x_i - \theta_0)^2\right)^{-n/2} e^{-n/2}.
\]

For the denominator, we already know from previous examples that the m.l.e. of $\theta$ is $\bar{x}$, so

\[
\sup L(\theta, \sigma^2; x) = \left(\frac{2\pi}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{-n/2} e^{-n/2}
\]
and
\[
\lambda(x) = \left(\frac{\sum_{i=1}^{n} (x_i - \theta_0)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)^{-n/2}.
\]
This may be written in a more convenient form. Note that

\[
\sum_{i=1}^{n} (x_i - \theta_0)^2 = \sum_{i=1}^{n} \left((x_i - \bar{x}) + (\bar{x} - \theta_0)\right)^2
= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \theta_0)^2,
\]

so that
\[
\lambda(x) = \left(1 + \frac{n(\bar{x} - \theta_0)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)^{-n/2}.
\]
The critical region is

\[
C_1 = \{x : \lambda(x) \le k\},
\]
so it follows that $H_0$ is to be rejected when the value of
\[
\frac{|\bar{x} - \theta_0|}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]
exceeds some constant. Now we have already seen that

\[
\frac{\bar{X} - \theta}{S/\sqrt{n}} \sim t(n-1), \qquad \text{where } S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
\]
Therefore it makes sense to write the critical region in the form
\[
C_1 = \left\{x : \frac{|\bar{x} - \theta_0|}{s/\sqrt{n}} \ge c\right\},
\]
which is the standard form of the two-sided t-test for a single sample.
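The test is easy to carry out numerically. Below is a minimal sketch in Python using NumPy and SciPy (the helper name and the simulated data are illustrative):

```python
import numpy as np
from scipy import stats

def one_sample_t(x, theta0, alpha=0.05):
    """Two-sided one-sample t-test of H0: mean = theta0."""
    n = len(x)
    xbar, s = np.mean(x), np.std(x, ddof=1)   # ddof=1 gives S^2 with divisor n-1
    t_obs = abs(xbar - theta0) / (s / np.sqrt(n))
    c = stats.t.ppf(1 - alpha / 2, df=n - 1)  # critical value of t(n-1)
    return t_obs, c, t_obs >= c               # reject H0 if t_obs >= c

# Illustration on simulated data (theta0 = 0 is an arbitrary choice)
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=25)
print(one_sample_t(x, theta0=0.0))
```

In practice `scipy.stats.ttest_1samp(x, popmean=theta0)` performs the same test and returns the statistic and p-value directly.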

8.2 The likelihood ratio statistic

Since $-2\log\lambda(x)$ is a decreasing function of $\lambda(x)$, it follows that the critical region of the likelihood ratio test can also be expressed in the form

\[
C_1 = \{x : -2\log\lambda(x) \ge c\}.
\]

Writing
\[
\Lambda(x) = -2\log\lambda(x) = 2\left(l(\hat{\theta}; x) - l(\hat{\theta}_0; x)\right),
\]
where $\hat{\theta}$ and $\hat{\theta}_0$ maximise the log-likelihood over $\Theta$ and $\Theta_0$ respectively, the critical region may be written as

\[
C_1 = \{x : \Lambda(x) \ge c\},
\]
and $\Lambda(X)$ is called the likelihood ratio statistic. We have been using the idea that values of $\theta$ close to $\hat{\theta}$ are well supported by the data so, if $\theta_0$ is a possible value of $\theta$, then it turns out that, for large samples,
\[
\Lambda(X) \xrightarrow{D} \chi^2_p,
\]
where $p = \dim(\theta)$.

Let us see why.

8.2.1 The asymptotic distribution of the likelihood ratio statistic

Write
\[
l(\theta_0) = l(\hat{\theta}) + (\theta_0 - \hat{\theta})\,l'(\hat{\theta}) + \frac{1}{2}(\theta_0 - \hat{\theta})^2\,l''(\hat{\theta}) + \cdots
\]
and, remembering that $l'(\hat{\theta}) = 0$, we have
\[
\Lambda \approx (\hat{\theta} - \theta_0)^2 \left(-l''(\hat{\theta})\right)
= (\hat{\theta} - \theta_0)^2 J(\hat{\theta})
= (\hat{\theta} - \theta_0)^2 I(\theta_0)\,\frac{J(\hat{\theta})}{I(\theta_0)}.
\]
But
\[
(\hat{\theta} - \theta_0)\,I(\theta_0)^{1/2} \xrightarrow{D} N(0, 1) \qquad \text{and} \qquad \frac{J(\hat{\theta})}{I(\theta_0)} \xrightarrow{P} 1,
\]
so
\[
(\hat{\theta} - \theta_0)^2 I(\theta_0) \xrightarrow{D} \chi^2_1
\]

and Slutsky's theorem gives
\[
\Lambda \xrightarrow{D} \chi^2_1,
\]
provided $\theta_0$ is the true value of $\theta$.
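This asymptotic result is easy to check by simulation. The sketch below (illustrative settings, not part of the notes) uses the exponential model from the earlier example, but now with the unrestricted alternative $H_1 : \theta \ne \theta_0$, and compares empirical quantiles of $\Lambda$ with those of $\chi^2_1$:

```python
import numpy as np
from scipy import stats

# Simulate Lambda = 2(l(theta_hat) - l(theta_0)) for exponential(theta) samples
# generated under H0 (theta = theta_0), and compare with chi-squared(1) quantiles.
rng = np.random.default_rng(2)
theta0, n, reps = 1.0, 200, 10_000

xbar = rng.exponential(scale=1/theta0, size=(reps, n)).mean(axis=1)
theta_hat = 1 / xbar                                  # m.l.e. for the exponential model
lam = 2 * n * (np.log(theta_hat) - theta_hat * xbar   # l(theta_hat)/n ...
               - np.log(theta0) + theta0 * xbar)      # ... minus l(theta_0)/n

for q in (0.90, 0.95, 0.99):
    print(q, np.quantile(lam, q), stats.chi2.ppf(q, df=1))
```

The empirical and theoretical quantiles should agree closely for moderate $n$.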

Example Poisson distribution

Let $X = (X_1, \ldots, X_n)$ be a random sample from a Poisson distribution with parameter $\theta$, and test $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$ at significance level 0.05. The p.m.f. is
\[
p(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}, \qquad x = 0, 1, \ldots,
\]
so that
\[
l(\theta; x) = -n\theta + \sum_{i=1}^{n} x_i \log\theta - \sum_{i=1}^{n} \log x_i!
\]
and
\[
\frac{\partial l(\theta; x)}{\partial \theta} = -n + \frac{1}{\theta} \sum_{i=1}^{n} x_i,
\]
giving $\hat{\theta} = \bar{x}$. Therefore
\[
\Lambda = 2n\left(\theta_0 - \bar{x} + \bar{x}\log\frac{\bar{x}}{\theta_0}\right).
\]
The distribution of $\Lambda$ under $H_0$ is approximately $\chi^2_1$ and $\chi^2_1(0.95) = 3.84$, so the critical region of the test is
\[
C_1 = \left\{x : 2n\left(\theta_0 - \bar{x} + \bar{x}\log\frac{\bar{x}}{\theta_0}\right) \ge 3.84\right\}.
\]
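A minimal sketch of this test in Python/NumPy (the function name and data are illustrative; note the statistic requires $\bar{x} > 0$):

```python
import numpy as np

def poisson_lr_test(x, theta0, crit=3.84):
    """Likelihood ratio test of H0: theta = theta0 for Poisson data,
    using the chi-squared(1) critical value 3.84 at level 0.05.
    Assumes the sample mean is positive (otherwise log fails)."""
    n, xbar = len(x), np.mean(x)
    lam = 2 * n * (theta0 - xbar + xbar * np.log(xbar / theta0))
    return lam, lam >= crit  # (statistic, reject H0?)

# Illustration: n = 30 observations, testing theta0 = 2 (arbitrary choices)
rng = np.random.default_rng(3)
x = rng.poisson(lam=2.5, size=30)
print(poisson_lr_test(x, theta0=2.0))
```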

8.3 Testing goodness-of-fit for discrete distributions

The data below were collected by the ecologist E.C. Pielou, who was interested in the pattern of healthy and diseased trees. The subject of her research was Armillaria root rot in a plantation of Douglas firs. She recorded the lengths of 109 runs of diseased trees and these are given below.

\[
\begin{array}{l|cccccc}
\text{Run length} & 1 & 2 & 3 & 4 & 5 & 6 \\
\text{Number of runs} & 71 & 28 & 5 & 2 & 2 & 1
\end{array}
\]

On biological grounds, Pielou proposed a geometric distribution as a proba- bility model. Is this plausible? Let’s try to answer this by first looking at the general case.

Suppose we have $k$ groups with $n_i$ in the $i$th group. Thus
\[
\begin{array}{l|cccccc}
\text{Group} & 1 & 2 & 3 & 4 & \cdots & k \\
\text{Number} & n_1 & n_2 & n_3 & n_4 & \cdots & n_k
\end{array}
\]
where $\sum_i n_i = n$. Suppose further that we have a probability model such that $\pi_i(\theta)$, $i = 1, 2, \ldots, k$, is the probability of being in the $i$th group. Clearly $\sum_i \pi_i(\theta) = 1$. The likelihood is
\[
L(\theta) = n! \prod_{i=1}^{k} \frac{\pi_i(\theta)^{n_i}}{n_i!}
\]
and the log-likelihood is

\[
l(\theta) = \sum_{i=1}^{k} n_i \log\pi_i(\theta) + \log n! - \sum_{i=1}^{k} \log n_i!
\]

Suppose $\hat{\theta}$ maximises $l(\theta)$, being the solution of $l'(\theta) = 0$.

The general alternative is to take the $\pi_i$ as unrestricted by the model and subject only to $\sum_i \pi_i = 1$. Thus we maximise
\[
l(\pi) = \sum_{i=1}^{k} n_i \log\pi_i + \log n! - \sum_{i=1}^{k} \log n_i!
\quad \text{subject to} \quad g(\pi) = \sum_i \pi_i = 1.
\]
Using a Lagrange multiplier $\gamma$ we obtain the set of $k$ equations
\[
\frac{\partial l}{\partial \pi_i} - \gamma\,\frac{\partial g}{\partial \pi_i} = 0, \qquad 1 \le i \le k,
\]
or
\[
\frac{n_i}{\pi_i} - \gamma = 0, \qquad 1 \le i \le k.
\]
Writing this as

\[
n_i - \gamma\pi_i = 0, \qquad 1 \le i \le k,
\]
and summing over $i$, we find $\gamma = n$ and
\[
\hat{\pi}_i = \frac{n_i}{n}.
\]
The likelihood ratio statistic is
\[
\Lambda = 2\left(\sum_{i=1}^{k} n_i \log\frac{n_i}{n} - \sum_{i=1}^{k} n_i \log\pi_i(\hat{\theta})\right)
= 2\sum_{i=1}^{k} n_i \log\frac{n_i}{n\pi_i(\hat{\theta})}.
\]
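This statistic is straightforward to compute. Here is a minimal helper in Python/NumPy (the function name is illustrative; it assumes the fitted probabilities $\pi_i(\hat{\theta})$ are supplied):

```python
import numpy as np

def multinomial_lr(counts, probs):
    """Goodness-of-fit likelihood ratio statistic
    Lambda = 2 * sum n_i * log(n_i / (n * pi_i(theta_hat))),
    given observed counts n_i and fitted cell probabilities pi_i(theta_hat).
    Cells with n_i = 0 contribute 0 (the 0 * log 0 = 0 convention)."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    expected = counts.sum() * probs
    nonzero = counts > 0
    return 2 * np.sum(counts[nonzero] * np.log(counts[nonzero] / expected[nonzero]))
```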

General statement of asymptotic result for the likelihood ratio statistic

Testing $H_0 : \theta \in \Theta_0 \subset \Theta$ against $H_1 : \theta \in \Theta \setminus \Theta_0$, the likelihood ratio statistic
\[
\Lambda = 2\left(\sup_{\theta \in \Theta} l(\theta) - \sup_{\theta \in \Theta_0} l(\theta)\right) \xrightarrow{D} \chi^2_p,
\]
where

\[
p = \dim\Theta - \dim\Theta_0.
\]

In the general case above, where
\[
\Lambda = 2\sum_{i=1}^{k} n_i \log\frac{n_i}{n\pi_i(\hat{\theta})},
\]

the restriction $\sum_{i=1}^{k} \pi_i = 1$ means that $\dim\Theta = k - 1$. Clearly $\dim\Theta_0 = 1$ (the single parameter $\theta$), so $p = k - 2$ and
\[
\Lambda \xrightarrow{D} \chi^2_{k-2}.
\]

Example Pielou's data

These are
\[
\begin{array}{l|cccccc}
\text{Run length} & 1 & 2 & 3 & 4 & 5 & 6 \\
\text{Number of runs} & 71 & 28 & 5 & 2 & 2 & 1
\end{array}
\]

and Pielou proposed a geometric model with p.m.f.

\[
p(x) = (1 - \theta)^{x-1}\theta, \qquad x = 1, 2, \ldots,
\]
where $x$ is the observed run length. Thus, if $x_j$, $1 \le j \le n$, are the observed run lengths, the log-likelihood for Pielou's model is

\[
l(\theta) = \sum_{j=1}^{n} (x_j - 1)\log(1 - \theta) + n\log\theta
\]
and, maximising,
\[
\frac{\partial l(\theta)}{\partial \theta} = -\frac{\sum_{j=1}^{n} x_j - n}{1 - \theta} + \frac{n}{\theta},
\]
which gives
\[
\hat{\theta} = \frac{1}{\bar{x}}.
\]
By the invariance property of m.l.e.'s,

\[
\pi_i(\hat{\theta}) = (1 - \hat{\theta})^{i-1}\hat{\theta} = \frac{(\bar{x} - 1)^{i-1}}{\bar{x}^i}.
\]

The data give $\bar{x} = 1.523$. We can therefore use the expression for $\pi_i(\hat{\theta})$ to calculate
\[
\Lambda = 2\sum_{i=1}^{k} n_i \log\frac{n_i}{n\pi_i(\hat{\theta})} = 3.547.
\]
There are six groups, so $p = 6 - 1 - 1 = 4$.

The approximate distribution of $\Lambda$ is therefore $\chi^2_4$, and $P(\Lambda \ge 3.547) = 0.471$.
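As a check, here is a short computation of $\Lambda$ for these data (a sketch in plain Python/NumPy; note that treating the final cell as the tail probability $P(X \ge 6)$, so that the fitted probabilities sum to one, reproduces the value 3.547 quoted above, which appears to be how it was obtained):

```python
import numpy as np
from scipy import stats

counts = np.array([71, 28, 5, 2, 2, 1])        # runs of length 1..5, and >= 6
n = counts.sum()                                # 109 runs
xbar = (counts * np.arange(1, 7)).sum() / n     # 166/109 = 1.523
theta = 1 / xbar                                # m.l.e. of the geometric parameter

probs = (1 - theta) ** np.arange(5) * theta     # pi_1, ..., pi_5
probs = np.append(probs, 1 - probs.sum())       # last cell: P(X >= 6)

lam = 2 * np.sum(counts * np.log(counts / (n * probs)))
print(lam)                                      # approx. 3.547
print(stats.chi2.sf(lam, df=4))                 # p-value approx. 0.471
```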

There is no evidence against Pielou's conjecture that a geometric distribution is an appropriate model.

Example Two-way contingency tables

Data are obtained by cross-classifying a fixed number of individuals according to two criteria. They are therefore displayed as $n_{ij}$ in a table with $r$ rows and $c$ columns as follows.

\[
\begin{array}{ccc|c}
n_{11} & \cdots & n_{1c} & n_{1.} \\
\vdots & & \vdots & \vdots \\
n_{r1} & \cdots & n_{rc} & n_{r.} \\
\hline
n_{.1} & \cdots & n_{.c} & n
\end{array}
\]

The aim is to investigate the independence of the two classifications. Suppose the $k$th individual goes into cell $(X_k, Y_k)$, $k = 1, 2, \ldots, n$, and that individuals are independent. Let

\[
P\left((X_k, Y_k) = (i, j)\right) = \theta_{ij}, \qquad i = 1, 2, \ldots, r; \; j = 1, 2, \ldots, c,
\]
where $\sum_{ij} \theta_{ij} = 1$. The null hypothesis of independence of classifiers can be written
\[
H_0 : \theta_{ij} = \phi_i \rho_j.
\]
This is on Problem Sheet 4, so here are a few hints. The likelihood function is
\[
L(\theta) = n! \prod_{i,j} \frac{\theta_{ij}^{n_{ij}}}{n_{ij}!},
\]
so the log-likelihood is
\[
l(\theta) = \sum_{i,j} n_{ij} \log\theta_{ij} + \log n! - \sum_{i,j} \log n_{ij}!
\]
Under $H_0$, put $\theta_{ij} = \phi_i \rho_j$ and maximise with respect to the $\phi_i$ and $\rho_j$ subject to $\sum_i \phi_i = \sum_j \rho_j = 1$. You will obtain
\[
\hat{\phi}_i = \frac{n_{i.}}{n}, \qquad \hat{\rho}_j = \frac{n_{.j}}{n}.
\]
Under $H_1$, maximise with respect to the $\theta_{ij}$ subject to $\sum_{ij} \theta_{ij} = 1$. You will obtain
\[
\hat{\theta}_{ij} = \frac{n_{ij}}{n}
\]
and, finally,
\[
\Lambda = 2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij} \log\frac{n_{ij}\,n}{n_{i.}\,n_{.j}}.
\]

Example An historic data set - crime and drinking

These are Pearson's 1909 data on crime and drinking.

\[
\begin{array}{l|cc}
\text{Crime} & \text{Drinker} & \text{Abstainer} \\
\hline
\text{Arson} & 50 & 43 \\
\text{Rape} & 88 & 62 \\
\text{Violence} & 155 & 110 \\
\text{Stealing} & 379 & 300 \\
\text{Coining} & 18 & 14 \\
\text{Fraud} & 63 & 144
\end{array}
\]

Is crime related to drinking? For these data, $\Lambda = 50.52$.
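This value can be reproduced directly from the table (a sketch in Python/NumPy; the variable names are illustrative):

```python
import numpy as np

# Crime and drinking counts: rows = crimes, columns = (drinker, abstainer)
n_ij = np.array([[50, 43], [88, 62], [155, 110],
                 [379, 300], [18, 14], [63, 144]], dtype=float)

n = n_ij.sum()
row_tot = n_ij.sum(axis=1, keepdims=True)   # n_i.
col_tot = n_ij.sum(axis=0, keepdims=True)   # n_.j

lam = 2 * np.sum(n_ij * np.log(n_ij * n / (row_tot * col_tot)))
print(lam)   # approx. 50.52
```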

Under $H_0$, $\Lambda \sim \chi^2_p$ approximately, where $p = \dim\Theta - \dim\Theta_0$. In the notation used earlier, there are apparently 6 values of $\phi_i$ to estimate, but in fact there are only 5 because $\sum_i \phi_i = 1$. Similarly there are $2 - 1 = 1$ values of $\rho_j$. Thus $\dim\Theta_0 = 6$. Because $\sum_{ij} \theta_{ij} = 1$, $\dim\Theta = 12 - 1 = 11$, so $p = 11 - 6 = 5$.

Testing against a $\chi^2$-distribution with 5 degrees of freedom, note that the 0.9999 quantile is 25.75, so we can reject at the 0.0001 level of significance. There is overwhelming evidence that crime and drink are related.

Degrees of freedom

It is clear from the above that, when testing contingency tables, the number of degrees of freedom of the resulting $\chi^2$-distribution is given, in general, by

\[
p = rc - 1 - (r - 1) - (c - 1) = rc - r - c + 1 = (r - 1)(c - 1).
\]

8.4 Pearson's statistic

For testing independence in contingency tables, let $O_{ij}$ be the observed number in cell $(i, j)$, $i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, c$, and $E_{ij}$ be the expected number in cell $(i, j)$. Pearson's statistic is
\[
P = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}.
\]

The expected number $E_{ij}$ in cell $(i, j)$ is calculated under the null hypothesis of independence. If $n_{i.}$ is the total for the $i$th row and the overall total is $n$, then the probability of an observation being in the $i$th row is estimated by
\[
P(i\text{th row}) = \frac{n_{i.}}{n}.
\]
Similarly
\[
P(j\text{th column}) = \frac{n_{.j}}{n}
\]
and
\[
E_{ij} = n \times P(i\text{th row}) \times P(j\text{th column}) = \frac{n_{i.}\,n_{.j}}{n}.
\]

Example Crime and drinking

These are the data on crime and drinking with the row and column totals.
\[
\begin{array}{l|cc|c}
\text{Crime} & \text{Drinker} & \text{Abstainer} & \text{Total} \\
\hline
\text{Arson} & 50 & 43 & 93 \\
\text{Rape} & 88 & 62 & 150 \\
\text{Violence} & 155 & 110 & 265 \\
\text{Stealing} & 379 & 300 & 679 \\
\text{Coining} & 18 & 14 & 32 \\
\text{Fraud} & 63 & 144 & 207 \\
\hline
\text{Total} & 753 & 673 & 1426
\end{array}
\]

The $E_{ij}$ are easily calculated:
\[
E_{11} = \frac{93 \times 753}{1426} = 49.11, \quad \text{and so on.}
\]
Pearson's statistic turns out to be $P = 49.73$, which is tested against a $\chi^2$-distribution with $(6 - 1) \times (2 - 1) = 5$ degrees of freedom, and the conclusion is, of course, the same as before.
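A short computation reproducing these numbers (Python/NumPy sketch):

```python
import numpy as np

# Same crime and drinking table as before
o_ij = np.array([[50, 43], [88, 62], [155, 110],
                 [379, 300], [18, 14], [63, 144]], dtype=float)

n = o_ij.sum()
e_ij = o_ij.sum(axis=1, keepdims=True) * o_ij.sum(axis=0, keepdims=True) / n

pearson = np.sum((o_ij - e_ij) ** 2 / e_ij)
print(e_ij[0, 0])   # E_11 approx. 49.11
print(pearson)      # approx. 49.73
```

In practice `scipy.stats.chi2_contingency` computes the same statistic and returns the p-value, degrees of freedom, and expected counts directly.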

8.4.1 Pearson's statistic and the likelihood ratio statistic

\[
P = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

\[
= \sum_{i,j} \frac{\left(n_{ij} - \dfrac{n_{i.}n_{.j}}{n}\right)^2}{\dfrac{n_{i.}n_{.j}}{n}}.
\]
Consider the Taylor expansion of $x\log(x/a)$ about $x = a$:

\[
x\log\frac{x}{a} = (x - a) + \frac{(x - a)^2}{2a} - \frac{(x - a)^3}{6a^2} + \cdots
\]
Now put $x = n_{ij}$ and $a = \dfrac{n_{i.}n_{.j}}{n}$, so that

\[
n_{ij}\log\frac{n_{ij}\,n}{n_{i.}n_{.j}} = \left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right) + \frac{\left(n_{ij} - \dfrac{n_{i.}n_{.j}}{n}\right)^2}{2\,\dfrac{n_{i.}n_{.j}}{n}} + \cdots
\]
Thus
\[
\sum_{i,j} n_{ij}\log\frac{n_{ij}\,n}{n_{i.}n_{.j}}
\]

\[
= \sum_{i,j}\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right) + \frac{1}{2}\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} + \cdots \approx \frac{1}{2}P,
\]
since the first sum vanishes ($\sum_{i,j} n_{ij} = \sum_{i,j} n_{i.}n_{.j}/n = n$), so that
\[
\Lambda \approx P.
\]
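A quick numerical confirmation on the crime and drinking table (Python/NumPy sketch, reusing the earlier data):

```python
import numpy as np

# Compare Lambda and Pearson's P on the crime and drinking table
o = np.array([[50, 43], [88, 62], [155, 110],
              [379, 300], [18, 14], [63, 144]], dtype=float)

n = o.sum()
e = o.sum(axis=1, keepdims=True) * o.sum(axis=0, keepdims=True) / n

lam = 2 * np.sum(o * np.log(o / e))   # likelihood ratio statistic
pearson = np.sum((o - e) ** 2 / e)    # Pearson's statistic
print(lam, pearson)                   # approx. 50.52 and 49.73
```

The two statistics are close, as the expansion above predicts.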
