
21 Testing Problems

Testing is certainly another fundamental problem of inference. It is interestingly different from estimation: the accuracy measures are fundamentally different, and the appropriate asymptotic theory is also different. Much of the theory of testing has revolved, in one way or another, around the Neyman-Pearson lemma, which has led to many ancillary developments in other problems in mathematical statistics. Testing has led to the useful idea of local alternatives, and, like maximum likelihood in estimation, the likelihood ratio, Wald, and Rao tests have earned the status of default methods, with a neat and quite unified asymptotic theory. For all of these reasons, a treatment of testing is essential. We discuss the asymptotic theory of the likelihood ratio, Wald, and Rao score tests in this chapter. Principal references for this chapter are Bickel and Doksum (2001), Ferguson (1996), and Sen and Singer (1993). Many other specific references are given in the sections.

21.1 Likelihood Ratio Tests

The likelihood ratio test is a general omnibus test applicable, in principle, in most finite-dimensional parametric problems. Thus, let $X^{(n)} = (X_1, \ldots, X_n)$ be the observed data with joint distribution $P_\theta^{(n)} \ll \mu_n$, $\theta \in \Theta$, and density $f_\theta(x^{(n)}) = dP_\theta^{(n)}/d\mu_n$. Here, $\mu_n$ is some appropriate $\sigma$-finite measure on $\mathcal{X}_n$, which we assume to be a subset of a Euclidean space. For testing $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta - \Theta_0$, the likelihood ratio test (LRT) rejects $H_0$ for small values of

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} f_\theta(x^{(n)})}{\sup_{\theta \in \Theta} f_\theta(x^{(n)})}.$$

The motivation for Λn comes from two sources:

(a) In the case where $H_0$ and $H_1$ are each simple, a most powerful (MP) test is obtained from $\Lambda_n$ by the Neyman-Pearson lemma.

(b) The intuitive explanation that for small values of Λn we can better match the observed data with some value of θ outside of Θ0.

LRTs are useful because they are omnibus tests and because otherwise optimal tests for a given sample size $n$ are generally hard to find. However, the LRT is not a universal test. There are important examples where the LRT simply cannot be used, because the null distribution of the LRT statistic depends on nuisance parameters. Also, the exact distribution of the LRT

statistic is very difficult or impossible to find in many problems. Thus, asymptotics become really important. But the asymptotics of the LRT may be nonstandard under nonstandard conditions. Although we only discuss the case of a parametric model with a fixed finite-dimensional parameter space, LRTs and their useful modifications have been studied when the number of parameters grows with the sample size $n$, and in nonparametric problems. See, e.g., Portnoy (1988), Fan, Hung and Wong (2000), and Fan and Zhang (2001). We start with a series of examples, which illustrate various important aspects of the likelihood ratio method.

21.2 Examples

Example 21.1. Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ and consider testing

$$H_0: \mu = 0 \quad \text{vs.} \quad H_1: \mu \neq 0.$$

Let $\theta = (\mu, \sigma^2)$. Then,

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} (1/\sigma^n) \exp\left(-\frac{1}{2\sigma^2}\sum_i X_i^2\right)}{\sup_{\theta \in \Theta} (1/\sigma^n) \exp\left(-\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2\right)} = \left(\frac{\sum_i (X_i - \bar{X}_n)^2}{\sum_i X_i^2}\right)^{n/2}$$

by an elementary calculation of the mles of $\theta$ under $H_0$ and in the general parameter space. By another elementary calculation, $\Lambda_n$ is small exactly when $t_n^2$ is large, where
$$t_n = \frac{\sqrt{n}\,\bar{X}_n}{\sqrt{\frac{1}{n-1}\sum_i (X_i - \bar{X}_n)^2}}$$
is the t-statistic. In other words, the t-test is the LRT. Also, observe that

$$t_n^2 = \frac{n\bar{X}_n^2}{\frac{1}{n-1}\sum_i (X_i - \bar{X}_n)^2} = \frac{(n-1)\left(\sum_i X_i^2 - \sum_i (X_i - \bar{X}_n)^2\right)}{\sum_i (X_i - \bar{X}_n)^2} = (n-1)\Lambda_n^{-2/n} - (n-1).$$

This implies
$$\Lambda_n = \left(\frac{n-1}{t_n^2 + n - 1}\right)^{n/2}$$
$$\Rightarrow \log \Lambda_n = \frac{n}{2}\log\frac{n-1}{t_n^2 + n - 1}$$
$$\Rightarrow -2\log \Lambda_n = n\log\left(1 + \frac{t_n^2}{n-1}\right) = n\left(\frac{t_n^2}{n-1} + o_p\left(\frac{t_n^2}{n-1}\right)\right) \xrightarrow{\mathcal{L}} \chi_1^2$$
under $H_0$, since $t_n \xrightarrow{\mathcal{L}} N(0,1)$ under $H_0$.
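This convergence is easy to check by simulation. The sketch below (a Monte Carlo illustration, not part of the text; the sample size, replication count, and the cutoff 3.8415, the upper 5% point of $\chi_1^2$, are choices of this example) computes $-2\log \Lambda_n = n\log(1 + t_n^2/(n-1))$ under $H_0$ and compares its rejection rate with 0.05:

```python
import numpy as np

# Simulate -2 log Lambda_n = n log(1 + t_n^2/(n-1)) under H0: mu = 0.
rng = np.random.default_rng(0)
n, reps = 200, 20000
x = rng.normal(0.0, 2.0, size=(reps, n))      # sigma = 2 is unknown to the test
t = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
stat = n * np.log1p(t ** 2 / (n - 1))

# P(chi^2_1 > 3.8415) = 0.05; the simulated rejection rate should be close
rate = np.mean(stat > 3.8415)
print(rate)
```

With $n = 200$ the rejection rate lands near 0.05, in line with the $\chi_1^2$ limit.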

Example 21.2. Consider a multinomial distribution $MN(n, p_1, \ldots, p_k)$ and the problem of testing

H0 : p1 = p2 = ···= pk vs. H1 : H0 is not true.

Let $n_1, \ldots, n_k$ denote the observed cell frequencies. Then, by an elementary calculation,
$$\Lambda_n = \prod_{i=1}^k \left(\frac{n}{k n_i}\right)^{n_i}$$
$$\Rightarrow -\log \Lambda_n = n\log\frac{k}{n} + \sum_{i=1}^k n_i \log n_i.$$
The exact distribution of this is a messy discrete object, and so asymptotics will be useful. We illustrate the asymptotics for $k = 2$. In this case,

$$-\log \Lambda_n = n\log 2 - n\log n + n_1\log n_1 + (n - n_1)\log(n - n_1).$$
Let $Z_n = (n_1 - n/2)/\sqrt{n/4}$. Then,

$$-\log \Lambda_n = n\log 2 - n\log n + \frac{n + \sqrt{n}Z_n}{2}\log\frac{n + \sqrt{n}Z_n}{2} + \frac{n - \sqrt{n}Z_n}{2}\log\frac{n - \sqrt{n}Z_n}{2}$$
$$= -n\log n + \frac{n + \sqrt{n}Z_n}{2}\log(n + \sqrt{n}Z_n) + \frac{n - \sqrt{n}Z_n}{2}\log(n - \sqrt{n}Z_n)$$
$$= \frac{n + \sqrt{n}Z_n}{2}\log\left(1 + \frac{Z_n}{\sqrt{n}}\right) + \frac{n - \sqrt{n}Z_n}{2}\log\left(1 - \frac{Z_n}{\sqrt{n}}\right)$$
$$= \frac{n + \sqrt{n}Z_n}{2}\left(\frac{Z_n}{\sqrt{n}} - \frac{Z_n^2}{2n} + o_p\left(\frac{Z_n^2}{n}\right)\right) + \frac{n - \sqrt{n}Z_n}{2}\left(-\frac{Z_n}{\sqrt{n}} - \frac{Z_n^2}{2n} + o_p\left(\frac{Z_n^2}{n}\right)\right)$$
$$= \frac{Z_n^2}{2} + o_p(1).$$
Hence $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \chi_1^2$ under $H_0$, as $Z_n \xrightarrow{\mathcal{L}} N(0,1)$ under $H_0$.

Remark: The popular test in this problem is the Pearson chi-square test, which rejects $H_0$ for large values of
$$\sum_{i=1}^k \frac{(n_i - n/k)^2}{n/k}.$$

Interestingly, for $k = 2$ this test statistic too has an asymptotic $\chi_1^2$ distribution, as does the LRT statistic.
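The two statistics can be computed side by side. In the sketch below (illustrative helper functions, not from the text), both are evaluated on the same two-cell counts and come out nearly equal:

```python
import numpy as np

# -2 log Lambda_n = 2 [ n log(k/n) + sum_i n_i log n_i ] for H0: p_1 = ... = p_k
def lrt_stat(counts):
    n, k = counts.sum(), len(counts)
    nz = counts[counts > 0]                  # 0 log 0 = 0 convention
    return 2.0 * (n * np.log(k / n) + np.sum(nz * np.log(nz)))

# Pearson's chi-square: sum_i (n_i - n/k)^2 / (n/k)
def pearson_stat(counts):
    n, k = counts.sum(), len(counts)
    return np.sum((counts - n / k) ** 2 / (n / k))

counts = np.array([232, 268])                # n = 500 observations in k = 2 cells
print(lrt_stat(counts), pearson_stat(counts))
```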

Example 21.3. This example shows that the chi-square asymptotics of the LRT statistic fail when parameters have constraints, or when there is otherwise a boundary phenomenon.

Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N_2(\theta, I)$, where the parameter space is restricted to $\Theta = \{\theta = (\theta_1, \theta_2): \theta_1 \geq 0, \theta_2 \geq 0\}$. Consider testing

$$H_0: \theta = 0 \quad \text{vs.} \quad H_1: H_0 \text{ is not true}.$$

We write the $n$ realizations as $x_i = (x_{1i}, x_{2i})^T$ and denote the sample average by $\bar{x} = (\bar{x}_1, \bar{x}_2)$. The mle of $\theta$ is given by
$$\hat{\theta} = \begin{pmatrix} \bar{x}_1 \vee 0 \\ \bar{x}_2 \vee 0 \end{pmatrix}.$$

Therefore, by an elementary calculation,
$$\Lambda_n = \frac{\exp\left(-\sum_i x_{1i}^2/2 - \sum_i x_{2i}^2/2\right)}{\exp\left(-\sum_i (x_{1i} - \bar{x}_1 \vee 0)^2/2 - \sum_i (x_{2i} - \bar{x}_2 \vee 0)^2/2\right)}.$$

Case 1: $\bar{x}_1 \leq 0, \bar{x}_2 \leq 0$. In this case $\Lambda_n = 1$ and $-2\log \Lambda_n = 0$.

Case 2: $\bar{x}_1 > 0, \bar{x}_2 \leq 0$. In this case $-2\log \Lambda_n = n\bar{x}_1^2$.

Case 3: $\bar{x}_1 \leq 0, \bar{x}_2 > 0$. In this case $-2\log \Lambda_n = n\bar{x}_2^2$.

Case 4: $\bar{x}_1 > 0, \bar{x}_2 > 0$. In this case $-2\log \Lambda_n = n\bar{x}_1^2 + n\bar{x}_2^2$.

Now, under H0 each of the above four cases has a probability 1/4 of occurrence.

Therefore, under H0,

$$-2\log \Lambda_n \xrightarrow{\mathcal{L}} \frac{1}{4}\delta_{\{0\}} + \frac{1}{2}\chi_1^2 + \frac{1}{4}\chi_2^2,$$
in the sense that this mixture distribution is the weak limit.
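Because $\bar{X} \sim N(0, I/n)$ exactly under $H_0$, the mixture limit can be checked by simulating the sample mean directly (a sketch; $n$ and the replication count are arbitrary choices of this illustration):

```python
import numpy as np

# Simulate -2 log Lambda_n = n[(xbar1 v 0)^2 + (xbar2 v 0)^2] under H0: theta = 0.
rng = np.random.default_rng(1)
n, reps = 400, 20000
xbar = rng.normal(0.0, 1.0 / np.sqrt(n), size=(reps, 2))   # exact law of the mean
stat = n * (np.maximum(xbar[:, 0], 0.0) ** 2 + np.maximum(xbar[:, 1], 0.0) ** 2)

frac0 = np.mean(stat == 0.0)     # weight of the point mass; should be near 1/4
print(frac0, stat.mean())        # the mixture has mean 0*(1/4) + 1*(1/2) + 2*(1/4) = 1
```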

Example 21.4. In bio-equivalence trials, a brand-name drug and a generic are compared with regard to some important clinical variable, such as average drug concentration in blood over a 24-hour time period. By testing for a difference, the problem is reduced to a single variable, often assumed to be normal. Formally, one has $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, and the bio-equivalence hypothesis is:

$$H_0: |\mu| \leq \epsilon \quad \text{for some specified } \epsilon > 0.$$

Inference is always significantly harder if known constraints on the parameters are enforced. Casella and Strawderman (1980) is a standard introduction to the normal problem with restrictions on the mean; Robertson, Wright and Dykstra (1988) is an almost encyclopedic exposition. Coming back to our example, to derive the LRT, we need the restricted mle and the unrestricted mle. We will assume here that $\sigma^2$ is known. Then the mle under $H_0$ is
$$\hat{\mu}_{H_0} = \begin{cases} \bar{X} & \text{if } |\bar{X}| \leq \epsilon \\ \epsilon\,\text{sign}(\bar{X}) & \text{if } |\bar{X}| > \epsilon. \end{cases}$$

Consequently, the LRT statistic Λn satisfies,

$$-2\log \Lambda_n = \begin{cases} 0 & \text{if } |\bar{X}| \leq \epsilon \\ \frac{n}{\sigma^2}(\bar{X} - \epsilon)^2 & \text{if } \bar{X} > \epsilon \\ \frac{n}{\sigma^2}(\bar{X} + \epsilon)^2 & \text{if } \bar{X} < -\epsilon. \end{cases}$$
Take a fixed $\mu$. Then,
$$P_\mu(|\bar{X}| \leq \epsilon) = \Phi\left(\frac{\sqrt{n}(\epsilon - \mu)}{\sigma}\right) + \Phi\left(\frac{\sqrt{n}(\epsilon + \mu)}{\sigma}\right) - 1,$$
$$P_\mu(\bar{X} < -\epsilon) = \Phi\left(-\frac{\sqrt{n}(\epsilon + \mu)}{\sigma}\right),$$

and
$$P_\mu(\bar{X} > \epsilon) = 1 - \Phi\left(\frac{\sqrt{n}(\epsilon - \mu)}{\sigma}\right).$$
From these expressions we get:

Case 1: If $-\epsilon < \mu < \epsilon$, then $P_\mu(|\bar{X}| \leq \epsilon) \to 1$ and so $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \delta_{\{0\}}$.

Case 2.a: If $\mu = -\epsilon$, then each of $P_\mu(|\bar{X}| \leq \epsilon)$ and $P_\mu(\bar{X} < -\epsilon)$ converges to $1/2$. In this case, $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \frac{1}{2}\delta_{\{0\}} + \frac{1}{2}\chi_1^2$, i.e., a mixture of a point mass and a chi-square distribution.

Case 2.b: Similarly, if $\mu = \epsilon$, then again $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \frac{1}{2}\delta_{\{0\}} + \frac{1}{2}\chi_1^2$.

Case 3: If $\mu > \epsilon$, then $P_\mu(|\bar{X}| \leq \epsilon) \to 0$ and $P_\mu(\bar{X} > \epsilon) \to 1$. In this case, $-2\log \Lambda_n$ behaves as $NC\chi^2(1, n(\mu - \epsilon)^2/\sigma^2)$, a noncentral chi-square distribution whose noncentrality parameter grows with $n$.

Case 4: Likewise, if $\mu < -\epsilon$, then $-2\log \Lambda_n$ behaves as $NC\chi^2(1, n(\mu + \epsilon)^2/\sigma^2)$.

So, unfortunately, even the asymptotic distribution of $-2\log \Lambda_n$ under the null depends on the exact value of $\mu$.

Example 21.5. Consider the Behrens-Fisher problem with

$$X_1, X_2, \ldots, X_m \overset{iid}{\sim} N(\mu_1, \sigma_1^2), \qquad Y_1, Y_2, \ldots, Y_n \overset{iid}{\sim} N(\mu_2, \sigma_2^2),$$
where all $m + n$ observations are independent. We want to test

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2.$$

Let $\hat{\mu}, \hat{\sigma}_1^2, \hat{\sigma}_2^2$ denote the restricted mles of $\mu$, $\sigma_1^2$, and $\sigma_2^2$, respectively, under $H_0$. They satisfy the equations
$$\hat{\sigma}_1^2 = (\bar{X} - \hat{\mu})^2 + \frac{1}{m}\sum_i (X_i - \bar{X})^2$$
$$\hat{\sigma}_2^2 = (\bar{Y} - \hat{\mu})^2 + \frac{1}{n}\sum_j (Y_j - \bar{Y})^2$$
$$0 = \frac{m(\bar{X} - \hat{\mu})}{\hat{\sigma}_1^2} + \frac{n(\bar{Y} - \hat{\mu})}{\hat{\sigma}_2^2}.$$

This gives,

$$\Lambda_{m,n} = \left(\frac{\frac{1}{m}\sum (X_i - \bar{X})^2}{\frac{1}{m}\sum (X_i - \bar{X})^2 + (\bar{X} - \hat{\mu})^2}\right)^{m/2} \left(\frac{\frac{1}{n}\sum (Y_j - \bar{Y})^2}{\frac{1}{n}\sum (Y_j - \bar{Y})^2 + (\bar{Y} - \hat{\mu})^2}\right)^{n/2}.$$
However, the LRT fails as a matter of practical use, because its distribution depends on the nuisance parameter $\sigma_1^2/\sigma_2^2$, which is unknown.
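The three restricted-mle equations have no closed-form solution, but they can be solved by iterating the third (weighted-mean) equation to a fixed point. The sketch below assumes that this simple fixed-point iteration converges for the data at hand; `restricted_mle` is a name of this illustration, not of the text:

```python
import numpy as np

def restricted_mle(x, y, iters=500):
    # Solve the restricted-mle equations for (mu, s1^2, s2^2) under H0: mu1 = mu2.
    m, n = len(x), len(y)
    xb, yb = x.mean(), y.mean()
    vx, vy = x.var(), y.var()              # (1/m) sum (x_i - xbar)^2, etc.
    mu = (m * xb + n * yb) / (m + n)       # starting value
    for _ in range(iters):
        s1 = vx + (xb - mu) ** 2
        s2 = vy + (yb - mu) ** 2
        mu = (m * xb / s1 + n * yb / s2) / (m / s1 + n / s2)
    return mu, s1, s2

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.0, 3.0, 80)
mu, s1, s2 = restricted_mle(x, y)
# the third likelihood equation should now be (nearly) satisfied
resid = len(x) * (x.mean() - mu) / s1 + len(y) * (y.mean() - mu) / s2
print(mu, abs(resid))
```

Each update keeps $\hat{\mu}$ a weighted average of the two sample means, which is why the iteration is well behaved for typical data.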

Example 21.6. In nonregular problems, the mles behave fundamentally differently with respect to their asymptotic distributions; for example, limit distributions of mles need not be normal (see Chapter 16). It is interesting, therefore, that limit distributions of $-2\log \Lambda_n$ in nonregular cases still follow the chi-square recipe, but with different degrees of freedom.

As an example, suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U[\mu - \sigma, \mu + \sigma]$ and suppose we want to test $H_0: \mu = 0$. Let $W = X_{(n)} - X_{(1)}$ and $U = \max(X_{(n)}, -X_{(1)})$. Under $H_0$, $U$ is complete and sufficient, and $W/U$ is an ancillary statistic. Therefore, by Basu's theorem (Basu (1955)), under $H_0$, $U$ and $W/U$ are independent. This will be useful to us shortly. Now, by a straightforward calculation,

$$\Lambda_n = \left(\frac{W}{2U}\right)^n$$

$$\Rightarrow -2\log \Lambda_n = 2n(\log 2U - \log W).$$

Therefore, under H0,

$$E e^{it(-2\log \Lambda_n)} = E e^{2nit(\log 2U - \log W)} = \frac{E e^{2nit(-\log W)}}{E e^{2nit(-\log 2U)}}$$
by the independence of $U$ and $W/U$. This works out, on doing the calculation, to
$$E e^{it(-2\log \Lambda_n)} = \frac{n-1}{n(1 - 2it) - 1} \to \frac{1}{1 - 2it}.$$

Since $(1-2it)^{-1}$ is the characteristic function of the $\chi_2^2$ distribution, it follows that $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \chi_2^2$.

Thus, in spite of the nonregularity, $-2\log \Lambda_n$ is asymptotically chi-square, but we gain a degree of freedom! This phenomenon holds more generally in nonregular cases.
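A quick simulation confirms the two-degree-of-freedom limit (an illustration with arbitrary $n$ and $\sigma$; 5.991 is the $\chi_2^2$ upper 5% point):

```python
import numpy as np

# Simulate -2 log Lambda_n = 2n (log 2U - log W) under H0: mu = 0.
rng = np.random.default_rng(3)
n, reps, sigma = 100, 20000, 1.5
x = rng.uniform(-sigma, sigma, size=(reps, n))
w = x.max(axis=1) - x.min(axis=1)                 # range W
u = np.maximum(x.max(axis=1), -x.min(axis=1))     # U = max(X_(n), -X_(1))
stat = 2 * n * (np.log(2 * u) - np.log(w))

rate = np.mean(stat > 5.991)     # P(chi^2_2 > 5.991) = 0.05
print(rate)
```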

Example 21.7. Consider the problem of testing for the equality of two Poisson means. Thus, let $X_1, X_2, \ldots, X_m \overset{iid}{\sim} \text{Poi}(\mu_1)$ and $Y_1, Y_2, \ldots, Y_n \overset{iid}{\sim} \text{Poi}(\mu_2)$, where the $X_i$'s and

$Y_j$'s are independent. Suppose we wish to test $H_0: \mu_1 = \mu_2$. We assume $m, n \to \infty$ in such a way that
$$\frac{m}{m+n} \to \lambda \quad \text{with } 0 < \lambda < 1.$$

The restricted mle for $\mu = \mu_1 = \mu_2$ is
$$\hat{\mu} = \frac{\sum_i X_i + \sum_j Y_j}{m+n} = \frac{m\bar{X} + n\bar{Y}}{m+n}.$$
Therefore, by an easy calculation,

$$\Lambda_{m,n} = \frac{\left((m\bar{X} + n\bar{Y})/(m+n)\right)^{m\bar{X} + n\bar{Y}}}{\bar{X}^{m\bar{X}}\, \bar{Y}^{n\bar{Y}}}$$

$$\Rightarrow -\log \Lambda_{m,n} = m\bar{X}\log \bar{X} + n\bar{Y}\log \bar{Y} - (m\bar{X} + n\bar{Y})\log\left(\frac{m}{m+n}\bar{X} + \frac{n}{m+n}\bar{Y}\right).$$

The asymptotic distribution of $-\log \Lambda_{m,n}$ can be found by a two-term Taylor expansion, using the multivariate delta theorem. That would be a more direct derivation, but we will instead present a derivation based on an asymptotic technique that is useful in many problems. Define
$$Z_1 = Z_{1,m} = \frac{\sqrt{m}(\bar{X} - \mu)}{\sqrt{\mu}}, \qquad Z_2 = Z_{2,n} = \frac{\sqrt{n}(\bar{Y} - \mu)}{\sqrt{\mu}},$$
where $\mu = \mu_1 = \mu_2$ is the common value of $\mu_1$ and $\mu_2$ under $H_0$. Substituting in the expression for $-\log \Lambda_{m,n}$, we have
$$-\log \Lambda_{m,n} = (m\mu + \sqrt{m\mu}Z_1)\left[\log\left(1 + \frac{Z_1}{\sqrt{m\mu}}\right) + \log \mu\right] + (n\mu + \sqrt{n\mu}Z_2)\left[\log\left(1 + \frac{Z_2}{\sqrt{n\mu}}\right) + \log \mu\right]$$
$$\quad - \left((m+n)\mu + \sqrt{m\mu}Z_1 + \sqrt{n\mu}Z_2\right)\left[\log\left(1 + \frac{m}{m+n}\frac{Z_1}{\sqrt{m\mu}} + \frac{n}{m+n}\frac{Z_2}{\sqrt{n\mu}}\right) + \log \mu\right]$$
$$= (m\mu + \sqrt{m\mu}Z_1)\left(\frac{Z_1}{\sqrt{m\mu}} - \frac{Z_1^2}{2m\mu} + O_p(m^{-3/2}) + \log \mu\right) + (n\mu + \sqrt{n\mu}Z_2)\left(\frac{Z_2}{\sqrt{n\mu}} - \frac{Z_2^2}{2n\mu} + O_p(n^{-3/2}) + \log \mu\right)$$
$$\quad - \left((m+n)\mu + \sqrt{m\mu}Z_1 + \sqrt{n\mu}Z_2\right)\left(\frac{m}{m+n}\frac{Z_1}{\sqrt{m\mu}} + \frac{n}{m+n}\frac{Z_2}{\sqrt{n\mu}} - \frac{m}{(m+n)^2}\frac{Z_1^2}{2\mu} - \frac{n}{(m+n)^2}\frac{Z_2^2}{2\mu} - \frac{\sqrt{mn}}{(m+n)^2}\frac{Z_1 Z_2}{\mu} + O_p(\min(m,n)^{-3/2}) + \log \mu\right).$$
On further algebra, by breaking down the terms and on cancellation,
$$-\log \Lambda_{m,n} = Z_1^2\left(\frac{1}{2} - \frac{m}{2(m+n)}\right) + Z_2^2\left(\frac{1}{2} - \frac{n}{2(m+n)}\right) - \frac{\sqrt{mn}}{m+n}Z_1 Z_2 + o_p(1)$$
$$= \frac{1}{2}\left(\sqrt{\frac{n}{m+n}}\,Z_1 - \sqrt{\frac{m}{m+n}}\,Z_2\right)^2 + o_p(1)$$
$$\xrightarrow{\mathcal{L}} \frac{1}{2}\left(\sqrt{1-\lambda}\,N_1 - \sqrt{\lambda}\,N_2\right)^2,$$
where $N_1, N_2$ are independent $N(0,1)$, using $m/(m+n) \to \lambda$. Therefore, $-2\log \Lambda_{m,n} \xrightarrow{\mathcal{L}} \chi_1^2$, since $\sqrt{1-\lambda}\,N_1 - \sqrt{\lambda}\,N_2 \sim N(0,1)$.
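The $\chi_1^2$ limit can again be checked by simulation (a sketch; the sample sizes, the common mean, and the 5% cutoff 3.8415 are arbitrary choices of this illustration):

```python
import numpy as np

# Simulate -2 log Lambda_{m,n} for two Poisson samples under H0: mu1 = mu2.
rng = np.random.default_rng(4)
m, n, mu, reps = 150, 100, 3.0, 20000
xb = rng.poisson(mu, size=(reps, m)).mean(axis=1)
yb = rng.poisson(mu, size=(reps, n)).mean(axis=1)
pooled = (m * xb + n * yb) / (m + n)
stat = 2 * (m * xb * np.log(xb) + n * yb * np.log(yb)
            - (m * xb + n * yb) * np.log(pooled))

rate = np.mean(stat > 3.8415)    # P(chi^2_1 > 3.8415) = 0.05
print(rate)
```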

Example 21.8. This example gives a general formula for the LRT statistic for testing hypotheses about the natural parameter vector in the $q$-dimensional multiparameter exponential family. What makes the example special is a link of the LRT to the Kullback-Leibler divergence in the exponential family.

Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} p(x|\theta) = e^{\theta' T(x)} c(\theta) h(x)\,(d\mu)$. Suppose we want to test

$H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta - \Theta_0$, where $\Theta$ is the full natural parameter space. Then, by definition, with $\bar{T} = \frac{1}{n}\sum_i T(x_i)$,
$$-2\log \Lambda_n = -2\log \frac{\sup_{\theta_0 \in \Theta_0} c(\theta_0)^n \exp\left(\theta_0' \sum_i T(x_i)\right)}{\sup_{\theta \in \Theta} c(\theta)^n \exp\left(\theta' \sum_i T(x_i)\right)}$$
$$= -2\sup_{\theta_0 \in \Theta_0}\left[n\log c(\theta_0) + n\theta_0'\bar{T}\right] + 2\sup_{\theta \in \Theta}\left[n\log c(\theta) + n\theta'\bar{T}\right]$$
$$= 2n \sup_{\theta \in \Theta}\inf_{\theta_0 \in \Theta_0}\left[(\theta - \theta_0)'\bar{T} + \log c(\theta) - \log c(\theta_0)\right]$$
on a simple rearrangement of the terms. Recall now (see Chapter 2) that the Kullback-Leibler divergence $K(f,g)$ is defined to be

$$K(f,g) = E_f \log\frac{f}{g} = \int \log\frac{f(x)}{g(x)}\,f(x)\,\mu(dx).$$
It follows that

$$K(p_\theta, p_{\theta_0}) = (\theta - \theta_0)' E_\theta T(X) + \log c(\theta) - \log c(\theta_0)$$
for any $\theta, \theta_0 \in \Theta$. We will write $K(\theta, \theta_0)$ for $K(p_\theta, p_{\theta_0})$. Recall also that in the general exponential family, the unrestricted mle $\hat{\theta}$ of $\theta$ exists for all large $n$ and, furthermore, $\hat{\theta}$ is a moment estimate, given by $E_{\theta = \hat{\theta}}(T) = \bar{T}$. Consequently,

$$-2\log \Lambda_n = 2n \sup_{\theta \in \Theta}\inf_{\theta_0 \in \Theta_0}\left[(\theta - \theta_0)' E_{\hat{\theta}}(T) + \log c(\theta) - \log c(\theta_0)\right]$$
$$= 2n \sup_{\theta \in \Theta}\inf_{\theta_0 \in \Theta_0}\left[K(\hat{\theta}, \theta_0) + (\theta - \hat{\theta})' E_{\hat{\theta}}(T) + \log c(\theta) - \log c(\hat{\theta})\right]$$
$$= 2n \inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0) + 2n \sup_{\theta \in \Theta}\left[(\theta - \hat{\theta})'\bar{T} + \log c(\theta) - \log c(\hat{\theta})\right].$$
The second term above vanishes, as the supremum is attained at $\theta = \hat{\theta}$. Thus we ultimately have the identity

$$-2\log \Lambda_n = 2n \inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0).$$
The connection is beautiful. The Kullback-Leibler divergence being a measure of distance, $\inf_{\theta_0 \in \Theta_0} K(\hat{\theta}, \theta_0)$ quantifies the disagreement between the null and the estimated value of the true parameter $\theta$, when estimated by the mle. The formula says that one should accept $H_0$ when this disagreement is small.

The link to the K-L divergence is special for the exponential family; for example, if

$$p_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\theta)^2},$$

then, for any $\theta, \theta_0$, $K(\theta, \theta_0) = (\theta - \theta_0)^2/2$ by a direct calculation. Therefore,

$$-2\log \Lambda_n = 2n \inf_{\theta_0 \in \Theta_0} \frac{1}{2}(\hat{\theta} - \theta_0)^2 = n \inf_{\theta_0 \in \Theta_0} (\bar{X} - \theta_0)^2.$$

If $H_0$ is simple, i.e., $H_0: \theta = \theta_0$, then the above expression simplifies to $-2\log \Lambda_n = n(\bar{X} - \theta_0)^2$, just the familiar chi-square statistic.
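The identity is easy to verify numerically in another exponential family. For $\text{Poi}(\mu)$, one has $K(\mu, \mu_0) = \mu\log(\mu/\mu_0) - (\mu - \mu_0)$; the sketch below (with made-up data and null value) checks that $-2\log \Lambda_n$ computed from the likelihood matches $2nK(\hat{\mu}, \mu_0)$ to machine precision:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(2.0, size=60)
mu0, muhat = 1.5, x.mean()            # null value and unrestricted mle

def loglik(mu):
    # Poisson log-likelihood, dropping the log(x_i!) terms (they cancel in Lambda_n)
    return np.sum(x * np.log(mu) - mu)

lrt = -2.0 * (loglik(mu0) - loglik(muhat))
kl = 2.0 * len(x) * (muhat * np.log(muhat / mu0) - (muhat - mu0))
print(lrt, kl)
```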

21.3 Asymptotic Theory of Likelihood Ratio Test Statistics

Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} f(x|\theta)$, a density with respect to some dominating measure $\mu$. We present the asymptotic theory of the LRT statistic for two types of null hypotheses. Suppose $\theta \in \Theta \subseteq \mathcal{R}^q$ with $\dim \Theta = q$. One type of null hypothesis we consider is $H_0: \theta = \theta_0$ (specified). A second type of null hypothesis is that, for some $0 < r < q$ and functions $h_1, \ldots, h_r$,

$$h_1(\theta) = h_2(\theta) = \cdots = h_r(\theta) = 0.$$

For each case, under regularity conditions to be stated below, the LRT statistic is asymptotically a central chi-square under $H_0$ with a fixed degree of freedom, including the case where $H_0$ is composite.

Example 21.9. Suppose $X_{11}, \ldots, X_{1n_1} \overset{iid}{\sim} \text{Poisson}(\theta_1)$, $X_{21}, \ldots, X_{2n_2} \overset{iid}{\sim} \text{Poisson}(\theta_2)$, and $X_{31}, \ldots, X_{3n_3} \overset{iid}{\sim} \text{Poisson}(\theta_3)$, where $0 < \theta_1, \theta_2, \theta_3 < \infty$, and all observations are independent. An example of $H_0$ of the first type is $H_0: \theta_1 = 1, \theta_2 = 2, \theta_3 = 1$. An example of $H_0$ of the second type is $H_0: \theta_1 = \theta_2 = \theta_3$. The functions $h_1, h_2$ may be chosen as $h_1(\theta) = \theta_1 - \theta_2$, $h_2(\theta) = \theta_2 - \theta_3$.

In the first case, $-2\log \Lambda_n$ is asymptotically $\chi_3^2$ under $H_0$, and in the second case it is asymptotically $\chi_2^2$ under each $\theta \in H_0$. This is a special example of a theorem originally stated by Wilks (1938); also see Lawley (1956). The degree of freedom of the asymptotic chi-square distribution under $H_0$ is the number of independent constraints specified by $H_0$; it is useful to remember this as a general rule.

Theorem 21.1. Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} f_\theta(x) = f(x|\theta) = \frac{dP_\theta}{d\mu}$ for some dominating measure $\mu$. Suppose $\theta \in \Theta$, an affine subspace of $\mathcal{R}^q$ of dimension $q$. Suppose that the conditions required for asymptotic normality of any strongly consistent sequence of roots $\hat{\theta}_n$ of the likelihood equation hold (see Chapter 16). Consider the problem $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. Let
$$\Lambda_n = \frac{\prod_{i=1}^n f(x_i|\theta_0)}{l(\hat{\theta}_n)},$$
where $l(\cdot)$ denotes the likelihood function. Then $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \chi_q^2$ under $H_0$.

The proof can be seen in Ferguson (1996), Bickel and Doksum (2001), or Sen and Singer (1993).

Remark: As we have seen in our illustrative examples earlier in this chapter, the theorem can fail if the regularity conditions fail to hold. For instance, if the null value is a boundary point in a constrained parameter space, then the result will fail.

Remark: This theorem is used in the following way. To use the LRT with an exact level $\alpha$, we need to find $c_{n,\alpha}$ such that $P_{H_0}(\Lambda_n < c_{n,\alpha}) = \alpha$, and such a cutoff is generally impossible to find exactly. The theorem suggests the approximate cutoff: reject $H_0$ when

$$-2\log \Lambda_n > \chi_{\alpha,q}^2.$$

The theorem above implies that $P_{H_0}(-2\log \Lambda_n > \chi_{\alpha,q}^2) \to \alpha$, so the test that rejects when $-2\log \Lambda_n > \chi_{\alpha,q}^2$ has asymptotic size $\alpha$.

Under the second type of null hypothesis the same result holds with the degree of freedom being r. Precisely, the following holds:

Theorem 21.2. Assume all the regularity conditions of the previous theorem. Define $H_{q \times r} = H(\theta) = \left(\frac{\partial}{\partial \theta_i} h_j(\theta)\right)_{1 \leq i \leq q,\ 1 \leq j \leq r}$; it is assumed that the required partial derivatives exist. Suppose for each $\theta \in \Theta_0$, $H$ has full column rank. Define
$$\Lambda_n = \frac{\sup_{\theta:\, h_j(\theta) = 0,\, 1 \leq j \leq r}\, l(\theta)}{l(\hat{\theta}_n)}.$$

Then $-2\log \Lambda_n \xrightarrow{\mathcal{L}} \chi_r^2$ under each $\theta$ in $H_0$. A proof can be seen in Sen and Singer (1993).

21.4 Distribution under Alternatives

To find the power of the test that rejects $H_0$ when $-2\log \Lambda_n > \chi_{\alpha,d}^2$ for some $d$, one would need to know the distribution of $-2\log \Lambda_n$ at the particular value $\theta = \theta_1$ at which we want the power. But the distribution of $-2\log \Lambda_n$ under $\theta_1$ for a fixed $n$ is also generally impossible to find, so again we appeal to asymptotics.

However, there cannot be a nondegenerate limit distribution on [0, ∞) for −2logΛn under a fixed θ1 in the alternative. The following simple example illustrates this difficulty.

Example 21.10. Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, with $\mu, \sigma^2$ both unknown, and we wish to test $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$. We saw earlier that
$$\Lambda_n = \left(\frac{\sum (X_i - \bar{X})^2}{\sum X_i^2}\right)^{n/2} = \left(\frac{\sum (X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2 + n\bar{X}^2}\right)^{n/2}$$
$$\Rightarrow -2\log \Lambda_n = n\log\left(1 + \frac{n\bar{X}^2}{\sum (X_i - \bar{X})^2}\right) = n\log\left(1 + \frac{\bar{X}^2}{\frac{1}{n}\sum (X_i - \bar{X})^2}\right).$$
Consider now a value $\mu \neq 0$. Then $\bar{X}^2 \overset{a.s.}{\to} \mu^2\ (> 0)$ and $\frac{1}{n}\sum (X_i - \bar{X})^2 \overset{a.s.}{\to} \sigma^2$. Therefore, clearly $-2\log \Lambda_n \overset{a.s.}{\to} \infty$ under each fixed $\mu \neq 0$. There cannot be a bona

fide limit distribution for $-2\log \Lambda_n$ under a fixed alternative $\mu$. However, if we let $\mu$ depend on $n$ and take $\mu_n = \frac{\Delta}{\sqrt{n}}$ for some fixed but arbitrary $\Delta$, $0 < \Delta < \infty$, then $-2\log \Lambda_n$ still has a bona fide limit distribution under the sequence of alternatives $\mu_n$. Sometimes alternatives of the form $\mu_n = \frac{\Delta}{\sqrt{n}}$ are motivated by arguing that one wants the test to be powerful at values of $\mu$ close to the null value; such a property would correspond to a sensitive test. These alternatives are called Pitman alternatives.

The following result holds. We present the case hi(θ)=θi for notational simplicity.

Theorem 21.3. Assume the same regularity conditions as in the previous theorems. Let $\theta_{q \times 1} = (\theta_1, \eta)'$, where $\theta_1$ is $r \times 1$ and $\eta$ is $(q-r) \times 1$, a vector of nuisance parameters. Let $H_0: \theta_1 = \theta_{10}$ (specified), and let $\theta_n = (\theta_{10} + \frac{\Delta}{\sqrt{n}}, \eta)$. Let $I(\theta)$ be the Fisher information matrix, with $I_{ij}(\theta) = -E\left(\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(X|\theta)\right)$. Denote $V(\theta) = I^{-1}(\theta)$, and let $V_r(\theta)$ be the upper $(r \times r)$ principal submatrix of $V(\theta)$. Then, under $P_{\theta_n}$, $-2\log \Lambda_n \xrightarrow{\mathcal{L}} NC\chi^2(r, \Delta' V_r^{-1}(\theta_{10}, \eta)\Delta)$, where $NC\chi^2(r, \delta)$ denotes the noncentral $\chi^2$ distribution with $r$ degrees of freedom and noncentrality parameter $\delta$.

Remark: The theorem is essentially proved in Sen and Singer (1993). It can be restated in an obvious way for the testing problem $H_0: g(\theta) = (g_1(\theta), \ldots, g_r(\theta)) = 0$.

Example 21.11. Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$. Let $\theta = (\mu, \sigma^2) = (\theta_1, \eta)$, where $\theta_1 = \mu$ and $\eta = \sigma^2$. Thus $q = 2$ and $r = 1$. Suppose we want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$. Consider alternatives $\theta_n = (\mu_0 + \frac{\Delta}{\sqrt{n}}, \eta)$. By a familiar calculation,
$$I(\theta) = \begin{pmatrix} \frac{1}{\eta} & 0 \\ 0 & \frac{1}{2\eta^2} \end{pmatrix} \Rightarrow V(\theta) = \begin{pmatrix} \eta & 0 \\ 0 & 2\eta^2 \end{pmatrix} \quad \text{and} \quad V_r(\theta) = \eta.$$

Therefore, by the above theorem, under $P_{\theta_n}$, $-2\log \Lambda_n \xrightarrow{\mathcal{L}} NC\chi^2\left(1, \frac{\Delta^2}{\eta}\right) = NC\chi^2\left(1, \frac{\Delta^2}{\sigma^2}\right)$.

Remark: One practical use of this theorem is in approximating the power of the test that rejects $H_0$ if $-2\log \Lambda_n > c$.

Suppose we are interested in approximating the power of the test when $\theta_1 = \theta_{10} + \epsilon$. Formally, set $\epsilon = \frac{\Delta}{\sqrt{n}} \Rightarrow \Delta = \epsilon\sqrt{n}$. Use $NC\chi^2\left(1, \frac{n\epsilon^2}{V_r(\theta_{10}, \eta)}\right)$ as an approximation to the distribution of $-2\log \Lambda_n$ at $\theta = (\theta_{10} + \epsilon, \eta)$. Thus the power is approximated as $P\left(NC\chi^2\left(1, \frac{n\epsilon^2}{V_r(\theta_{10}, \eta)}\right) > c\right)$.
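For $r = 1$ this approximation needs only the normal CDF, since $P(NC\chi^2(1, \delta) \leq c) = \Phi(\sqrt{c} - \sqrt{\delta}) - \Phi(-\sqrt{c} - \sqrt{\delta})$. The sketch below (`approx_power` is a name of this illustration; 3.8415 is the $\chi_1^2$ upper 5% point) implements the recipe for the normal-mean problem of Example 21.11, where $V_r = \sigma^2$:

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_power(eps, n, v_r, c=3.8415):
    # P( NCchi^2(1, n eps^2 / V_r) > c ), the approximate power at theta_10 + eps
    delta = n * eps ** 2 / v_r
    return 1.0 - Phi(math.sqrt(c) - math.sqrt(delta)) + Phi(-math.sqrt(c) - math.sqrt(delta))

# N(mu, sigma^2) with sigma^2 = 1, testing H0: mu = mu0: V_r = sigma^2 = 1
print(approx_power(0.5, 50, 1.0))
```

At $\epsilon = 0$ the formula reduces to the size $\alpha = 0.05$, and the approximate power increases with $n$, as it should.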

21.5 Bartlett Correction

We saw that under regularity conditions $-2\log \Lambda_n$ has asymptotically a $\chi^2(r)$ distribution under $H_0$ for some $r$. Certainly, $-2\log \Lambda_n$ is not exactly distributed as $\chi^2(r)$; in fact, even its expectation is not $r$. It turns out that $E(-2\log \Lambda_n)$ under $H_0$ admits an expansion of the form $r\left(1 + \frac{a}{n} + o(n^{-1})\right)$. Consequently, $E\left(\frac{-2\log \Lambda_n}{1 + \frac{a}{n} + R_n}\right) = r$, where $R_n$ is the remainder term in $E(-2\log \Lambda_n)$. Moreover, $E\left(\frac{-2\log \Lambda_n}{1 + \frac{a}{n}}\right) = r + O(n^{-2})$. Thus we gain higher-order accuracy in the mean by rescaling $-2\log \Lambda_n$ to $\frac{-2\log \Lambda_n}{1 + \frac{a}{n}}$. For the absolutely continuous case, the higher-order accuracy even carries over to the $\chi^2$ approximation itself, i.e.,
$$P\left(\frac{-2\log \Lambda_n}{1 + \frac{a}{n}} \leq c\right) = P(\chi^2(r) \leq c) + O(n^{-2});$$

due to the rescaling, the $\frac{1}{n}$ term that would otherwise be present has vanished. This type of rescaling is known as the Bartlett correction. See Bartlett (1937), Wilks (1938), McCullagh and Cox (1986), and Barndorff-Nielsen and Hall (1988) for detailed treatments of Bartlett corrections in general circumstances.

21.6 The Wald and Rao Score Tests

Competitors to the LRT are available in the literature. They are general and can be applied to a wide selection of problems, and typically the three procedures are asymptotically first-order equivalent. See Wald (1943) and Rao (1948) for the first introductions of these procedures.

We define the Wald and the score statistics first. We have not stated these definitions under the weakest possible conditions. Also note that establishing the asymptotics of these statistics will require more conditions than are needed for just defining them.

Definition 21.1. Let $X_1, X_2, \ldots, X_n$ be iid $f(x|\theta) = \frac{dP_\theta}{d\mu}$, $\theta \in \Theta \subseteq \mathcal{R}^q$, $\mu$ a $\sigma$-finite measure. Suppose the Fisher information matrix $I(\theta)$ exists at all $\theta$. Consider $H_0: \theta = \theta_0$. Let $\hat{\theta}_n$ be the MLE of $\theta$. Define $Q_W = n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$; $Q_W$ is the Wald statistic for $H_0: \theta = \theta_0$.

Definition 21.2. Let $X_1, X_2, \ldots, X_n$ be iid $f(x|\theta) = \frac{dP_\theta}{d\mu}$, $\theta \in \Theta \subseteq \mathcal{R}^q$, $\mu$ a $\sigma$-finite measure. Consider testing $H_0: \theta = \theta_0$. Assume that $f(x|\theta)$ is partially differentiable with respect to each coordinate of $\theta$ for every $x, \theta$, and that the Fisher information matrix exists and is invertible at $\theta = \theta_0$. Let $U_n(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log f(x_i|\theta)$. Define $Q_S = \frac{1}{n} U_n(\theta_0)' I^{-1}(\theta_0) U_n(\theta_0)$; $Q_S$ is the Rao score statistic for $H_0: \theta = \theta_0$.

Remark: As we have seen in Chapter 16, the MLE $\hat{\theta}_n$ is asymptotically multivariate normal under the Cramér-Rao regularity conditions. Therefore, $n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$ is asymptotically a central chi-square, provided $I(\theta)$ is smooth at $\theta_0$. The Wald test rejects $H_0$ when $Q_W = n(\hat{\theta}_n - \theta_0)' I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)$ is larger than an appropriate chi-square percentile.

On the other hand, under simple moment conditions on the score function $\frac{\partial}{\partial \theta} \log f(x_i|\theta)$, $\frac{U_n(\theta)}{\sqrt{n}}$ is asymptotically multivariate normal by the CLT, and therefore the Rao score statistic $Q_S = \frac{1}{n} U_n(\theta_0)' I^{-1}(\theta_0) U_n(\theta_0)$ is asymptotically a chi-square. The score test rejects $H_0$ when $Q_S$ is larger than a chi-square percentile. Notice that in this case of a simple $H_0$, the $\chi^2$ approximation of $Q_S$ holds under fewer assumptions than would be required for $Q_W$ or $-2\log \Lambda_n$. Note also that to evaluate $Q_S$, computation of $\hat{\theta}_n$ is not necessary, which can be a major advantage in some applications.

Here are the asymptotic chi-square results for these two statistics. A version of the two parts of the theorem below is proved in Serfling (1980) and in Sen and Singer (1993). Also see van der Vaart (1998).

Theorem 21.4.

(a) If $f(x|\theta)$ can be differentiated twice under the integral sign, $I(\theta_0)$ exists and is invertible, and $\{x: f(x|\theta) > 0\}$ is independent of $\theta$, then under $H_0: \theta = \theta_0$, $Q_S \xrightarrow{\mathcal{L}} \chi_q^2$, where $q$ is the dimension of $\Theta$.

(b) Assume the Cramér-Rao regularity conditions for the asymptotic multivariate normality of the MLE, and assume that the Fisher information matrix is continuous in a neighborhood of $\theta_0$. Then, under $H_0: \theta = \theta_0$, $Q_W \xrightarrow{\mathcal{L}} \chi_q^2$, with $q$ as above.

Remark: $Q_W$ and $Q_S$ can be defined for composite nulls of the form $H_0: g(\theta) = (g_1(\theta), \ldots, g_r(\theta)) = 0$, and once again, under appropriate conditions, $Q_W$ and $Q_S$ are asymptotically $\chi_r^2$ for any $\theta \in H_0$. However, the definition of $Q_S$ now requires calculation of the restricted MLE $\hat{\theta}_{n, H_0}$ under $H_0$. Again, consult Sen and Singer (1993) for a proof.

Comparisons between the LRT, the Wald test, and Rao's score test have been made using higher-order asymptotics. See Mukerjee and Reid (2001), in particular.
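For a concrete comparison, take $\text{Ber}(p)$ with $H_0: p = p_0$, where $I(p) = 1/(p(1-p))$ and $U_n(p) = (x - np)/(p(1-p))$ with $x$ the number of successes. The sketch below (`trio` is an illustrative helper, and the data are made up) computes all three statistics on one sample; they nearly coincide:

```python
import numpy as np

def trio(x, n, p0):
    # LRT, Wald, and Rao score statistics for Ber(p), H0: p = p0
    phat = x / n
    lrt = 2.0 * (x * np.log(phat / p0) + (n - x) * np.log((1 - phat) / (1 - p0)))
    wald = n * (phat - p0) ** 2 / (phat * (1 - phat))
    score = (x - n * p0) ** 2 / (n * p0 * (1 - p0))
    return lrt, wald, score

print(trio(217, 400, 0.5))
```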

21.7 Likelihood Ratio Confidence Intervals

The usual duality between testing and confidence sets says that the acceptance region of a test with size $\alpha$ can be inverted to give a confidence set of coverage probability at least $1 - \alpha$. In other words, suppose $A(\theta_0)$ is the acceptance region of a size $\alpha$ test for $H_0: \theta = \theta_0$, and define $S(x) = \{\theta_0: x \in A(\theta_0)\}$. Then $P_{\theta_0}(S(x) \ni \theta_0) \geq 1 - \alpha$, and hence $S(x)$ is a $100(1-\alpha)\%$ confidence set for $\theta$. This method is called the inversion of a test. In particular, each of the LRT, the Wald test, and the Rao score test can be inverted to construct confidence sets that asymptotically have $100(1-\alpha)\%$ coverage probability. The confidence sets constructed from the LRT, the Wald test, and the score test are respectively called the likelihood ratio, Wald, and score confidence sets. Of these, the Wald and the score confidence sets are ellipsoids because of how the corresponding test statistics are defined. The likelihood ratio confidence set is typically more complicated. Here is an example.

Example 21.12. Suppose $X_i \overset{iid}{\sim} \text{Bin}(1, p)$, $1 \leq i \leq n$. For testing $H_0: p = p_0$ vs. $H_1: p \neq p_0$, the LRT statistic is
$$\Lambda_n = \frac{p_0^x (1-p_0)^{n-x}}{\sup_p p^x (1-p)^{n-x}}, \quad \text{where } x = \sum_{i=1}^n X_i,$$
$$= \frac{p_0^x (1-p_0)^{n-x}}{\left(\frac{x}{n}\right)^x \left(1 - \frac{x}{n}\right)^{n-x}} = \frac{n^n\, p_0^x (1-p_0)^{n-x}}{x^x (n-x)^{n-x}}.$$

Thus the LR confidence set is of the form $S_{LR}(x) := \{p_0: \Lambda_n \geq k\} = \{p_0: p_0^x (1-p_0)^{n-x} \geq k^*\} = \{p_0: x\log p_0 + (n-x)\log(1-p_0) \geq \log k^*\}$.

The function $x\log p_0 + (n-x)\log(1-p_0)$ is concave in $p_0$, and therefore $S_{LR}(x)$ is an interval. The interval is of the form $[0, u]$ or $[l, 1]$ or $[l, u]$ for $0 < l < u < 1$.

The Wald confidence interval is
$$S_W = \left\{p_0: \frac{n(\hat{p} - p_0)^2}{\hat{p}(1-\hat{p})} \leq \chi_\alpha^2\right\} = \left\{p_0: |\hat{p} - p_0| \leq \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1-\hat{p})}\right\} = \left[\hat{p} - \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1-\hat{p})},\ \hat{p} + \frac{\chi_\alpha}{\sqrt{n}}\sqrt{\hat{p}(1-\hat{p})}\right].$$
This is the textbook confidence interval for $p$.

For the score test statistic, we need
$$U_n(p) = \sum_{i=1}^n \frac{\partial}{\partial p} \log f(x_i|p) = \sum_{i=1}^n \frac{\partial}{\partial p}\left[x_i \log p + (1 - x_i)\log(1-p)\right] = \frac{x - np}{p(1-p)}.$$
Therefore, $Q_S = \frac{(x - np_0)^2}{n p_0 (1-p_0)}$, and the score confidence interval is
$$S_S = \left\{p_0: \frac{(x - np_0)^2}{n p_0(1-p_0)} \leq \chi_\alpha^2\right\} = \left\{p_0: p_0^2(n^2 + n\chi_\alpha^2) - p_0(2nx + n\chi_\alpha^2) + x^2 \leq 0\right\} = [l_S, u_S],$$
where $l_S, u_S$ are the roots of the quadratic equation $p_0^2(n^2 + n\chi_\alpha^2) - p_0(2nx + n\chi_\alpha^2) + x^2 = 0$.

These intervals all have the property
$$\lim_{n \to \infty} P_{p_0}(S \ni p_0) = 1 - \alpha.$$

See Brown, Cai and DasGupta (2001) for a comparison of their exact coverage properties; they show that the performance of the Wald confidence interval is extremely poor, and that the score and likelihood ratio intervals perform much better.
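The three intervals are easy to compute. In the sketch below, the function names and the grid inversion of the LRT are choices of this illustration, and 3.8415 is the $\chi_1^2$ upper 5% point:

```python
import math

CHI2 = 3.8415                              # chi^2_1 upper 5% point

def wald(x, n):
    phat = x / n
    h = math.sqrt(CHI2 * phat * (1 - phat) / n)
    return phat - h, phat + h

def score(x, n):
    # endpoints are the roots of p^2(n^2 + n chi2) - p(2nx + n chi2) + x^2 = 0
    a, b, c = n * n + n * CHI2, 2 * n * x + n * CHI2, x * x
    d = math.sqrt(b * b - 4 * a * c)
    return (b - d) / (2 * a), (b + d) / (2 * a)

def lr(x, n, grid=100001):
    # invert the LRT numerically over a grid of p0 values in (0, 1)
    lo, hi = 1.0, 0.0
    for i in range(1, grid - 1):
        p0 = i / (grid - 1)
        ll = 2 * (x * math.log((x / n) / p0) + (n - x) * math.log((1 - x / n) / (1 - p0)))
        if ll <= CHI2:
            lo, hi = min(lo, p0), max(hi, p0)
    return lo, hi

print(wald(8, 30), score(8, 30), lr(8, 30))
```

For moderate $x/n$ the three intervals nearly agree, but for $x$ near $0$ or $n$ the Wald interval degenerates, one symptom of its poor coverage.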

21.8 Exercises

For each of problems 1 to 18 below, derive the LRT statistic and find its limiting distribution under the null, if a meaningful one exists:

Exercise 21.1. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Ber}(p)$, $H_0: p = \frac{1}{2}$ vs. $H_1: p \neq \frac{1}{2}$. Is the LRT statistic a monotone function of $|\sum_i X_i - \frac{n}{2}|$?

Exercise 21.2. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, $H_0: a \leq \mu \leq b$, $H_1$: not $H_0$.

Exercise 21.3.* $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, $H_0$: $\mu$ is an integer, $H_1$: $\mu$ is not an integer.

Exercise 21.4.* $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, $H_0$: $\mu$ is rational, $H_1$: $\mu$ is irrational.

Exercise 21.5. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Ber}(p_1)$, $Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} \text{Ber}(p_2)$, $H_0: p_1 = p_2$, $H_1: p_1 \neq p_2$; assume all $m + n$ observations are independent.

Exercise 21.6. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Ber}(p_1)$, $Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} \text{Ber}(p_2)$; $H_0: p_1 = p_2 = \frac{1}{2}$, $H_1$: not $H_0$; all observations are independent.

Exercise 21.7. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poi}(\mu)$, $Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} \text{Poi}(\lambda)$, $H_0: \mu = \lambda = 1$, $H_1$: not $H_0$; all observations are independent.

Exercise 21.8.* $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$, $Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} N(\mu_2, \sigma_2^2)$, $H_0: \mu_1 = \mu_2, \sigma_1 = \sigma_2$, $H_1$: not $H_0$ (i.e., test that the two normal distributions are identical); again, all observations are independent.

Exercise 21.9. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, $H_0: \mu = 0, \sigma = 1$, $H_1$: not $H_0$.

Exercise 21.10. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, $H_0: \rho = 0$, $H_1: \rho \neq 0$.

Exercise 21.11. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} MVN(\mu, I)$, $H_0: \mu = 0$, $H_1: \mu \neq 0$.

Exercise 21.12.* $X_1, X_2, \ldots, X_n \overset{iid}{\sim} MVN(\mu, \Sigma)$, $H_0: \mu = 0$, $H_1: \mu \neq 0$.

Exercise 21.13. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} MVN(\mu, \Sigma)$, $H_0: \Sigma = I$, $H_1: \Sigma \neq I$.

Exercise 21.14.* $X_{ij} \overset{indep.}{\sim} U[0, \theta_i]$, $j = 1, 2, \ldots, n$, $i = 1, 2, \ldots, k$, $H_0$: the $\theta_i$ are equal, $H_1$: not $H_0$. Remark: Notice the EXACT chi-square distribution.

Exercise 21.15. $X_{ij} \overset{indep.}{\sim} N(\mu_i, \sigma^2)$, $H_0$: the $\mu_i$ are equal, $H_1$: not $H_0$. Remark: Notice that you get the usual F test for ANOVA.

Exercise 21.16.* $X_1, X_2, \ldots, X_n \overset{iid}{\sim} c(\alpha)\exp(-|x|^\alpha)$, where $c(\alpha)$ is the normalizing constant, $H_0: \alpha = 2$, $H_1: \alpha \neq 2$.

Exercise 21.17. $(X_i, Y_i)$, $i = 1, 2, \ldots, n \overset{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$. Remark: This is the paired t test.

Exercise 21.18. $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$, $Y_1, Y_2, \ldots, Y_m \overset{iid}{\sim} N(\mu_2, \sigma_2^2)$, $H_0: \sigma_1 = \sigma_2$, $H_1: \sigma_1 \neq \sigma_2$; assume all observations are independent.

Exercise 21.19.* Suppose $X_1, X_2, \ldots, X_n$ are iid from a two-component mixture $pN(0,1) + (1-p)N(\mu, 1)$, where $p, \mu$ are unknown. Consider testing $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$. What is the LRT statistic? Simulate the value of the LRT by gradually increasing $n$ and observe its slight decreasing trend as $n$ increases.

Remark: It has been shown by Jon Hartigan that in this case $-2\log \Lambda_n \to \infty$.

Exercise 21.20.* Suppose $X_1, X_2, \ldots, X_n$ are iid from a two-component mixture $pN(\mu_1, 1) + (1-p)N(\mu_2, 1)$, and suppose we know that $|\mu_i| \leq 1$, $i = 1, 2$. Consider testing $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 \neq \mu_2$, when $p$, $0 < p < 1$, is known.

Remark: The asymptotic null distribution is nonstandard, but known.

Exercise 21.21.* For each of Exercises 1, 6, 7, 9, 10, 11, simulate the exact distribution of $-2\log \Lambda_n$ for $n = 10, 20, 40, 100$ under the null, and compare its visual match to the limiting null distribution.

Exercise 21.22.* For each of Exercises 1, 5, 6, 7, 9, simulate the exact power of the LRT at selected alternatives, and compare it to the approximation to the power obtained from the limiting nonnull distribution, as indicated in the text.

Exercise 21.23.* For each of Exercises 1, 6, 7, 9, compute the constant $a$ of the Bartlett correction of the likelihood ratio, as indicated in the text.

Exercise 21.24.* Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. Plot the Wald and the Rao score confidence sets for $\theta = (\mu, \sigma)$ for a simulated data set. Observe the visual similarity of the two sets by gradually increasing $n$.

Exercise 21.25. Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poi}(\mu)$. Compute the likelihood ratio, Wald, and Rao score confidence intervals for $\mu$ for a simulated data set. Observe the similarity of the limits of the intervals for large $n$.

Exercise 21.26.* Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poi}(\mu)$. Derive a two-term expansion for the expected lengths of the Wald and the Rao score intervals for $\mu$. Plot them for selected $n$ and check whether either interval is usually shorter than the other.

Exercise 21.27. Consider the regression model $E(Y_i) = \beta_0 + \beta_1 x_i$, $i = 1, 2, \ldots, n$, with the $Y_i$ mutually independent, distributed as $N(\beta_0 + \beta_1 x_i, \sigma^2)$, $\sigma^2$ unknown. Find the likelihood ratio confidence set for $(\beta_0, \beta_1)$.

21.9 References

Barndorff-Nielsen, O. and Hall, P. (1988). On the level error after Bartlett adjustment of the likelihood ratio statistic, Biometrika, 75, 374-378.

Bartlett, M. (1937). Properties of sufficiency and statistical tests, Proc. Royal Soc. London, Ser. A, 160, 268-282.

Bickel, P.J. and Doksum, K. (2001). Mathematical Statistics, Basic Ideas and Selected Topics, Prentice Hall, Upper Saddle River, NJ.

Brown, L., Cai, T. and DasGupta, A. (2001). Interval estimation for a binomial proportion, Statist. Sci., 16, 2, 101-133.

Casella, G. and Strawderman, W.E. (1980). Estimating a bounded normal mean, Ann. Stat., 9, 4, 870-878.

Fan, J., Hung, H. and Wong, W. (2000). Geometric understanding of likelihood ratio statistics, JASA, 95, 451, 836-841.

Fan, J. and Zhang, C. (2001). Generalized likelihood ratio statistics and Wilks' phenomenon, Ann. Stat., 29, 1, 153-193.

Ferguson, T.S. (1996). A Course in Large Sample Theory, Chapman and Hall, London.

Lawley, D.N. (1956). A general method for approximating the distribution of likelihood ratio criteria, Biometrika, 43, 295-303.

McCullagh, P. and Cox, D. (1986). Invariants and likelihood ratio statistics, Ann. Stat., 14, 4, 1419-1430.

Mukerjee, R. and Reid, N. (2001). Comparison of test statistics via expected lengths of associated confidence intervals, Jour. Stat. Planning and Inf., 97, 1, 141-151.

Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity, Ann. Stat., 16, 356-366.

Rao, C.R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation, Proc. Cambridge Philos. Soc., 44, 50-57.

Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference, John Wiley, New York.

Sen, P.K. and Singer, J. (1993). Large Sample Methods in Statistics, An Introduction with Applications, Chapman and Hall, New York.

Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York.

van der Vaart, A. (1998). Asymptotic Statistics, Cambridge University Press, Cambridge.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large, Trans. Amer. Math. Soc., 54, 426-482.

Wilks, S. (1938). The large sample distribution of the likelihood ratio for testing composite hypotheses, Ann. Math. Statist., 9, 60-62.
