Chapter 10

Asymptotic Evaluations

10.1 Point Estimation

10.1.1 Consistency

The property of consistency seems to be quite a fundamental one, requiring that the estimator converges to the correct value as the sample size becomes infinite.

Definition 10.1.1 A sequence of estimators Wn = Wn(X1, . . . , Xn) is a consistent sequence of estimators of the parameter θ if, for every ε > 0 and every θ ∈ Θ,

$$\lim_{n\to\infty} P_\theta(|W_n - \theta| < \varepsilon) = 1,$$

or equivalently,

$$\lim_{n\to\infty} P_\theta(|W_n - \theta| \ge \varepsilon) = 0.$$

Note that in this definition we are dealing with a family of probability structures: it requires that for every θ, the corresponding estimator sequence converges to θ in probability.

Recall that, for an estimator Wn, Chebychev's Inequality states

$$P_\theta(|W_n - \theta| \ge \varepsilon) \le \frac{E_\theta[(W_n - \theta)^2]}{\varepsilon^2},$$

so if, for every θ ∈ Θ,

$$\lim_{n\to\infty} E_\theta[(W_n - \theta)^2] = 0,$$

then the sequence of estimators is consistent. Furthermore,

$$E_\theta[(W_n - \theta)^2] = \mathrm{Var}_\theta W_n + [\mathrm{Bias}_\theta W_n]^2.$$

Putting this all together, we can state the following theorem.

Theorem 10.1.1 If Wn is a sequence of estimators of a parameter θ satisfying, for every θ ∈ Θ,

i. limn→∞ Varθ Wn = 0,

ii. limn→∞ Biasθ Wn = 0,

then Wn is a consistent sequence of estimators of θ.

Example 10.1.1 (Consistency of X̄) Let X1, X2, . . . be iid N(θ, 1), and consider the sequence

$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i.$$

Since

$$E_\theta \bar X_n = \theta \quad\text{and}\quad \mathrm{Var}_\theta\,\bar X_n = \frac{1}{n},$$

the sequence X̄n is consistent.
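A quick simulation makes the convergence visible. The following is a minimal sketch, not part of the original text; the true value θ = 2, the tolerance ε = 0.1, the sample-size grid, and the seed are arbitrary choices.

```python
import numpy as np

# Check both ingredients of Theorem 10.1.1 for the sample mean of N(theta, 1)
# data: the Monte Carlo variance of xbar and the empirical miss probability
# P(|xbar - theta| >= eps) should both shrink toward 0 as n grows.
rng = np.random.default_rng(0)
theta, eps, reps = 2.0, 0.1, 1000

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    miss = np.mean(np.abs(xbar - theta) >= eps)
    print(f"n={n:>6}  Var(xbar) ~ {xbar.var():.5f}  P(|xbar-theta| >= eps) ~ {miss:.3f}")
```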

Theorem 10.1.2 Let Wn be a consistent sequence of estimators of a parameter θ. Let a1, a2, . . . and b1, b2, . . . be sequences of constants satisfying

i. limn→∞ an = 1,

ii. limn→∞ bn = 0.

Then the sequence Un = anWn + bn is a consistent sequence of estimators of θ.

Theorem 10.1.3 (Consistency of MLEs) Let X1, X2, . . . be iid f(x|θ), and let $L(\theta|\mathbf x) = \prod_{i=1}^n f(x_i|\theta)$ be the likelihood function. Let θ̂ denote the MLE of θ, and let τ(θ) be a continuous function of θ. Under regularity conditions on f(x|θ) and, hence, L(θ|x), for every ε > 0 and every θ ∈ Θ,

$$\lim_{n\to\infty} P_\theta\big(|\tau(\hat\theta) - \tau(\theta)| \ge \varepsilon\big) = 0.$$

That is, τ(θ̂) is a consistent estimator of τ(θ).

For a proof of the theorem, see Stuart, Ord, and Arnold (1999).

10.1.2 Efficiency

The property of consistency is concerned with the asymptotic accuracy of an estimator: does it converge to the parameter that it is estimating? In this section we look at a related property, efficiency, which is concerned with the asymptotic variance of an estimator.

In calculating an asymptotic variance, we are, perhaps, tempted to proceed as follows. Given an estimator Tn based on a sample of size n, we calculate the finite-sample variance Var Tn and then evaluate limn→∞ kn Var Tn, where kn is some normalizing constant. Note that, in many cases, Var Tn → 0 as n → ∞, so we need a factor kn to force it to a limit.

Definition 10.1.2 For an estimator Tn, if limn→∞ kn Var Tn = τ² < ∞, where {kn} is a sequence of constants, then τ² is called the limiting variance or limit of the variances.

Example 10.1.2 (Limiting variance) For the mean X̄n of n iid normal observations with EX = µ and Var X = σ², if we take Tn = X̄n, then limn→∞ n Var X̄n = σ² is the limiting variance of Tn.

Definition 10.1.3 For an estimator Tn, suppose that kn(Tn − τ(θ)) → N(0, σ²) in distribution. The parameter σ² is called the asymptotic variance or variance of the limit distribution of Tn.

For calculations of the variances of sample means and other types of averages, the limiting variance and the asymptotic variance typically have the same value. But in more complicated cases, the limiting variance will sometimes fail us. It is also interesting to note that the asymptotic variance is never larger than the limiting variance.

Example 10.1.3 (Large-sample mixture variances) Consider a mixture model, where we observe Yn ∼ N(0, 1) with probability pn and Yn ∼ N(0, σn²) with probability 1 − pn. First, with the formula Var(X) = E(Var(X|Y)) + Var(E(X|Y)), we have

$$\mathrm{Var}(Y_n) = p_n + (1 - p_n)\sigma_n^2.$$

It then follows that the limiting variance of Yn is finite only if

$$\lim_{n\to\infty}(1 - p_n)\sigma_n^2 < \infty.$$

On the other hand, the asymptotic distribution of Yn can be calculated directly from

$$P(Y_n < a) = p_n P(Z < a) + (1 - p_n) P(Z < a/\sigma_n),$$

where Z ∼ N(0, 1).

Suppose now we let pn → 1 and σn → ∞ in such a way that (1 − pn)σn² → ∞. It then follows that P(Yn < a) → P(Z < a), that is, Yn → N(0, 1) in distribution, and we have

$$\text{limiting variance} = \lim_{n\to\infty}\big[p_n + (1 - p_n)\sigma_n^2\big] = \infty,$$
$$\text{asymptotic variance} = 1.$$
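A simulation sketch makes the dichotomy concrete. The particular rates pn = 1 − 1/n and σn = n are our choice (they give (1 − pn)σn² = n → ∞), as are the replication count and seed.

```python
import numpy as np
from scipy import stats

# Mixture of Example 10.1.3: Y_n is N(0,1) with probability p_n and N(0, sigma_n^2)
# with probability 1 - p_n.  The variance blows up, yet the distribution of Y_n
# settles down to a standard normal.
rng = np.random.default_rng(1)
reps = 200_000

for n in [10, 100, 1000]:
    p_n, sigma_n = 1 - 1 / n, float(n)
    contaminated = rng.random(reps) > p_n          # occurs with probability 1 - p_n
    y = np.where(contaminated,
                 rng.normal(0.0, sigma_n, reps),
                 rng.normal(0.0, 1.0, reps))
    print(f"n={n:>5}  Var(Y_n) ~ {y.var():10.1f}  "
          f"P(Y_n < 1.96) ~ {np.mean(y < 1.96):.4f}  (Phi(1.96) = {stats.norm.cdf(1.96):.4f})")
```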

Definition 10.1.4 A sequence of estimators Wn is asymptotically efficient for a parameter τ(θ) if √n[Wn − τ(θ)] → N(0, ν(θ)) in distribution and

$$\nu(\theta) = \frac{[\tau'(\theta)]^2}{E_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^{\!2}\right]};$$

that is, the asymptotic variance of Wn achieves the Cramér–Rao lower bound.

Theorem 10.1.4 (Asymptotic efficiency of MLEs) Let X1, X2, . . . be iid f(x|θ), let θ̂ denote the MLE of θ, and let τ(θ) be a continuous function of θ. Under the regularity conditions on f(x|θ) and, hence, L(θ|x),

$$\sqrt{n}\,[\tau(\hat\theta) - \tau(\theta)] \to N(0, \nu(\theta)),$$

where ν(θ) is the Cramér–Rao lower bound. That is, τ(θ̂) is a consistent and asymptotically efficient estimator of τ(θ).

Proof: Recall that $l(\theta|\mathbf x) = \sum_i \log f(x_i|\theta)$ is the log likelihood function. Denote derivatives (with respect to θ) by l′, l″, . . . . Now expand the first derivative of the log likelihood around the true value θ0,

$$l'(\theta|\mathbf{x}) = l'(\theta_0|\mathbf{x}) + (\theta - \theta_0)\,l''(\theta_0|\mathbf{x}) + \cdots.$$

Now substitute the MLE θ̂ for θ and use the fact that l′(θ̂|x) = 0. Rearranging and multiplying through by √n gives

$$\sqrt{n}\,(\hat\theta - \theta_0) = \sqrt{n}\,\frac{-l'(\theta_0|\mathbf{x})}{l''(\theta_0|\mathbf{x})} = \frac{-\frac{1}{\sqrt n}\,l'(\theta_0|\mathbf{x})}{\frac{1}{n}\,l''(\theta_0|\mathbf{x})}.$$

Let $I(\theta_0) = E[l'(\theta_0|X)]^2$ denote the information number for one observation. We can see that

$$-\frac{1}{\sqrt n}\,l'(\theta_0|\mathbf{x}) = -\sqrt{n}\left[\frac{1}{n}\sum_i W_i\right],$$

where $W_i = \big(\tfrac{d}{d\theta} f(X_i|\theta)\big|_{\theta_0}\big)/f(X_i|\theta_0)$ has mean 0 and variance I(θ0). By the Central Limit Theorem, we have

$$-\frac{1}{\sqrt n}\,l'(\theta_0|\mathbf{x}) \to N[0, I(\theta_0)] \quad\text{in distribution.}$$

Write

$$-\frac{1}{n}\,l''(\theta_0|\mathbf{x}) = \frac{1}{n}\sum_i W_i^2 - \frac{1}{n}\sum_i \frac{\frac{d^2}{d\theta^2} f(X_i|\theta)\big|_{\theta_0}}{f(X_i|\theta_0)},$$

where the mean of Wi² is I(θ0) and the mean of the second term is 0. Applying the WLLN, we have

$$-\frac{1}{n}\,l''(\theta_0|\mathbf{x}) \to I(\theta_0) \quad\text{in probability.}$$

By Slutsky's theorem, we have

$$\sqrt{n}\,(\hat\theta - \theta_0) \to N\!\left(0, \frac{1}{I(\theta_0)}\right).$$

Now assume that τ(θ) is differentiable at θ = θ0. By the Delta Method, we have

$$\sqrt{n}\,[\tau(\hat\theta) - \tau(\theta)] \to N(0, \nu(\theta)).$$

Since

$$\sqrt{n}\;\frac{\tau(\hat\theta) - \tau(\theta)}{\sqrt{\nu(\theta)}} \to Z \quad\text{in distribution},$$

where Z ∼ N(0, 1), applying Slutsky's theorem gives

$$\tau(\hat\theta) - \tau(\theta) = \left(\frac{\sqrt{\nu(\theta)}}{\sqrt n}\right)\left(\sqrt{n}\;\frac{\tau(\hat\theta) - \tau(\theta)}{\sqrt{\nu(\theta)}}\right) \to \lim_{n\to\infty}\left(\frac{\sqrt{\nu(\theta)}}{\sqrt n}\right) Z = 0,$$

so τ(θ̂) − τ(θ) → 0 in distribution. From Theorem 5.5.13 we know that convergence in distribution to a point is equivalent to convergence in probability, so τ(θ̂) is a consistent estimator of τ(θ). ∎

10.1.3 Calculations and Comparisons

If an MLE is asymptotically efficient, the asymptotic variance in Theorem 10.1.4 is the Delta Method variance of Theorem 5.5.24 (without the 1/n term). Thus, we can use the Cramér–Rao lower bound as an approximation to the true variance of the MLE. Suppose that X1, . . . , Xn are iid f(x|θ), θ̂ is the MLE of θ, and

$$I_n(\theta) = E_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log L(\theta|\mathbf X)\right)^{\!2}\right]$$

is the information number of the sample. From the Delta Method and the asymptotic efficiency of MLEs, the variance of h(θ̂) can be approximated by

$$\mathrm{Var}(h(\hat\theta)|\theta) \approx \frac{[h'(\theta)]^2}{I_n(\theta)} = \frac{[h'(\theta)]^2}{E_\theta\!\left(-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\right)} \approx \frac{[h'(\theta)]^2\big|_{\theta=\hat\theta}}{-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\big|_{\theta=\hat\theta}}, \tag{10.1}$$

where $-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\big|_{\theta=\hat\theta}$ is called the observed information number. Efron and Hinkley (1978) have shown that use of the observed information number is superior to the expected information number in this case.

Note that the variance estimation process is a two-step procedure. First we approximate Var(h(θ̂)|θ); then we estimate the resulting approximation, usually by substituting θ̂ for θ. The resulting estimate can be denoted by $\widehat{\mathrm{Var}}_{\hat\theta}\,h(\hat\theta)$.

It follows from the proof of Theorem 10.1.4 that $-\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\big|_{\theta=\hat\theta}$ is a consistent estimator of I(θ), so $\widehat{\mathrm{Var}}_{\hat\theta}\,h(\hat\theta)$ is a consistent estimator of Var(h(θ̂)|θ).

Example 10.1.4 (Approximate binomial variance) Suppose we have a random sample X1, . . . , Xn from a Bernoulli(p) population. It is easy to show that p̂ = ΣXi/n is the MLE of p. By direct calculation we know that

$$\mathrm{Var}_p(\hat p) = \frac{p(1-p)}{n},$$

and a reasonable estimate of Varp(p̂) is

$$\widehat{\mathrm{Var}}_p(\hat p) = \frac{\hat p(1-\hat p)}{n}.$$

If we apply the approximation in (10.1), with h(p) = p, we get as an estimate of Varp(p̂)

$$\widehat{\mathrm{Var}}_p(\hat p) \approx \frac{1}{-\frac{\partial^2}{\partial p^2}\log L(p|\mathbf x)\big|_{p=\hat p}}.$$

Recall that

$$\log L(p|\mathbf x) = n\hat p\log p + n(1-\hat p)\log(1-p)$$

and

$$\frac{\partial^2}{\partial p^2}\log L(p|\mathbf x)\Big|_{p=\hat p} = -\frac{n\hat p}{\hat p^2} - \frac{n(1-\hat p)}{(1-\hat p)^2} = -\frac{n}{\hat p(1-\hat p)},$$

which gives a variance approximation identical to the previous one.

Now we move to a more complicated case: estimating the variance of p̂/(1 − p̂). Applying (10.1) with h(p) = p/(1 − p), its variance can be estimated by

$$\widehat{\mathrm{Var}}\!\left(\frac{\hat p}{1-\hat p}\right) = \frac{\left[\frac{\partial}{\partial p}\!\left(\frac{p}{1-p}\right)\right]^2\Big|_{p=\hat p}}{-\frac{\partial^2}{\partial p^2}\log L(p|\mathbf x)\Big|_{p=\hat p}} = \frac{\left[\frac{(1-p)+p}{(1-p)^2}\right]^2\Big|_{p=\hat p}}{\frac{n}{\hat p(1-\hat p)}} = \frac{\hat p}{n(1-\hat p)^3}.$$

Moreover, we also know that the estimator is asymptotically efficient.
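A small Monte Carlo check of this last approximation is below; it is a sketch, and the choices n = 200, p = 0.3, the replication count, and the seed are ours.

```python
import numpy as np

# Compare the observed-information approximation phat / (n (1 - phat)^3) for
# Var(phat/(1 - phat)) in Example 10.1.4 with the variance seen in simulation.
rng = np.random.default_rng(2)
n, p_true, reps = 200, 0.3, 20_000

phat = rng.binomial(n, p_true, size=reps) / n
odds_hat = phat / (1 - phat)
var_approx = phat / (n * (1 - phat) ** 3)      # approximation (10.1)

print(f"average of the variance estimates: {var_approx.mean():.6f}")
print(f"Monte Carlo Var(phat/(1-phat)):    {odds_hat.var():.6f}")
```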

The MLE variance approximation works well in many cases, but it is not infallible. In particular, we must be careful when the function h(θ̂) is not monotone. In such cases, the derivative h′ will have a sign change, and that may lead to an underestimated variance approximation.

Example 10.1.5 (Continuation of Example 10.1.4) Suppose now that we want to estimate the variance of the Bernoulli distribution, p(1 − p). The MLE of this variance is p̂(1 − p̂), and an estimate of the variance of this estimator can be obtained by applying the approximation (10.1). We have

$$\widehat{\mathrm{Var}}(\hat p(1-\hat p)) = \frac{\left[\frac{\partial}{\partial p}(p(1-p))\right]^2\Big|_{p=\hat p}}{-\frac{\partial^2}{\partial p^2}\log L(p|\mathbf x)\Big|_{p=\hat p}} = \frac{(1-2p)^2\big|_{p=\hat p}}{\frac{n}{\hat p(1-\hat p)}} = \frac{\hat p(1-\hat p)(1-2\hat p)^2}{n},$$

which can be 0 if p̂ = 1/2, a clear underestimate of the variance of p̂(1 − p̂). The fact that the function p(1 − p) is not monotone is the cause of this problem; the simulation sketch below illustrates the failure.
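In this sketch (n, the replication count, and the seed are our choices), the variance estimate collapses to exactly 0 in every sample with p̂ = 1/2, even though p̂(1 − p̂) certainly has positive variance.

```python
import numpy as np

# Delta-method estimate phat(1-phat)(1-2*phat)^2 / n for Var(phat(1-phat)),
# evaluated on Bernoulli(1/2) samples.  The Monte Carlo variance is strictly
# positive, but the estimate is exactly 0 whenever phat = 1/2.
rng = np.random.default_rng(3)
n, reps = 100, 50_000

phat = rng.binomial(n, 0.5, size=reps) / n
vhat = phat * (1 - phat)                              # MLE of p(1 - p)
delta = phat * (1 - phat) * (1 - 2 * phat) ** 2 / n   # approximation (10.1)

print(f"Monte Carlo Var(phat(1-phat)):    {vhat.var():.2e}")
print(f"fraction of samples estimating 0: {(delta == 0).mean():.3f}")
```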

The property of asymptotic efficiency gives us a benchmark for what we can hope to attain in asymptotic variance. We also can use the asymptotic variance as a means of comparing estimators, through the idea of asymptotic relative efficiency.

Definition 10.1.5 If two estimators Wn and Vn satisfy

$$\sqrt{n}\,[W_n - \tau(\theta)] \to N(0, \sigma_W^2),$$
$$\sqrt{n}\,[V_n - \tau(\theta)] \to N(0, \sigma_V^2)$$

in distribution, the asymptotic relative efficiency (ARE) of Vn with respect to Wn is

$$\mathrm{ARE}(V_n, W_n) = \frac{\sigma_W^2}{\sigma_V^2}.$$

Example 10.1.6 (AREs of Poisson estimators) Suppose that X1, X2, . . . are iid Poisson(λ), and we are interested in estimating the zero probability, P(X = 0) = e^{−λ}.

A natural estimator comes from defining Yi = I(Xi = 0) and using

$$\hat\tau = \frac{1}{n}\sum_{i=1}^n Y_i.$$

The Yi's are Bernoulli(e^{−λ}), and hence it follows that

$$E(\hat\tau) = e^{-\lambda} \quad\text{and}\quad \mathrm{Var}(\hat\tau) = \frac{e^{-\lambda}(1 - e^{-\lambda})}{n}.$$

Alternatively, the MLE of e^{−λ} is e^{−λ̂}, where λ̂ = Σi Xi/n is the MLE of λ. Using the Delta Method approximation, we have

$$E(e^{-\hat\lambda}) \approx e^{-\lambda} \quad\text{and}\quad \mathrm{Var}(e^{-\hat\lambda}) \approx \frac{\lambda e^{-2\lambda}}{n}.$$

Since

$$\sqrt{n}\,(\hat\tau - e^{-\lambda}) \to N(0,\, e^{-\lambda}(1 - e^{-\lambda})),$$
$$\sqrt{n}\,(e^{-\hat\lambda} - e^{-\lambda}) \to N(0,\, \lambda e^{-2\lambda})$$

in distribution, the ARE of τ̂ with respect to the MLE e^{−λ̂} is

$$\mathrm{ARE}(\hat\tau, e^{-\hat\lambda}) = \frac{\lambda e^{-2\lambda}}{e^{-\lambda}(1 - e^{-\lambda})} = \frac{\lambda}{e^{\lambda} - 1}.$$

Examination of this function shows that it is strictly decreasing, with a maximum of 1 attained at λ = 0; it tails off rapidly (falling below 10% at λ = 4) and asymptotes to 0 as λ → ∞.
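The decay is easy to tabulate; here is a short sketch (the grid of λ values is our choice).

```python
import numpy as np

# ARE(tau_hat, e^{-lambda_hat}) = lambda / (e^lambda - 1) from Example 10.1.6.
for lam in [0.01, 0.5, 1.0, 2.0, 4.0, 8.0]:
    are = lam / np.expm1(lam)      # expm1(x) = e^x - 1, accurate for small x
    print(f"lambda = {lam:5.2f}   ARE = {are:.4f}")
```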

Since the MLE is asymptotically efficient, another estimator cannot hope to beat its asymptotic variance. However, other estimators may have other desirable properties (ease of calculation, robustness to underlying assumptions) that make them attractive. In such situations, the efficiency of the MLE becomes important in calibrating what we are giving up if we use an alternative estimator.

10.2 Hypothesis Testing

10.2.1 Asymptotic Distribution of LRTs

One of the most useful methods for complicated models is the likelihood ratio method of test construction, because it gives an explicit definition of the test statistic,

$$\lambda(\mathbf x) = \frac{\sup_{\Theta_0} L(\theta|\mathbf x)}{\sup_{\Theta} L(\theta|\mathbf x)},$$

and an explicit form for the rejection region, {x : λ(x) ≤ c}. To define a level α test, the constant c must be chosen so that

$$\sup_{\theta\in\Theta_0} P_\theta(\lambda(\mathbf X) \le c) \le \alpha. \tag{10.2}$$

If we cannot derive a simple formula for λ(x), it might seem hopeless to derive the sampling distribution of λ(X) and thus know how to pick c to ensure (10.2). However, if we appeal to asymptotics, we can get an approximate answer.

Theorem 10.2.1 (Asymptotic distribution of the LRT — simple H0) For testing H0 : θ = θ0 versus H1 : θ ≠ θ0, suppose X1, . . . , Xn are iid f(x|θ), θ̂ is the MLE of θ, and f(x|θ) satisfies the regularity conditions. Then under H0, as n → ∞,

$$-2\log\lambda(\mathbf X) \to \chi^2_1 \quad\text{in distribution},$$

where χ²₁ is a chi-squared random variable with 1 degree of freedom.

Proof: First expand log L(θ|x) = l(θ|x) in a Taylor series around θ̂, giving

$$l(\theta|\mathbf x) = l(\hat\theta|\mathbf x) + l'(\hat\theta|\mathbf x)(\theta - \hat\theta) + l''(\hat\theta|\mathbf x)\,\frac{(\theta - \hat\theta)^2}{2!} + \cdots.$$

Thus, we have

$$l(\theta_0|\mathbf x) \approx l(\hat\theta|\mathbf x) + l'(\hat\theta|\mathbf x)(\theta_0 - \hat\theta) + l''(\hat\theta|\mathbf x)\,\frac{(\theta_0 - \hat\theta)^2}{2!}$$

and

$$-2\log\lambda(\mathbf x) = -2\,l(\theta_0|\mathbf x) + 2\,l(\hat\theta|\mathbf x) \approx -l''(\hat\theta|\mathbf x)(\theta_0 - \hat\theta)^2,$$

where we use the fact that l′(θ̂|x) = 0. Since −l″(θ̂|x) is the observed information Î_n(θ̂) and (1/n)Î_n(θ̂) → I(θ0), it follows from Theorem 10.1.4 and Slutsky's Theorem that −2 log λ(X) → χ²₁. ∎

Table 10.1: Simulated (exact) and approximate percentiles of the Poisson LRT statistic (n = 25 and λ0 = 5)

Percentile    0.80    0.90    0.95    0.99
Simulated     1.630   2.726   3.744   6.304
χ²₁           1.642   2.706   3.841   6.635

Example 10.2.1 (Poisson LRT) For testing H0 : λ = λ0 versus H1 : λ ≠ λ0 based on observing X1, . . . , Xn iid Poisson(λ), we have

$$-2\log\lambda(\mathbf x) = -2\log\!\left(\frac{e^{-n\lambda_0}\lambda_0^{\sum x_i}}{e^{-n\hat\lambda}\hat\lambda^{\sum x_i}}\right) = 2n\big[(\lambda_0 - \hat\lambda) - \hat\lambda\log(\lambda_0/\hat\lambda)\big],$$

where λ̂ = Σxi/n is the MLE of λ. Applying Theorem 10.2.1, we would reject H0 at level α if −2 log λ(x) > χ²_{1,α}. A comparison of the simulated (exact) and χ²₁ (approximate) cutoff points in Table 10.1 shows that the cutoffs are remarkably similar; the sketch below reproduces the comparison.
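This is a simulation sketch along the lines of Table 10.1; the replication count and seed are our choices.

```python
import numpy as np
from scipy import stats

# Simulate -2 log lambda under H0 for the Poisson LRT with n = 25, lambda0 = 5
# and compare its percentiles with the chi-squared(1) percentiles.
rng = np.random.default_rng(4)
n, lam0, reps = 25, 5.0, 100_000

lam_hat = rng.poisson(lam0, size=(reps, n)).mean(axis=1)
lrt = 2 * n * ((lam0 - lam_hat) - lam_hat * np.log(lam0 / lam_hat))

for q in [0.80, 0.90, 0.95, 0.99]:
    print(f"q = {q:.2f}   simulated = {np.quantile(lrt, q):6.3f}   "
          f"chi2_1 = {stats.chi2.ppf(q, df=1):6.3f}")
```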

Theorem 10.2.1 can be extended to cases where the null hypothesis concerns a vector of parameters.

Theorem 10.2.2 Let X1, . . . , Xn be a random sample from a pdf or pmf f(x|θ). Under the regularity conditions, if θ ∈ Θ0, then the distribution of the statistic −2 log λ(X) converges to a chi-squared distribution as the sample size n → ∞. The degrees of freedom of the limiting distribution equal the difference between the number of free parameters specified by θ ∈ Θ and the number of free parameters specified by θ ∈ Θ0.

Rejection of H0 : θ ∈ Θ0 for small values of λ(X) is equivalent to rejection for large values of −2 log λ(X). Thus,

H0 is rejected if and only if −2 log λ(X) ≥ χ²_{ν,α},

where ν is the degrees of freedom specified in Theorem 10.2.2. The Type I error probability will be approximately α if θ ∈ Θ0 and the sample size is large. In this way, (10.2) will be approximately satisfied for large sample sizes, and an asymptotic size α test has been defined. Note that the theorem actually implies only that

$$\lim_{n\to\infty} P_\theta(\text{reject } H_0) = \alpha \quad\text{for each } \theta \in \Theta_0,$$

not that sup_{θ∈Θ0} Pθ(reject H0) converges to α. This is usually the case for asymptotic size α tests.

The computation of the degrees of freedom for the test statistic is usually straightforward. Most often, Θ can be represented as a subset of q-dimensional Euclidean space that contains an open subset of ℝ^q, and Θ0 can be represented as a subset of p-dimensional Euclidean space that contains an open subset of ℝ^p, where p < q. Then ν = q − p is the degrees of freedom for the test statistic.

Example 10.2.2 (Multinomial LRT) Let θ = (p1, p2, p3, p4, p5), where the pj's are nonnegative and sum to 1. Suppose X1, . . . , Xn are iid discrete random variables with Pθ(Xi = j) = pj, j = 1, . . . , 5. Thus the pmf of Xi is f(j|θ) = pj, and the likelihood function is

$$L(\theta|\mathbf x) = \prod_{i=1}^n f(x_i|\theta) = p_1^{y_1} p_2^{y_2} p_3^{y_3} p_4^{y_4} p_5^{y_5},$$

where yj is the number of x1, . . . , xn equal to j. Consider testing

H0 : p1 = p2 = p3 and p4 = p5 versus H1 : H0 is not true.

The full parameter space, Θ, is really a four-dimensional set, since p5 = 1 − p1 − p2 − p3 − p4; thus q = 4. There is only one free parameter in the set Θ0 because, once p1 is fixed, the others are also determined (p2 = p3 = p1 and p4 = p5 = (1 − 3p1)/2). Thus p = 1, and the degrees of freedom are ν = 4 − 1 = 3.

Simple calculations show that the MLE of pj under Θ is p̂j = yj/n and that the MLE of p1 under Θ0 is p̂10 = (y1 + y2 + y3)/(3n). Thus, we have

$$\lambda(\mathbf x) = \left(\frac{y_1+y_2+y_3}{3y_1}\right)^{\!y_1}\!\left(\frac{y_1+y_2+y_3}{3y_2}\right)^{\!y_2}\!\left(\frac{y_1+y_2+y_3}{3y_3}\right)^{\!y_3}\!\left(\frac{y_4+y_5}{2y_4}\right)^{\!y_4}\!\left(\frac{y_4+y_5}{2y_5}\right)^{\!y_5},$$

and the test statistic is

$$-2\log\lambda(\mathbf x) = 2\sum_{i=1}^5 y_i\log\!\left(\frac{y_i}{m_i}\right),$$

where m1 = m2 = m3 = (y1 + y2 + y3)/3 and m4 = m5 = (y4 + y5)/2. The asymptotic size α test rejects H0 if −2 log λ(x) ≥ χ²_{3,α}.
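For concreteness, here is a sketch computing the statistic; the cell counts y are made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Multinomial LRT of H0: p1 = p2 = p3 and p4 = p5 from Example 10.2.2.
y = np.array([18, 22, 26, 14, 20])                       # hypothetical cell counts
m = np.array([y[:3].mean()] * 3 + [y[3:].mean()] * 2)    # fitted counts under H0

lrt = 2 * np.sum(y * np.log(y / m))
p_value = stats.chi2.sf(lrt, df=3)                       # nu = q - p = 4 - 1 = 3
print(f"-2 log lambda = {lrt:.3f},  approximate p-value = {p_value:.3f}")
```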

10.2.2 Other Large-Sample Tests

Another common method of constructing a large-sample test statistic is based on an estimator that has an asymptotic normal distribution. Suppose we wish to test a hypothesis about a real-valued parameter θ, and Wn = W(X1, . . . , Xn) is a point estimator of θ based on a sample of size n. For example, Wn might be the MLE of θ. An approximate test, based on a normal approximation, can be justified in the following way. If σn² denotes the variance of Wn and if we can use some form of the central limit theorem to show that, as n → ∞, (Wn − θ)/σn converges in distribution to a standard normal random variable, then (Wn − θ)/σn can be compared to a N(0, 1) distribution. We therefore have the basis for an approximate test.

In some instances, σn also depends on unknown parameters. In such a case, we look for an estimate Sn of σn with the property that σn/Sn converges in probability to 1. Then, by Slutsky's Theorem, (Wn − θ)/Sn also converges in distribution to a standard normal distribution. A large-sample test may be based on this fact.

Suppose we wish to test the two-sided hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0. An approximate test can be based on the statistic Zn = (Wn − θ0)/Sn and would reject H0 if and only if Zn < −z_{α/2} or Zn > z_{α/2}. If H0 is true, then θ = θ0 and Zn converges in distribution to Z ∼ N(0, 1). Thus, the Type I error probability satisfies

$$P_{\theta_0}(Z_n < -z_{\alpha/2} \text{ or } Z_n > z_{\alpha/2}) \to P(Z < -z_{\alpha/2} \text{ or } Z > z_{\alpha/2}) = \alpha,$$

and this is an asymptotically size α test.

Now consider an alternative parameter value θ ≠ θ0. We can write

$$Z_n = \frac{W_n - \theta_0}{S_n} = \frac{W_n - \theta}{S_n} + \frac{\theta - \theta_0}{S_n}.$$

No matter what the value of θ, the term (Wn − θ)/Sn → N(0, 1) in distribution. Typically, it is also the case that σn → 0 as n → ∞. Thus, Sn will converge in probability to 0, and the term (θ − θ0)/Sn will converge to +∞ or −∞ in probability, according as θ − θ0 is positive or negative. Thus,

$$P_\theta(\text{reject } H_0) = P_\theta(Z_n < -z_{\alpha/2} \text{ or } Z_n > z_{\alpha/2}) \to 1 \quad\text{as } n\to\infty.$$

In this way, a test with asymptotic size α and asymptotic power 1 can be constructed.

If we wish to test the one-sided hypothesis H0 : θ ≤ θ0 versus H1 : θ > θ0, a similar test might be constructed. Again, the test statistic Zn = (Wn − θ0)/Sn would be used, and the test would reject H0 if and only if Zn > zα. Using reasoning similar to the above, we can conclude that the power function of this test converges to 0, α, or 1 according as θ < θ0, θ = θ0, or θ > θ0. Thus this test has reasonable asymptotic power properties.

In general, a Wald test is a test based on a statistic of the form

$$Z_n = \frac{W_n - \theta_0}{S_n},$$

where θ0 is a hypothesized value of the parameter θ, Wn is an estimator of θ, and Sn is a standard error for Wn, an estimate of the standard deviation of Wn. For example, if Wn is the MLE of θ, then $1/\sqrt{I_n(W_n)}$ is a reasonable standard error for Wn, and it can be estimated by $1/\sqrt{\hat I_n(W_n)}$, where

$$\hat I_n(W_n) = -\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\Big|_{\theta=W_n}$$

is the observed information number.

Example 10.2.3 (Large-sample binomial tests) Let X1, . . . , Xn be a random sample from a Bernoulli(p) population. Consider testing H0 : p ≤ p0 versus H1 : p > p0, where 0 < p0 < 1 is a specified value. The MLE of p, based on a sample of size n, is p̂n = Σᵢ Xi/n. The Central Limit Theorem states that for any p, 0 < p < 1, (p̂n − p)/σn converges in distribution to a standard normal random variable. Here σn = √(p(1 − p)/n), a value that depends on the unknown parameter p. A reasonable estimate of σn is Sn = √(p̂n(1 − p̂n)/n), and it can be shown that σn/Sn converges in probability to 1. Thus, for any p, 0 < p < 1,

$$\frac{\hat p_n - p}{\sqrt{\hat p_n(1-\hat p_n)/n}} \to N(0, 1) \quad\text{in distribution.}$$

The Wald test statistic Zn is defined by replacing p by p0, and the large-sample Wald test rejects H0 if Zn > zα. As an alternative route to the same standard error, it is easily checked that $1/\hat I_n(\hat p_n) = \hat p_n(1-\hat p_n)/n$.

If there is interest in testing the two-sided hypothesis H0 : p = p0 versus H1 : p ≠ p0, where 0 < p0 < 1 is a specified value, the above strategy is again applicable. However, in this case there is an alternative approximate test. By the Central Limit Theorem, for any p, 0 < p < 1,

$$\frac{\hat p_n - p}{\sqrt{p(1-p)/n}} \to N(0, 1).$$

Therefore, if the null hypothesis is true, the statistic

$$Z_n' = \frac{\hat p_n - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0, 1) \quad\text{(approximately)}.$$

The approximate level α test rejects H0 if |Z′n| > z_{α/2}.
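A sketch computing both statistics follows; the sample size, success count, p0, and α are made-up values for illustration.

```python
import numpy as np
from scipy import stats

# Two-sided large-sample tests of H0: p = p0 from Example 10.2.3:
# the Wald statistic Z uses the standard error evaluated at phat, while the
# alternative statistic Z' uses the null standard error evaluated at p0.
n, successes, p0, alpha = 120, 72, 0.5, 0.05
phat = successes / n

z_wald = (phat - p0) / np.sqrt(phat * (1 - phat) / n)
z_alt  = (phat - p0) / np.sqrt(p0 * (1 - p0) / n)

z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"Z = {z_wald:.3f},  Z' = {z_alt:.3f},  reject H0 if |Z| > {z_crit:.3f}")
```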

10.3 Interval Estimation

10.3.1 Approximate Maximum Likelihood Intervals

With Theorem 10.1.4, we have a general method for deriving the asymptotic distribution of an MLE and, hence, a general method for constructing a confidence interval. If X1, . . . , Xn are iid f(x|θ) and θ̂ is the MLE of θ, then the variance of a function h(θ̂) can be approximated by

$$\widehat{\mathrm{Var}}(h(\hat\theta)|\theta) \approx \frac{[h'(\theta)]^2\big|_{\theta=\hat\theta}}{-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf x)\big|_{\theta=\hat\theta}}.$$

Now, for a fixed but arbitrary value of θ, we are interested in the asymptotic distribution of

$$\frac{h(\hat\theta) - h(\theta)}{\sqrt{\widehat{\mathrm{Var}}(h(\hat\theta)|\theta)}}.$$

It follows from Theorem 10.1.4 and Slutsky's Theorem that

$$\frac{h(\hat\theta) - h(\theta)}{\sqrt{\widehat{\mathrm{Var}}(h(\hat\theta)|\theta)}} \to N(0, 1),$$

giving the approximate confidence interval

$$h(\hat\theta) - z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(h(\hat\theta)|\theta)} \;\le\; h(\theta) \;\le\; h(\hat\theta) + z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(h(\hat\theta)|\theta)}.$$

Example 10.3.1 We have a random sample X1, . . . , Xn from a Bernoulli(p) population. We saw that we could estimate the odds p/(1 − p) by its MLE p̂/(1 − p̂) and that this estimate has approximate variance

$$\widehat{\mathrm{Var}}\!\left(\frac{\hat p}{1-\hat p}\right) \approx \frac{\hat p}{n(1-\hat p)^3}.$$

We therefore can construct the approximate confidence interval

$$\frac{\hat p}{1-\hat p} - z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}\!\left(\frac{\hat p}{1-\hat p}\right)} \;\le\; \frac{p}{1-p} \;\le\; \frac{\hat p}{1-\hat p} + z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}\!\left(\frac{\hat p}{1-\hat p}\right)}.$$
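A sketch computing this interval (the data and confidence level are made up for illustration):

```python
import numpy as np
from scipy import stats

# Approximate 95% interval for the odds p/(1-p) from Example 10.3.1, using the
# delta-method variance estimate phat / (n (1 - phat)^3).
n, successes, alpha = 150, 60, 0.05
phat = successes / n

odds = phat / (1 - phat)
se = np.sqrt(phat / (n * (1 - phat) ** 3))
z = stats.norm.ppf(1 - alpha / 2)
print(f"odds estimate {odds:.3f},  interval ({odds - z * se:.3f}, {odds + z * se:.3f})")
```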

In Section 10.2 we derived a likelihood ratio test based on the fact that −2 log λ(X) has an asymptotic chi-squared distribution. This suggests that if X1, . . . , Xn are iid f(x|θ) and θ̂ is the MLE of θ, then the set

$$\left\{\theta : -2\log\frac{L(\theta|\mathbf x)}{L(\hat\theta|\mathbf x)} \le \chi^2_{1,\alpha}\right\}$$

is an approximate 1 − α confidence set. This is indeed the case and gives us yet another approximate likelihood interval.

Example 10.3.2 (Binomial LRT interval) For $Y = \sum_{i=1}^n X_i$, where each Xi is an independent Bernoulli(p) random variable, we have the approximate 1 − α confidence set

$$\left\{p : -2\log\frac{p^y(1-p)^{n-y}}{\hat p^y(1-\hat p)^{n-y}} \le \chi^2_{1,\alpha}\right\}.$$
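This set has no closed form, but it is easy to trace numerically. Here is a sketch using the same made-up data as above:

```python
import numpy as np
from scipy import stats

# Binomial LRT interval of Example 10.3.2, found by scanning a grid of p values
# and keeping those with -2 log lambda below the chi-squared(1) cutoff.
n, y, alpha = 150, 60, 0.05
phat = y / n
cutoff = stats.chi2.ppf(1 - alpha, df=1)

p = np.linspace(1e-6, 1 - 1e-6, 100_000)
lrt = -2 * (y * np.log(p / phat) + (n - y) * np.log((1 - p) / (1 - phat)))
inside = p[lrt <= cutoff]
print(f"approximate 95% LRT interval: ({inside.min():.4f}, {inside.max():.4f})")
```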

10.3.2 Other Large-Sample Intervals

Most approximate confidence intervals are based on either finding approximate (or asymptotic) pivots or inverting approximate level α tests. If we have statistics W and V and a parameter θ such that, as n → ∞,

$$\frac{W - \theta}{V} \to N(0, 1) \quad\text{in distribution},$$

then we can form the approximate confidence interval for θ given by

$$W - z_{\alpha/2}\,V \le \theta \le W + z_{\alpha/2}\,V.$$

Example 10.3.3 (Approximate interval) If X1, . . . , Xn are iid with mean µ and variance σ², then, from the Central Limit Theorem,

$$\frac{\bar X - \mu}{\sigma/\sqrt n} \to N(0, 1).$$

Moreover, from Slutsky's Theorem, if S² → σ² in probability, then

$$\frac{\bar X - \mu}{S/\sqrt n} \to N(0, 1),$$

giving the approximate 1 − α confidence interval

$$\bar x - z_{\alpha/2}\,s/\sqrt n \;\le\; \mu \;\le\; \bar x + z_{\alpha/2}\,s/\sqrt n.$$

In the above example, we could get an approximate confidence interval without specifying the form of the sampling distribution. We should be able to do better when we do specify the form.

Example 10.3.4 (Approximate Poisson interval) If X1, . . . , Xn are iid Poisson(λ), then we know that

$$\frac{\bar X - \lambda}{S/\sqrt n} \to N(0, 1).$$

However, this is true even if we had not sampled from a Poisson population. Using the Poisson assumption, we know that Var X = λ = EX, and X̄ is a good estimator of λ. Thus, using the Poisson assumption, we could also get an approximate confidence interval from the fact that

$$\frac{\bar X - \lambda}{\sqrt{\bar X/n}} \to N(0, 1).$$

We can use the Poisson assumption in yet another way. Since Var X = λ, it follows that

$$\frac{\bar X - \lambda}{\sqrt{\lambda/n}} \to N(0, 1).$$

Generally speaking, a reasonable rule of thumb is to use as few estimates and as many parameters as possible in an approximation. This is sensible for a very simple reason: parameters are fixed and do not introduce any added variability into an approximation, while each statistic brings more variability along with it. The sketch below compares the coverage of the three resulting intervals.
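A closing sketch (n, λ, the replication count, and the seed are our choices):

```python
import numpy as np
from scipy import stats

# Coverage of three approximate 95% Poisson intervals for lambda: studentized
# by the sample standard deviation S, by sqrt(Xbar) (Poisson assumption with an
# estimate), and by sqrt(lambda) (Poisson assumption with no estimate; inverting
# this pivot for an interval requires solving a quadratic in lambda, but the
# coverage event checked here is the same).
rng = np.random.default_rng(5)
n, lam, reps = 30, 4.0, 50_000
z = stats.norm.ppf(0.975)

x = rng.poisson(lam, size=(reps, n))
xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)

for name, half in [("S/sqrt(n)", z * s / np.sqrt(n)),
                   ("sqrt(Xbar/n)", z * np.sqrt(xbar / n)),
                   ("sqrt(lambda/n)", z * np.sqrt(lam / n))]:
    print(f"{name:>15}: coverage ~ {np.mean(np.abs(xbar - lam) <= half):.4f}")
```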