
Chapter 4. Efficient Likelihood Estimation and Related Tests

1. Maximum likelihood and efficient likelihood estimation

2. Likelihood ratio, Wald, and Rao (or score) tests

3. Examples

4. Consistency of Maximum Likelihood Estimates

5. The EM algorithm and related methods

6. Nonparametric MLE

7. Limit theory for the statistical agnostic: $P \notin \mathcal{P}$

Efficient Likelihood Estimation and Related Tests

1 Maximum likelihood and efficient likelihood estimation

We begin with a brief discussion of Kullback-Leibler information.

Definition 1.1 Let $P$ be a probability measure, and let $Q$ be a sub-probability measure on $(\mathcal{X},\mathcal{A})$ with densities $p$ and $q$ with respect to a $\sigma$-finite measure $\mu$ ($\mu = P + Q$ always works). Thus $P(\mathcal{X}) = 1$ and $Q(\mathcal{X}) \le 1$. Then the Kullback-Leibler information $K(P,Q)$ is
\[
(1)\qquad K(P,Q) \equiv E_P\left[\log\frac{p(X)}{q(X)}\right].
\]

Lemma 1.1 For a probability measure $P$ and a (sub-)probability measure $Q$, the Kullback-Leibler information $K(P,Q)$ is always well-defined, and
\[
K(P,Q)\ \begin{cases} \in [0,\infty] & \text{always,}\\ = 0 & \text{if and only if } Q = P.\end{cases}
\]

Proof. Now
\[
K(P,Q) = \begin{cases} \log 1 = 0 & \text{if } P = Q,\\ \log M > 0 & \text{if } P = MQ,\ M > 1.\end{cases}
\]
If $P \neq MQ$, then Jensen's inequality is strict and yields
\[
K(P,Q) = -E_P\log\frac{q(X)}{p(X)} > -\log E_P\frac{q(X)}{p(X)} = -\log E_Q 1_{[p(X)>0]} \ge -\log 1 = 0. \qquad\Box
\]
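As a quick numerical check of Lemma 1.1 (an illustration, not part of the notes; the distributions, sample size, and seed are arbitrary choices), the following Python sketch estimates $K(P,Q) = E_P\log(p(X)/q(X))$ by Monte Carlo for $P = N(0,1)$ and $Q = N(1,1)$, where the exact value is $(\mu_P-\mu_Q)^2/2 = 1/2$:

```python
import math
import random

def kl_mc(logp, logq, sampler, n=200_000, seed=0):
    """Monte Carlo estimate of K(P, Q) = E_P[log p(X) - log q(X)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sampler(rng)
        total += logp(x) - logq(x)
    return total / n

# P = N(0,1), Q = N(1,1); for these choices the exact value is K(P, Q) = 1/2.
log_phi = lambda x, m: -0.5 * (x - m) ** 2 - 0.5 * math.log(2 * math.pi)
est = kl_mc(lambda x: log_phi(x, 0.0), lambda x: log_phi(x, 1.0),
            lambda rng: rng.gauss(0.0, 1.0))
print(est)  # close to 0.5 and, as Lemma 1.1 guarantees, nonnegative
```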

Now we need some assumptions and notation. Suppose that the model $\mathcal{P}$ is given by
\[
\mathcal{P} = \{P_\theta : \theta \in \Theta\}.
\]

We will impose the following hypotheses about $\mathcal{P}$:

Assumptions:

A0. $\theta \neq \theta^*$ implies $P_\theta \neq P_{\theta^*}$.

A1. $A \equiv \{x : p_\theta(x) > 0\}$ does not depend on $\theta$.

A2. $P_\theta$ has density $p_\theta$ with respect to the $\sigma$-finite measure $\mu$, and $X_1,\dots,X_n$ are i.i.d. $P_{\theta_0} \equiv P_0$.

Notation:
\[
L(\theta) \equiv L_n(\theta) \equiv L(\theta\mid\underline{X}) \equiv \prod_{i=1}^n p_\theta(X_i),
\]
\[
l(\theta) = l(\theta\mid\underline{X}) \equiv l_n(\theta) \equiv \log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i),
\]
\[
l(B) \equiv l(B\mid\underline{X}) \equiv l_n(B) = \sup_{\theta\in B} l(\theta\mid\underline{X}).
\]
Here is a preliminary result which motivates our definition of the maximum likelihood estimator.

Theorem 1.1 If A0-A2 hold, then for $\theta \neq \theta_0$
\[
\frac1n \log\frac{L_n(\theta_0)}{L_n(\theta)} = \frac1n\sum_{i=1}^n\log\frac{p_{\theta_0}(X_i)}{p_\theta(X_i)} \to_{a.s.} K(P_{\theta_0},P_\theta) > 0,
\]
and hence

\[
P_{\theta_0}\big(L_n(\theta_0\mid\underline X) > L_n(\theta\mid\underline X)\big) \to 1 \quad\text{as } n\to\infty.
\]

Proof. The first assertion is just the strong law of large numbers; note that

\[
E_{\theta_0}\log\frac{p_{\theta_0}(X)}{p_\theta(X)} = K(P_{\theta_0},P_\theta) > 0
\]
by Lemma 1.1 and A0. The second assertion is an immediate consequence of the first. $\Box$

Theorem 1.1 motivates the following definition.

Definition 1.2 The value $\hat\theta = \hat\theta_n$ of $\theta$ which maximizes the likelihood $L(\theta\mid\underline X)$, if it exists and is unique, is the maximum likelihood estimator (MLE) of $\theta$. Thus $L(\hat\theta) = L(\Theta)$, or $l(\hat\theta_n) = l(\Theta)$.

Cautions:

- $\hat\theta_n$ may not exist.
- $\hat\theta_n$ may exist, but may not be unique.
- Note that the definition depends on the version of the density $p_\theta$ which is selected; since this is not unique, different versions of $p_\theta$ lead to different MLE's.

When $\Theta\subset\mathbb{R}^d$, the usual approach to finding $\hat\theta_n$ is to solve the likelihood (or score) equations
\[
(2)\qquad \dot l(\theta\mid\underline X) \equiv \dot l_n(\theta) = 0;
\]
i.e. $\dot l_{\theta_i}(\theta\mid\underline X) = 0$, $i=1,\dots,d$. The solution, $\tilde\theta_n$ say, may not be the MLE, but may yield simply a local maximum of $l(\theta)$. The likelihood ratio statistic for testing $H:\theta=\theta_0$ versus $K:\theta\neq\theta_0$ is

\[
\lambda_n = \frac{L(\Theta)}{L(\theta_0)} = \frac{\sup_{\theta\in\Theta}L(\theta\mid\underline X)}{L(\theta_0\mid\underline X)} = \frac{L(\hat\theta_n)}{L(\theta_0)},
\qquad\text{or}\qquad
\tilde\lambda_n = \frac{L(\tilde\theta_n)}{L(\theta_0)}.
\]
Write $P_0$, $E_0$ for $P_{\theta_0}$, $E_{\theta_0}$. Here are some more assumptions about the model $\mathcal{P}$ which we will use to treat these estimators and test statistics.

Assumptions, continued:

A3. $\Theta$ contains an open neighborhood $\Theta_0 \subset \mathbb{R}^d$ of $\theta_0$ for which:
(i) For $\mu$-a.e. $x$, $l(\theta\mid x) \equiv \log p_\theta(x)$ is twice continuously differentiable in $\theta$.
(ii) For a.e. $x$, the third-order derivatives exist and $\dddot l_{jkl}(\theta\mid x)$ satisfy $|\dddot l_{jkl}(\theta\mid x)| \le M_{jkl}(x)$ for $\theta\in\Theta_0$, for all $1\le j,k,l\le d$, with $E_0 M_{jkl}(X) < \infty$.

A4. (i) $E_0\{\dot l_j(\theta_0\mid X)\} = 0$ for $j=1,\dots,d$.
(ii) $E_0\{\dot l_j^2(\theta_0\mid X)\} < \infty$ for $j=1,\dots,d$.
(iii) $I(\theta_0) = \big(-E_0\{\ddot l_{jk}(\theta_0\mid X)\}\big)$ is positive definite.

Let
\[
Z_n \equiv \frac{1}{\sqrt n}\sum_{i=1}^n \dot l(\theta_0\mid X_i) \quad\text{and}\quad \tilde l(\theta_0\mid x) = I^{-1}(\theta_0)\dot l(\theta_0\mid x),
\]
so that
\[
I^{-1}(\theta_0)Z_n = \frac{1}{\sqrt n}\sum_{i=1}^n \tilde l(\theta_0\mid X_i).
\]

Theorem 1.2 Suppose that $X_1,\dots,X_n$ are i.i.d. $P_{\theta_0}\in\mathcal{P}$ with density $p_{\theta_0}$, where $\mathcal{P}$ satisfies A0-A4. Then:

(i) With probability converging to 1 there exist solutions $\tilde\theta_n$ of the likelihood equations such that $\tilde\theta_n \to_p \theta_0$ when $P_0 = P_{\theta_0}$ is true.

(ii) $\tilde\theta_n$ is asymptotically linear with influence function $\tilde l(\theta_0\mid x)$. That is,
\[
\sqrt n(\tilde\theta_n - \theta_0) = I^{-1}(\theta_0)Z_n + o_p(1) = \frac{1}{\sqrt n}\sum_{i=1}^n\tilde l(\theta_0\mid X_i) + o_p(1)
\to_d I^{-1}(\theta_0)Z \equiv D \sim N_d(0, I^{-1}(\theta_0)).
\]

(iii)

\[
2\log\tilde\lambda_n \to_d Z^T I^{-1}(\theta_0)Z = D^T I(\theta_0)D \sim \chi^2_d.
\]

(iv)

\[
W_n \equiv \sqrt n(\tilde\theta_n-\theta_0)^T\,\hat I_n(\tilde\theta_n)\,\sqrt n(\tilde\theta_n-\theta_0) \to_d D^T I(\theta_0)D = Z^T I^{-1}(\theta_0)Z \sim \chi^2_d,
\]
where

\[
\hat I_n(\tilde\theta_n) = \begin{cases}
I(\tilde\theta_n), & \text{or}\\
n^{-1}\sum_{i=1}^n \dot l(\tilde\theta_n\mid X_i)\dot l(\tilde\theta_n\mid X_i)^T, & \text{or}\\
-n^{-1}\sum_{i=1}^n \ddot l(\tilde\theta_n\mid X_i).
\end{cases}
\]

(v)

\[
R_n \equiv Z_n^T I^{-1}(\theta_0)Z_n \to_d Z^T I^{-1}(\theta_0)Z \sim \chi^2_d.
\]

Here we could replace $I(\theta_0)$ by any of the possibilities for $\hat I_n(\tilde\theta_n)$ given in (iv) and the conclusion continues to hold.

(vi) The model $\mathcal{P}$ satisfies the LAN condition at $\theta_0$:

\[
l(\theta_0 + n^{-1/2}t) - l(\theta_0) = t^T Z_n - \frac12 t^T I(\theta_0)t + o_{P_0}(1)
\to_d t^T Z - \frac12 t^T I(\theta_0)t \sim N\big(-(1/2)\sigma_0^2,\ \sigma_0^2\big),
\]
where $\sigma_0^2 \equiv t^T I(\theta_0)t$. Note that
\[
\sqrt n(\hat\theta_n-\theta_0) = \hat t_n = \operatorname{argmax}_t\{l_n(\theta_0+n^{-1/2}t) - l_n(\theta_0)\}
\to_d \operatorname{argmax}_t\{t^T Z - (1/2)t^T I(\theta_0)t\} = I^{-1}(\theta_0)Z \sim N_d(0, I^{-1}(\theta_0)).
\]

Remark 1.1 Note that the asymptotic form of the log-likelihood given in part (vi) of Theorem 1.2 is exactly the log-likelihood ratio for a normal mean model $N_d(I(\theta_0)t, I(\theta_0))$. Also note that

\[
t^T Z - \frac12 t^T I(\theta_0)t = \frac12 Z^T I^{-1}(\theta_0)Z - \frac12\big(t - I^{-1}(\theta_0)Z\big)^T I(\theta_0)\big(t - I^{-1}(\theta_0)Z\big),
\]
which is maximized as a function of $t$ by $t = I^{-1}(\theta_0)Z$, with maximum value $Z^T I^{-1}(\theta_0)Z/2$.

Corollary 1 Suppose that A0-A4 hold and that $\nu \equiv \nu(P_\theta) = q(\theta)$ is differentiable at $\theta_0\in\Theta$. Then $\tilde\nu_n \equiv q(\tilde\theta_n)$ satisfies
\[
\sqrt n(\tilde\nu_n - \nu_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\tilde l_\nu(\theta_0\mid X_i) + o_p(1) \to_d N\big(0,\ \dot q^T(\theta_0)I^{-1}(\theta_0)\dot q(\theta_0)\big),
\]
where $\tilde l_\nu(\theta_0\mid X_i) = \dot q^T(\theta_0)I^{-1}(\theta_0)\dot l(\theta_0\mid X_i)$ and $\nu_0 \equiv q(\theta_0)$.

If the likelihood equations (2) are difficult to solve or have multiple roots, then it is possible to use a one-step approximation. Suppose that $\bar\theta_n$ is a preliminary estimator of $\theta$, and set

\[
(3)\qquad \check\theta_n \equiv \bar\theta_n + \hat I_n^{-1}(\bar\theta_n)\big(n^{-1}\dot l_n(\bar\theta_n\mid\underline X)\big),
\]
where $\hat I_n(\bar\theta_n) \equiv -n^{-1}\ddot l_n(\bar\theta_n)$ is the observed information. The estimator $\check\theta_n$ is sometimes called a one-step estimator.

Theorem 1.3 Suppose that A0-A4 hold, and that $\bar\theta_n$ satisfies $n^{1/4}(\bar\theta_n-\theta_0) = o_p(1)$; note that the latter holds if $\sqrt n(\bar\theta_n-\theta_0) = O_p(1)$. Then
\[
\sqrt n(\check\theta_n-\theta_0) = I^{-1}(\theta_0)Z_n + o_p(1) \to_d N_d(0, I^{-1}(\theta_0)),
\]
where $Z_n \equiv n^{-1/2}\sum_{i=1}^n\dot l(\theta_0\mid X_i)$.

Proof. Theorem 1.2. (i) Existence and consistency. For $a > 0$, let

\[
Q_a \equiv \{\theta\in\Theta : |\theta-\theta_0| = a\}.
\]

(a) $\quad P\big(l(\theta) < l(\theta_0)\ \text{for all}\ \theta\in Q_a\big) \to 1$.

This implies that $L$ has a local maximum inside $Q_a$. Since the likelihood equations must be satisfied at a local maximum, it will follow that for any $a>0$, with probability converging to 1, the likelihood equations have a solution $\tilde\theta_n(a)$ within $Q_a$; taking the root closest to $\theta_0$ completes the proof. To prove (a), write

\[
\frac1n\big(l(\theta)-l(\theta_0)\big) = \frac1n(\theta-\theta_0)^T\dot l(\theta_0)
- \frac12(\theta-\theta_0)^T\Big(-\frac1n\ddot l(\theta_0)\Big)(\theta-\theta_0)
+ \frac{1}{6n}\sum_{j=1}^d\sum_{k=1}^d\sum_{l=1}^d(\theta_j-\theta_{j0})(\theta_k-\theta_{k0})(\theta_l-\theta_{l0})\sum_{i=1}^n\gamma_{jkl}(X_i)M_{jkl}(X_i)
\]
(b) $\quad = S_1 + S_2 + S_3$,

where, by A3(ii), $0 \le |\gamma_{jkl}(x)| \le 1$. Furthermore, by A3(ii) and A4,

(c) $\quad S_1 \to_p 0$,

(d) $\quad S_2 \to_p -\frac12(\theta-\theta_0)^T I(\theta_0)(\theta-\theta_0)$,

where

(e) $\quad (\theta-\theta_0)^T I(\theta_0)(\theta-\theta_0) \ge \lambda_d|\theta-\theta_0|^2 = \lambda_d a^2$,

and $\lambda_d$ is the smallest eigenvalue of $I(\theta_0)$ (recall that $\sup_x(x^TAx)/(x^Tx) = \lambda_1$ and $\inf_x(x^TAx)/(x^Tx) = \lambda_d$, where $\lambda_1 \ge \cdots \ge \lambda_d > 0$ are the eigenvalues of $A$ symmetric and positive definite), and

(f) $\quad S_3 \to_p \frac16\sum_j\sum_k\sum_l(\theta_j-\theta_{j0})(\theta_k-\theta_{k0})(\theta_l-\theta_{l0})\,E\{\gamma_{jkl}(X_1)M_{jkl}(X_1)\}$.

Thus for any given $\epsilon, a > 0$, for $n$ sufficiently large, with probability larger than $1-\epsilon$, for all $\theta\in Q_a$,

(g) $\quad S_1 < \cdots$

(o) $\quad \Big(-\frac1n\ddot l(\theta_n^*)\Big)^{-1}$

exists with high probability, and satisfies

(p) $\quad \Big(-\frac1n\ddot l(\theta_n^*)\Big)^{-1} \to_p I(\theta_0)^{-1}$.

Hence we can use (l) to write, on $G_n$,

(q) $\quad \sqrt n(\tilde\theta_n-\theta_0) = I^{-1}(\theta_0)Z_n + o_p(1) \to_d I^{-1}(\theta_0)Z \sim N_d(0, I^{-1}(\theta_0))$.

This proves (ii). It also follows from (n) that

(r) $\quad \sqrt n(\tilde\theta_n-\theta_0)^T\Big(-\frac1n\ddot l(\tilde\theta_n)\Big)\sqrt n(\tilde\theta_n-\theta_0) \to_d Z^T I^{-1}(\theta_0)Z \sim \chi^2_d$,

and that, since $I(\theta)$ is continuous at $\theta_0$,

(s) $\quad \sqrt n(\tilde\theta_n-\theta_0)^T I(\tilde\theta_n)\sqrt n(\tilde\theta_n-\theta_0) \to_d Z^T I^{-1}(\theta_0)Z \sim \chi^2_d$.

To prove (iii), we write, on the set $G_n$,

(t) $\quad l(\theta_0) = l(\tilde\theta_n) + \dot l^T(\tilde\theta_n)(\theta_0-\tilde\theta_n) - \frac12\sqrt n(\theta_0-\tilde\theta_n)^T\Big(-\frac1n\ddot l(\theta_n^*)\Big)\sqrt n(\theta_0-\tilde\theta_n)$,

where $|\theta_n^*-\theta_0| \le |\tilde\theta_n-\theta_0|$. Thus
\[
2\log\tilde\lambda_n = 2\{l(\tilde\theta_n)-l(\theta_0)\}
= 0 + 2\cdot\frac12\sqrt n(\tilde\theta_n-\theta_0)^T\Big(-\frac1n\ddot l(\theta_n^*)\Big)\sqrt n(\tilde\theta_n-\theta_0)
= D_n^T I(\theta_0)D_n + o_p(1),
\]
with $D_n \equiv \sqrt n(\tilde\theta_n-\theta_0)$, and
\[
D_n^T I(\theta_0)D_n \to_d D^T I(\theta_0)D \sim \chi^2_d, \quad\text{where } D \sim N_d(0, I^{-1}(\theta_0)).
\]
Finally, (v) is trivial since everything is evaluated at the fixed point $\theta_0$. $\Box$

Proof. Theorem 1.3. First note that

\[
\frac1n\ddot l_n(\bar\theta_n) = \frac1n\ddot l_n(\theta_0) + \frac1n\dddot l_n(\theta_n^*)(\bar\theta_n-\theta_0)
= \frac1n\ddot l_n(\theta_0) + O_p(1)|\bar\theta_n-\theta_0|,
\]
so that

(a) $\quad \Big(-\frac1n\ddot l_n(\bar\theta_n)\Big)^{-1} = \Big(-\frac1n\ddot l_n(\theta_0)\Big)^{-1} + O_p(1)|\bar\theta_n-\theta_0|$

and

(b) $\quad \frac{1}{\sqrt n}\dot l_n(\bar\theta_n) = \frac{1}{\sqrt n}\dot l_n(\theta_0) + \frac1n\ddot l_n(\theta_0)\sqrt n(\bar\theta_n-\theta_0)$

\[
\qquad\qquad + \frac12\sqrt n(\bar\theta_n-\theta_0)^T\Big(\frac1n\dddot l_n(\theta_n^*)\Big)(\bar\theta_n-\theta_0).
\]
Therefore it follows that
\[
\sqrt n(\check\theta_n-\theta_0) = \sqrt n(\bar\theta_n-\theta_0) + \Big(-\frac1n\ddot l_n(\bar\theta_n)\Big)^{-1}\frac{1}{\sqrt n}\dot l_n(\bar\theta_n)
= \sqrt n(\bar\theta_n-\theta_0) + \Big[\Big(-\frac1n\ddot l_n(\theta_0)\Big)^{-1} + O_p(1)|\bar\theta_n-\theta_0|\Big]
\]

\[
\qquad\times\Big[Z_n + \frac1n\ddot l_n(\theta_0)\sqrt n(\bar\theta_n-\theta_0) + \frac12\sqrt n(\bar\theta_n-\theta_0)^T\Big(\frac1n\dddot l_n(\theta_n^*)\Big)(\bar\theta_n-\theta_0)\Big]
\]
\[
= \Big(-\frac1n\ddot l_n(\theta_0)\Big)^{-1}Z_n + O_p(1)|\bar\theta_n-\theta_0|\,Z_n
+ O_p(1)\,\frac1n\ddot l_n(\theta_0)\,\sqrt n|\bar\theta_n-\theta_0|^2
+ O_p(1)\,\frac12\sqrt n(\bar\theta_n-\theta_0)^T\Big(\frac1n\dddot l_n(\theta_n^*)\Big)(\bar\theta_n-\theta_0)
\]
\[
= I^{-1}(\theta_0)Z_n + o_p(1) + O_p(1)\sqrt n|\bar\theta_n-\theta_0|^2
= I^{-1}(\theta_0)Z_n + o_p(1).
\]
Here we used

\[
\Big|\frac{1}{\sqrt n}\dddot l_n(\theta_n^*)(\bar\theta_n-\theta_0)(\bar\theta_n-\theta_0)\Big|
= \Big|\sum_{k=1}^d\sum_{l=1}^d\sqrt n(\bar\theta_{nk}-\theta_{0k})(\bar\theta_{nl}-\theta_{0l})\,\frac1n\dddot l_{jkl}(\theta_n^*\mid\underline X)\Big|
\le d^2\sqrt n|\bar\theta_n-\theta_0|^2\sum_{j=1}^d\frac1n\sum_{i=1}^n|\dddot l_{jkl}(\theta_n^*\mid X_i)|
= O_p(1)\sqrt n|\bar\theta_n-\theta_0|^2,
\]
since $|\bar\theta_{nk}-\theta_{0k}| \le |\bar\theta_n-\theta_0|$ for $k=1,\dots,d$ and $|x| \le d\max_{1\le k\le d}|x_k| \le d\sum_{k=1}^d|x_k|$. $\Box$

Exercise 1.1 Show that $K(P,Q) \ge 2H^2(P,Q)$.

2 The Wald, Likelihood Ratio, and Score (or Rao) Tests

Let $\theta_0\in\Theta$ be fixed. For testing
\[
(1)\qquad H:\theta=\theta_0\quad\text{versus}\quad K:\theta\neq\theta_0,
\]
recall the three test statistics

\[
(2)\qquad 2\log\tilde\lambda_n \equiv 2\{l_n(\tilde\theta_n)-l_n(\theta_0)\},
\]
\[
(3)\qquad W_n \equiv \sqrt n(\tilde\theta_n-\theta_0)^T\,\hat I_n(\tilde\theta_n)\,\sqrt n(\tilde\theta_n-\theta_0),
\]
and

\[
(4)\qquad R_n \equiv Z_n^T I^{-1}(\theta_0)Z_n,
\]
where
\[
(5)\qquad Z_n \equiv \frac{1}{\sqrt n}\dot l_n(\theta_0) = \frac{1}{\sqrt n}\dot l_n(\theta_0\mid\underline X).
\]
Theorem 1.2 described the null hypothesis behavior of these statistics; all three converge in distribution to $\chi^2_d$ when $P_0 = P_{\theta_0}$ is true. We now examine their behavior under alternatives, i.e. for $X_1,\dots,X_n$ i.i.d. $P_\theta$ with $\theta\neq\theta_0$.

Theorem 2.1 (Fixed alternatives). Suppose that $\theta\neq\theta_0$ and A0-A4 hold at both $\theta$ and $\theta_0$. Then:
\[
(6)\qquad \frac1n 2\log\tilde\lambda_n \to_p 2K(P_\theta,P_{\theta_0}) = 2K(P_{\rm true},P_{\rm hypothesized}) > 0,
\]
\[
(7)\qquad \frac1n W_n \to_p (\theta-\theta_0)^T I(\theta)(\theta-\theta_0) > 0.
\]
If, furthermore,

A5. $E_\theta|\dot l_i(\theta_0\mid X)| < \infty$ for $i=1,\dots,d$

holds, then
\[
(8)\qquad \frac1n R_n \to_p E_\theta\{\dot l(\theta_0\mid X)\}^T I^{-1}(\theta_0)E_\theta\{\dot l(\theta_0\mid X)\} > 0
\]
if $E_\theta\{\dot l(\theta_0\mid X)\} \neq 0$.

Proof. When $\theta\neq\theta_0$ is really true,
\[
\text{(a)}\qquad \frac2n\log\tilde\lambda_n = \frac2n\{l(\tilde\theta_n)-l(\theta_0)\}
= \frac2n\{l(\theta)-l(\theta_0)\} + \frac2n\{l(\tilde\theta_n)-l(\theta)\}
= \frac2n\sum_{i=1}^n\log\frac{p_\theta}{p_{\theta_0}}(X_i) + \frac2n\{l(\tilde\theta_n)-l(\theta)\}
\to_p 2E_\theta\log\frac{p_\theta}{p_{\theta_0}}(X) + 0 = 2K(P_\theta,P_{\theta_0})
\]
by the WLLN and Theorem 1.2 (the second term is $n^{-1}$ times a quantity converging in distribution to $\chi^2_d$, hence $o_p(1)$). Also, by the Mann-Wald (or continuous mapping) theorem,
\[
\text{(b)}\qquad \frac1n W_n = (\tilde\theta_n-\theta_0)^T\hat I_n(\tilde\theta_n)(\tilde\theta_n-\theta_0) \to_p (\theta-\theta_0)^T I(\theta)(\theta-\theta_0),
\]
and, since
\[
\text{(c)}\qquad \frac{1}{\sqrt n}Z_n = \frac1n\dot l_n(\theta_0) = \frac1n\sum_{i=1}^n\dot l(\theta_0\mid X_i) \to_p E_\theta\{\dot l(\theta_0\mid X)\},
\]
it follows that

\[
\text{(d)}\qquad \frac1n R_n \to_p E_\theta\{\dot l(\theta_0\mid X)\}^T I^{-1}(\theta_0)E_\theta\{\dot l(\theta_0\mid X)\}. \qquad\Box
\]

Corollary 1 (Consistency of the likelihood ratio, Wald, and score tests). If Assumptions A0-A5 hold, then the tests are consistent: i.e. if $\theta\neq\theta_0$, then
\[
(9)\qquad P_\theta(\text{LR test rejects } H) = P_\theta(2\log\tilde\lambda_n \ge \chi^2_{d,\alpha}) \to 1,
\]
\[
(10)\qquad P_\theta(\text{Wald test rejects } H) = P_\theta(W_n \ge \chi^2_{d,\alpha}) \to 1,
\]
\[
(11)\qquad P_\theta(\text{score test rejects } H) = P_\theta(R_n \ge \chi^2_{d,\alpha}) \to 1,
\]
assuming that $E_\theta\{\dot l(\theta_0\mid X)\} \neq 0$.

It remains to examine the behavior of these three tests under local alternatives $\theta_n = \theta_0 + tn^{-1/2}$ with $t\neq0$. We first examine $Z_n$ and $\tilde\theta_n$ under $P_{\theta_n}$ using Le Cam's third lemma 3.3.4.

Theorem 2.2 Suppose that A0-A4 hold. If $\theta_n = \theta_0 + tn^{-1/2}$ is true, then under $P_{\theta_n}$

\[
(12)\qquad \sqrt n(\tilde\theta_n-\theta_0) \to_d D + t \sim N_d(t, I^{-1}(\theta_0));
\]
furthermore,
\[
(13)\qquad Z_n(\theta_0) \equiv \frac{1}{\sqrt n}\sum_{i=1}^n\dot l(\theta_0\mid X_i) \to_d Z + I(\theta_0)t \sim N_d(I(\theta_0)t, I(\theta_0)).
\]

Hence we also have under Pθn ,

\[
(14)\qquad \sqrt n(\tilde\theta_n-\theta_n) = \sqrt n(\tilde\theta_n-\theta_0) - \sqrt n(\theta_n-\theta_0) \to_d D + t - t = D \sim N_d(0, I^{-1}(\theta_0));
\]
i.e. $\tilde\theta_n$ is locally regular. Furthermore,
\[
(15)\qquad Z_n(\theta_n) = Z_n(\theta_0) - \Big(-\frac1n\ddot l_n(\theta_n^*)\Big)\sqrt n(\theta_n-\theta_0)
\to_d Z + I(\theta_0)t - I(\theta_0)t = Z \sim N_d(0, I(\theta_0)).
\]

Proof. From the proof of Theorem 1.2 we know that $\tilde\theta_n$ is asymptotically linear under $P_0 = P_{\theta_0}$:
\[
\sqrt n(\tilde\theta_n-\theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\tilde l_\theta(\theta_0\mid X_i) + o_p(1),
\]
where $\tilde l_\theta(x) \equiv I^{-1}(\theta_0)\dot l_\theta(x)$. Furthermore, it follows from Theorem 1.2 part (vi) that the log likelihood ratio is asymptotically linear:
\[
\log\frac{dP^n_{\theta_n}}{dP^n_{\theta_0}} = l_n(\theta_n) - l_n(\theta_0) = t^T Z_n - \frac12 t^T I(\theta_0)t + o_p(1).
\]

Let $a\in\mathbb{R}^d$. Then with $T_n \equiv a^T\sqrt n(\tilde\theta_n-\theta_0)$ it follows from the multivariate CLT that
\[
\begin{pmatrix} T_n\\ \log\frac{dP^n_{\theta_n}}{dP^n_{\theta_0}}\end{pmatrix}
= \sum_{i=1}^n\frac{1}{\sqrt n}\begin{pmatrix} a^T\tilde l_\theta(\theta_0\mid X_i)\\ t^T\dot l_\theta(\theta_0\mid X_i)\end{pmatrix}
+ \begin{pmatrix}0\\ -\sigma^2/2\end{pmatrix} + o_p(1)
\to_d N_2\left(\begin{pmatrix}0\\ -\sigma^2/2\end{pmatrix},\ \begin{pmatrix} a^TI^{-1}(\theta_0)a & a^Tt\\ a^Tt & \sigma^2\end{pmatrix}\right).
\]
Thus the hypothesis of Le Cam's third lemma 3.3.4 is satisfied with $c = a^Tt$, and we deduce that, under $P_{\theta_n}$,

\[
\begin{pmatrix} a^T\sqrt n(\tilde\theta_n-\theta_0)\\ \log\frac{dP^n_{\theta_n}}{dP^n_{\theta_0}}\end{pmatrix}
\to_d N_2\left(\begin{pmatrix} a^Tt\\ +\sigma^2/2\end{pmatrix},\ \begin{pmatrix} a^TI^{-1}(\theta_0)a & a^Tt\\ a^Tt & \sigma^2\end{pmatrix}\right).
\]

In particular, under $P_{\theta_n}$,
\[
a^T\sqrt n(\tilde\theta_n-\theta_0) \to_d N\big(a^Tt,\ a^TI^{-1}(\theta_0)a\big),
\]
and by the Cramér-Wold device this implies that under $P_{\theta_n}$
\[
\sqrt n(\tilde\theta_n-\theta_0) \to_d N_d(t, I^{-1}(\theta_0)).
\]
This, in turn, implies that
\[
\sqrt n(\tilde\theta_n-\theta_n) \to_d N_d(0, I^{-1}(\theta_0))
\]
under $P_{\theta_n}$. The proof of the second part is similar, but easier, by taking $T_n \equiv a^TZ_n(\theta_0)$, which is already a linear statistic. $\Box$

Corollary 1 If A0-A4 hold and $\theta_n = \theta_0 + tn^{-1/2}$, then under $P_{\theta_n}$:
\[
(16)\qquad 2\log\tilde\lambda_n \to_d (D+t)^TI(\theta_0)(D+t) \sim \chi^2_d(\delta),
\]
\[
(17)\qquad W_n \to_d (D+t)^TI(\theta_0)(D+t) \sim \chi^2_d(\delta),
\]
\[
(18)\qquad R_n \to_d (Z+I(\theta_0)t)^TI^{-1}(\theta_0)(Z+I(\theta_0)t) = (D+t)^TI(\theta_0)(D+t) \sim \chi^2_d(\delta),
\]
where $\delta = t^TI(\theta_0)t$.

Proof. This follows from Theorem 2.2, the Mann-Wald theorem, and the fact that $X \sim N_d(\mu,\Sigma)$ implies $X^T\Sigma^{-1}X \sim \chi^2_d(\delta)$ with $\delta = \mu^T\Sigma^{-1}\mu$. $\Box$

Corollary 2 If A0-A4 hold, then with $T_n = 2\log\tilde\lambda_n$, $W_n$, or $R_n$,
\[
(19)\qquad P_{\theta_n}(T_n \ge \chi^2_{d,\alpha}) \to P\big(\chi^2_d(\delta) \ge \chi^2_{d,\alpha}\big).
\]

Three Statistics for Testing a Composite Null Hypothesis

Now consider testing $\theta\in\Theta_0 \equiv \{\theta\in\Theta : \theta_1 = \theta_{10}\}$; i.e.
\[
H:\theta_1=\theta_{10},\ \theta_2 \text{ anything} \quad\text{versus}\quad K:\theta=(\theta_1,\theta_2)\neq(\theta_{10},\theta_2),
\]
where $\theta \equiv (\theta_1,\theta_2)\in\mathbb{R}^m\times\mathbb{R}^{d-m} = \mathbb{R}^d$. Recall the corresponding partitioning of $I(\theta)$ and $I^{-1}(\theta)$ and the matrices $I_{11\cdot2}$, $I_{22\cdot1}$ introduced in Section 3.2.

The likelihood ratio, Wald, and Rao (or score) statistics for testing $H$ versus $K$ are

\[
(20)\qquad 2\log\lambda_n \quad\text{with}\quad \lambda_n \equiv \frac{\sup_{\theta\in\Theta}L(\theta\mid\underline X)}{\sup_{\theta\in\Theta_0}L(\theta\mid\underline X)} = \frac{L(\hat\theta_n\mid\underline X)}{L(\hat\theta_n^0\mid\underline X)}
\]
(or
\[
(21)\qquad 2\log\tilde\lambda_n \quad\text{with}\quad \tilde\lambda_n \equiv \frac{L(\tilde\theta_n\mid\underline X)}{L(\tilde\theta_n^0\mid\underline X)},
\]
where $\tilde\theta_n$, $\tilde\theta_n^0$ are consistent solutions of the likelihood equations under $K$ and $H$ respectively);
\[
(22)\qquad W_n \equiv \sqrt n(\tilde\theta_{n1}-\theta_{10})^T\,\hat I_{11\cdot2}\,\sqrt n(\tilde\theta_{n1}-\theta_{10});
\]
and
\[
(23)\qquad R_n \equiv Z_n^T(\hat\theta_n^0)\,\hat I_n^{-1}(\hat\theta_n^0)\,Z_n(\hat\theta_n^0),
\]
where $\hat\theta_n^0$ ($\tilde\theta_n^0$) is an MLE (ELE) of $\theta\in\Theta_0$.

Now under $H:\theta_0\in\Theta_0$ we have
\[
(24)\qquad \sqrt n(\tilde\theta_{n1}-\theta_{10}) \to_d D_1 \sim N_m(0, I^{-1}_{11\cdot2}),
\]
where
\[
D = \begin{pmatrix}D_1\\ D_2\end{pmatrix} = I^{-1}(\theta_0)Z
= \begin{pmatrix} I^{-1}_{11\cdot2}(Z_1 - I_{12}I_{22}^{-1}Z_2)\\ I^{-1}_{22\cdot1}(Z_2 - I_{21}I_{11}^{-1}Z_1)\end{pmatrix},
\]
and
\[
Z_n(\hat\theta_n^0) = \begin{pmatrix} Z_{n1}(\hat\theta_n^0)\\ Z_{n2}(\hat\theta_n^0)\end{pmatrix}
= \begin{pmatrix} Z_{n1}(\theta_0) - I_{12}(\theta_n^*)\sqrt n(\hat\theta^0_{n2}-\theta_{02}) + o_p(1)\\ 0\end{pmatrix}
= \begin{pmatrix} Z_{n1}(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}Z_{n2}(\theta_0) + o_p(1)\\ 0\end{pmatrix}
\]
\[
(25)\qquad \to_d \begin{pmatrix} Z_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}Z_2(\theta_0)\\ 0\end{pmatrix}
\sim \begin{pmatrix} N_m(0, I_{11\cdot2})\\ 0\end{pmatrix}.
\]

The natural consequences of (24) and (25) for the likelihood ratio, Wald, and Rao statistics are summarized in the following theorem.

Theorem 2.3 (Likelihood ratio, Wald, Rao statistics for composite null under null). If A0-A4 hold and $\theta_0\in\Theta_0$ is true, then

\[
\begin{pmatrix} 2\log\lambda_n\\ W_n\\ R_n\end{pmatrix} \to_d D_1^TI_{11\cdot2}D_1 \sim \chi^2_m = \chi^2_{d-(d-m)}.
\]

Proof. That $W_n \to_d D_1^TI_{11\cdot2}D_1$ follows from (24) and consistency of $\hat I_{11\cdot2}$. Similarly, $R_n \to_d D_1^TI_{11\cdot2}D_1$ follows from (25) and $\hat I_n^{-1}(\hat\theta_n^0) \to_p I^{-1}(\theta_0)$. To prove the claimed convergence of $2\log\lambda_n$, write
\[
2\log\lambda_n = 2\{l_n(\hat\theta_n) - l_n(\hat\theta_n^0)\}
= 2\{l_n(\hat\theta_n)-l_n(\theta_0)\} - 2\{l_n(\hat\theta_n^0)-l_n(\theta_0)\},
\]
where
\[
\text{(a)}\qquad 2\{l_n(\hat\theta_n)-l_n(\theta_0)\} \to_d D^TI(\theta_0)D = Z^TI^{-1}(\theta_0)Z
\]
by our proof of Theorem 1.2, and
\[
\text{(b)}\qquad 2\{l_n(\hat\theta_n^0)-l_n(\theta_0)\} \to_d Z_2^TI_{22}^{-1}(\theta_0)Z_2,
\]
again by the proof of Theorem 1.2. In fact, by the asymptotic linearity of $\hat\theta_n$ (and $\hat\theta_n^0$) proved there, the convergences in (a) and (b) hold jointly and imply that
\[
2\log\lambda_n \to_d Z^TI^{-1}(\theta_0)Z - Z_2^TI_{22}^{-1}Z_2
= (Z_1 - I_{12}I_{22}^{-1}Z_2)^TI^{-1}_{11\cdot2}(Z_1 - I_{12}I_{22}^{-1}Z_2)
= D_1^TI_{11\cdot2}D_1,
\]
where we have used the block inverse form of $I^{-1}(\theta_0)$ given in (3.2.x) and the matrix identity (3.2.15) with the roles of "1" and "2" interchanged. $\Box$

Now under local alternatives the situation is as follows:

Theorem 2.4 If A0-A4 hold and $\theta_n = \theta_0 + tn^{-1/2}$ with $\theta_0\in\Theta_0$, then under $P_{\theta_n}$

\[
\begin{pmatrix} 2\log\lambda_n\\ W_n\\ R_n\end{pmatrix} \to_d (D_1+t_1)^TI_{11\cdot2}(D_1+t_1) \sim \chi^2_m(\delta) = \chi^2_{d-(d-m)}(\delta),
\]
where $\delta = t_1^TI_{11\cdot2}t_1$.

3 Selected Examples

Now we consider several examples to illustrate the theory of the preceding two sections and its limitations.

Example 3.1 Let $X_1,\dots,X_n$ be i.i.d. $N(\mu,\sigma^2)$; $\theta=(\mu,\sigma^2)\in\Theta=\mathbb{R}\times\mathbb{R}^+$. Consider:

(i) Estimation of $\theta$.

(ii) Testing $H_1:\mu=0$ versus $K_1:\mu\neq0$.

(iii) Testing $H_2:\theta=(0,\sigma_0^2)\equiv\theta_0$ versus $K_2:\theta\neq\theta_0$.

(i) Now the likelihood is:

\[
L(\theta) = (2\pi\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2\Big).
\]
Thus the MLE of $\theta$ is
\[
\hat\theta_n = (\overline X_n, S_n^2)\quad\text{where}\quad S_n^2 = \frac1n\sum_{i=1}^n(X_i-\overline X_n)^2,
\]
and
\[
I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0\\ 0 & \frac{1}{2\sigma^4}\end{pmatrix},
\]
since
\[
l(\theta\mid X_1) = -\frac{1}{2\sigma^2}(X_1-\mu)^2 - \frac12\log\sigma^2 \quad (\text{plus a constant}),
\]
so that
\[
\dot l_\mu(X_1) = \frac{1}{\sigma^2}(X_1-\mu),\qquad \ddot l_{\mu\mu}(X_1) = -\frac{1}{\sigma^2},
\]
\[
\dot l_{\sigma^2}(X_1) = \frac{(X_1-\mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2},\qquad
\ddot l_{\sigma^2\sigma^2}(X_1) = -\frac{(X_1-\mu)^2}{\sigma^6} + \frac{1}{2\sigma^4}.
\]

\[
(1)\qquad \hat\theta_n = (\overline X_n, S_n^2) \to_{a.s.} (\mu,\sigma^2) = \theta,
\]
and
\[
(2)\qquad \sqrt n(\hat\theta_n-\theta) \to_d N_2(0, I^{-1}(\theta))
\]
with
\[
(3)\qquad I^{-1}(\theta) = \begin{pmatrix}\sigma^2 & 0\\ 0 & 2\sigma^4\end{pmatrix}.
\]
The almost sure consistency stated in (1) and the limiting distribution in (2) follow from direct application of the strong law of large numbers (Proposition 2.2.2) and the central limit theorem (Proposition 2.2.3) respectively, after easy manipulations and Slutsky's theorem. Alternatively, the in-probability version of (1) and the limiting distribution in (2) follow from Theorem 1.2, assuming that the normal model for the $X_i$'s holds.

(ii) The likelihood ratio statistic for testing $H_1$ is
\[
\lambda_n = \frac{L(\hat\theta_n)}{L(\hat\theta_n^0)} = \frac{L(\overline X, S^2)}{L(0,\ n^{-1}\sum_1^nX_i^2)} = \left(\frac{n^{-1}\sum_1^nX_i^2}{n^{-1}\sum_1^n(X_i-\overline X)^2}\right)^{n/2},
\]
and hence
\[
2\log\lambda_n = -n\log\left(1 - \frac{\overline X^2}{n^{-1}\sum_1^nX_i^2}\right);
\]
note that $-\log(1-x)\sim x$ as $x\to0$. The Wald statistic is
\[
W_n = \{\sqrt n(\overline X-0)\}\,\hat I_{11\cdot2}\,\{\sqrt n(\overline X-0)\} = \frac{n\overline X^2}{S^2} = \left(\frac{\sqrt n\,\overline X}{S}\right)^2.
\]
Finally, the Rao or score statistic is
\[
R_n = Z_n(\hat\theta_n^0)^T I(\hat\theta_n^0)^{-1}Z_n(\hat\theta_n^0)
= \begin{pmatrix}\dfrac{\sqrt n\,\overline X_n}{n^{-1}\sum_1^nX_i^2}\\ 0\end{pmatrix}^T
\begin{pmatrix} n^{-1}\sum_1^nX_i^2 & 0\\ 0 & 2\big(n^{-1}\sum_1^nX_i^2\big)^2\end{pmatrix}
\begin{pmatrix}\dfrac{\sqrt n\,\overline X_n}{n^{-1}\sum_1^nX_i^2}\\ 0\end{pmatrix}
= \frac{n\overline X^2}{n^{-1}\sum_1^nX_i^2}.
\]
If $\theta = \theta_0 = (0,\sigma_0^2)$, so $H_1$ holds, then
\[
2\log\lambda_n,\ W_n,\ R_n \to_d \chi^2_1.
\]
If $\mu\neq0$, so $\theta\notin\Theta_1$, then
\[
\frac1n 2\log\lambda_n \to_p -\log\left(1 - \frac{\mu^2}{\sigma^2+\mu^2}\right) = -\log\left(\frac{\sigma^2}{\sigma^2+\mu^2}\right) > 0,
\]
\[
\frac1n W_n \to_p \frac{\mu^2}{\sigma^2} > 0, \quad\text{and}\quad
\frac1n R_n \to_p \frac{\mu^2}{\sigma^2+\mu^2} > 0.
\]
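The three statistics for part (ii) are easy to compare numerically (a simulation sketch, not part of the notes; the sample size, variance, and seed are arbitrary choices). Under $H_1$ all three are asymptotically $\chi^2_1$ and differ only by $o_p(1)$; in fact $R_n \le 2\log\lambda_n \le W_n$ here, since $-\log(1-u)$ sits between $u$ and $u/(1-u)$:

```python
import math
import random

def mu_tests(x):
    """2 log lambda_n, W_n, R_n for H1: mu = 0 (sigma^2 unknown), normal model."""
    n = len(x)
    xbar = sum(x) / n
    m2 = sum(xi * xi for xi in x) / n      # null MLE of sigma^2: n^{-1} sum X_i^2
    s2 = m2 - xbar * xbar                  # S^2, the unrestricted MLE of sigma^2
    lr = -n * math.log(1.0 - xbar * xbar / m2)
    wald = n * xbar * xbar / s2
    rao = n * xbar * xbar / m2
    return lr, wald, rao

rng = random.Random(1)
x = [rng.gauss(0.0, 2.0) for _ in range(5000)]   # H1 is true here
lr, wald, rao = mu_tests(x)
print(lr, wald, rao)   # three nearby values, each approximately chi^2_1
```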

(iii) The likelihood ratio statistic $\lambda_n$ for testing $H_2$ is
\[
\lambda_n = \frac{L(\overline X,S^2)}{L(0,\sigma_0^2)} = \frac{(2\pi S^2)^{-n/2}\exp(-n/2)}{(2\pi\sigma_0^2)^{-n/2}\exp\big(-\sum_1^nX_i^2/2\sigma_0^2\big)}
= \left(\frac{S^2}{\sigma_0^2}\right)^{-n/2}\exp\left(\frac{1}{2\sigma_0^2}\sum_1^nX_i^2 - n/2\right),
\]
so that
\[
2\log\lambda_n = \frac{1}{\sigma_0^2}\sum_1^nX_i^2 - n - n\log\frac{S^2}{\sigma_0^2}
= n\left(\frac{S^2}{\sigma_0^2} - 1 - \log\frac{S^2}{\sigma_0^2}\right) + \frac{n\overline X^2}{\sigma_0^2}.
\]

The Wald statistic is
\[
W_n = \begin{pmatrix}\sqrt n(\overline X-0)\\ \sqrt n(S^2-\sigma_0^2)\end{pmatrix}^T
\begin{pmatrix}1/S^2 & 0\\ 0 & 1/(2S^4)\end{pmatrix}
\begin{pmatrix}\sqrt n(\overline X-0)\\ \sqrt n(S^2-\sigma_0^2)\end{pmatrix}
= \frac{n\overline X^2}{S^2} + \frac{\{\sqrt n(S^2-\sigma_0^2)\}^2}{2S^4}.
\]
The Rao or score statistic is given by
\[
R_n = \begin{pmatrix}\sqrt n\,\overline X/\sigma_0^2\\ \dfrac{\sqrt n}{2\sigma_0^2}\Big(\dfrac{n^{-1}\sum_1^nX_i^2}{\sigma_0^2}-1\Big)\end{pmatrix}^T
\begin{pmatrix}\sigma_0^2 & 0\\ 0 & 2\sigma_0^4\end{pmatrix}
\begin{pmatrix}\sqrt n\,\overline X/\sigma_0^2\\ \dfrac{\sqrt n}{2\sigma_0^2}\Big(\dfrac{n^{-1}\sum_1^nX_i^2}{\sigma_0^2}-1\Big)\end{pmatrix}
= \frac{n\overline X^2}{\sigma_0^2} + \frac n2\left(\frac{n^{-1}\sum_1^nX_i^2}{\sigma_0^2}-1\right)^2.
\]
If $H_2$ holds, then
\[
2\log\lambda_n,\ W_n,\ R_n \to_d \chi^2_2.
\]

Exercise 3.1 What are the limits in probability of $n^{-1}2\log\lambda_n$, $n^{-1}W_n$, and $n^{-1}R_n$ under $\theta\neq\theta_0$?

Exercise 3.2 In the context of Example 3.1, what are the likelihood ratio, Wald, and score statistics for testing $H_3:\sigma^2=\sigma_0^2$ versus $K_3:\sigma^2\neq\sigma_0^2$?

Example 3.2 (One-parameter exponential family). Suppose that $p_\theta(x) = \exp(\theta T(x) - A(\theta))$ with respect to $\mu$. Then

\[
\dot l_\theta(x) = T(x) - A'(\theta),\qquad I(\theta) = \operatorname{Var}_\theta(T(X)),
\]
and the likelihood equation may be written as
\[
(4)\qquad \frac1n\sum_{i=1}^nT(X_i) = A'(\theta).
\]
Now $A'(\theta) = E_\theta\{T(X)\}$ and $A''(\theta) = \operatorname{Var}_\theta(T(X)) > 0$, so the right side in (4) is strictly increasing in $\theta$. Thus (4) has at most one root $\hat\theta_n$, and
\[
\sqrt n(\hat\theta_n-\theta) \to_d N\big(0, 1/I(\theta)\big) = N\big(0, 1/\operatorname{Var}_\theta(T(X))\big).
\]

Example 3.3 (Multi-parameter exponential family; Lehmann and Casella, TPE, pages 23-32). Suppose that

\[
p_\theta(x) = \exp\Big(\sum_{j=1}^d\eta_j(\theta)T_j(x) - B(\theta)\Big)h(x)
\]
with respect to some dominating measure $\mu$. Here $B:\Theta\to\mathbb{R}$, $\eta_j:\Theta\to\mathbb{R}$, and $T_j:\mathcal{X}\to\mathbb{R}$ for $j=1,\dots,d$, with $\Theta\subset\mathbb{R}^k$ for some $k$. It is frequently convenient to use the $\eta_j$'s as the parameters and write the density in the canonical form

\[
p_\eta(x) \equiv p(x;\eta) = \exp\Big(\sum_{j=1}^d\eta_jT_j(x) - A(\eta)\Big)h(x).
\]

The natural parameter space $\Xi$ of the family is

\[
\Xi \equiv \Big\{\eta\in\mathbb{R}^d : \int\exp\{\eta^TT(x)\}h(x)\,d\mu(x) < \infty\Big\}.
\]
We will assume that the $T_j$'s are affinely independent: i.e. there do not exist constants $a_1,\dots,a_d$ and $b\in\mathbb{R}$ such that $\sum_{j=1}^da_jT_j(x) = b$ with probability 1.

By Lehmann and Casella, TPE, Theorem 5.8, page 27, we can differentiate under the expectations to find that the score vector for $\eta$ is given by

\[
\dot l_{\eta_j}(x) = T_j(x) - \frac{\partial}{\partial\eta_j}A(\eta),\qquad j=1,\dots,d,
\]
and since this has expectation 0 under $p_\eta$,
\[
0 = E_\eta(T_j(X)) - \frac{\partial}{\partial\eta_j}A(\eta),\qquad j=1,\dots,d.
\]
If the likelihood equations have a solution, it is unique (and is the MLE) since $l(\eta)$ is a strictly concave function of $\eta$, in view of the fact that $l(\eta)$ has Hessian (times minus one)
\[
-\ddot l(\eta\mid X) = \Big(\frac{\partial^2}{\partial\eta_j\partial\eta_l}A(\eta)\Big) = \big(\operatorname{Cov}_\eta[T_j(X),T_l(X)]\big),
\]
which is positive definite by the assumption of affine independence of the $T_j$'s.
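Because the right side of the likelihood equation in Examples 3.2-3.3 is strictly increasing (one parameter) or the log-likelihood strictly concave (several parameters), the MLE is easy to compute numerically. A sketch for the one-parameter case (an illustration, not from the notes): take the Poisson($\lambda$) family in canonical form, so $T(x) = x$, $A(\theta) = e^\theta$, and the true parameter is $\theta = \log\lambda$; the sampler, $\lambda = 3$, and the sample size are arbitrary choices:

```python
import math
import random

def solve_mle(tbar, a_prime, lo=-30.0, hi=30.0, tol=1e-12):
    """Bisection for the likelihood equation A'(theta) = Tbar; A' is increasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if a_prime(mid) < tbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rpois(rng, lam):
    """Poisson sampler (Knuth's multiplicative method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

# Poisson(lam) in canonical form: T(x) = x, A(theta) = exp(theta), theta = log(lam).
rng = random.Random(0)
x = [rpois(rng, 3.0) for _ in range(20_000)]
tbar = sum(x) / len(x)
theta_hat = solve_mle(tbar, math.exp)
print(theta_hat, math.log(tbar))   # the unique root is exactly log(Tbar)
```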

Example 3.4 (Cauchy location family). Suppose that $p_\theta(x) = g(x-\theta)$ with $g(x) = \pi^{-1}(1+x^2)^{-1}$, $x\in\mathbb{R}$. Then
\[
\dot l_\theta(X) = -\frac{g'}{g}(X-\theta) = \frac{2(X-\theta)}{1+(X-\theta)^2},
\]
$I(\theta) = 1/2$, and the likelihood equation becomes
\[
\dot l_\theta(\theta\mid\underline X) = \sum_{i=1}^n\dot l_\theta(X_i) = 2\sum_{i=1}^n\frac{X_i-\theta}{1+(X_i-\theta)^2} = 0
\]
if and only if
\[
0 = \sum_{i=1}^n(X_i-\theta)\prod_{j\neq i}\{1+(X_j-\theta)^2\},
\]
where the right side is a polynomial in $\theta$ of degree $2(n-1)+1 = 2n-1$, which could have as many as $2n-1$ roots. Let $\bar\theta_n \equiv \mathbb{F}_n^{-1}(1/2)$, the sample median. Then
\[
\sqrt n(\bar\theta_n-\theta) \to_d N(0, \pi^2/4),
\]
and the one-step adjustment (or "method of scoring") estimator is

\[
\check\theta_n = \bar\theta_n + I(\bar\theta_n)^{-1}\Big[\frac1n\sum_{i=1}^n\dot l_\theta(X_i;\bar\theta_n)\Big]
= \bar\theta_n + \frac4n\sum_{i=1}^n\frac{X_i-\bar\theta_n}{1+(X_i-\bar\theta_n)^2}.
\]
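A small simulation of this one-step estimator (a sketch; the true $\theta$, sample size, and seed are arbitrary choices, not from the notes): start at the sample median and take a single scoring step with $I(\theta) = 1/2$:

```python
import math
import random
import statistics

def one_step_cauchy(x):
    """One-step scoring estimator for the Cauchy location model."""
    med = statistics.median(x)     # sqrt(n)-consistent start, avar pi^2/4
    # I(theta)^{-1} = 2 and the score is 2(x-theta)/(1+(x-theta)^2),
    # which gives the (4/n)-sum form of the one-step update.
    return med + (4.0 / len(x)) * sum(
        (xi - med) / (1.0 + (xi - med) ** 2) for xi in x)

rng = random.Random(0)
theta = 1.0
# Cauchy(theta) draws by inversion: theta + tan(pi(U - 1/2)).
x = [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20_000)]
print(statistics.median(x), one_step_cauchy(x))
```

The single Newton/scoring step lowers the asymptotic variance from $\pi^2/4 \approx 2.47$ per observation (median) to $1/I(\theta) = 2$, the efficient value.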

Remark: Let $r_n$ denote the (random) number of roots of the likelihood equation. Then with probability one the roots are simple, and they alternately correspond to local minima and local maxima of the likelihood. Thus $r_n$ is odd, there are $(r_n+1)/2$ local maxima, $(r_n-1)/2$ local minima, and one global maximum. Thus the number of roots corresponding to "false" maxima is $(r_n-1)/2$. Reeds (1985) shows that $(r_n-1)/2 \to_d \text{Poisson}(1/\pi)$, so that as $n\to\infty$ we can expect to see relatively few roots corresponding to local maxima which are not global maxima. In fact,

\[
P\big((r_n-1)/2 \ge 1\big) \to P\big(\text{Poisson}(1/\pi) \ge 1\big) = 1 - e^{-1/\pi} = 0.272623\ldots.
\]

Example 3.5 (Normal mixture models; see TPE, page 442, Example 5.6). Suppose that
\[
p_\theta(x) = p\,N(\mu,\sigma^2) + (1-p)N(\nu,\tau^2)
= p\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big) + (1-p)\frac{1}{\sqrt{2\pi}\,\tau}\exp\Big(-\frac{(x-\nu)^2}{2\tau^2}\Big)
\]
with $\theta = (p,\mu,\sigma,\nu,\tau)$. The simpler case of $\theta = (1/2,\mu,\sigma,0,1)$ will illustrate the phenomena of interest here. In this case we may parametrize by $\theta = (\mu,\sigma)\in\mathbb{R}\times\mathbb{R}^+ = \Theta$, and the density can be written as
\[
(5)\qquad p_\theta(x) = \frac12\phi(x) + \frac12\,\frac1\sigma\,\phi\Big(\frac{x-\mu}{\sigma}\Big).
\]
Then, if $X_1,\dots,X_n$ are i.i.d. $p_\theta$,
\[
\sup_{\theta\in\Theta}L(\theta\mid\underline X) = \infty \quad\text{almost surely.}
\]
To see this, take $\mu$ equal to any $X_i$, and then let $\sigma\to0$. Thus MLE's do not exist. However, A0-A4 hold for this model, and hence there exists a consistent, asymptotically efficient sequence of roots of the likelihood equations. Alternatively, one-step estimators starting with moment estimators are also asymptotically efficient.

Example 3.6 (An inconsistent MLE). Suppose that
\[
p_\theta(x) = \frac\theta2\,1_{[-1,1]}(x) + \frac{1-\theta}{\delta(\theta)}\Big(1 - \frac{|x-\theta|}{\delta(\theta)}\Big)1_{A(\theta)}(x),
\]
where $\theta\in\Theta\equiv[0,1]$, $\delta \equiv \delta(\theta)$ is decreasing and continuous with $\delta(0) = 1$ and $0 < \delta(\theta) \le 1-\theta$ for $0<\theta<1$, and $A(\theta) \equiv [\theta-\delta(\theta),\ \theta+\delta(\theta)]$.

Note that $p_0(x) = (1-|x|)1_{[-1,1]}(x)$, the triangular density, while $p_1(x) = 2^{-1}1_{[-1,1]}(x)$, the uniform density. Note that A0-A2 hold, and, in addition, $p_\theta(x)$ is continuous in $\theta$ for all $x$. Thus an MLE exists for all $n\ge1$, since a continuous function on the compact set $[0,1]$ achieves its maximum on that set.

Proposition 3.1 (Ferguson). Let $\hat\theta_n$ be an MLE of $\theta$ for sample size $n$. If $\delta(\theta)\to0$ rapidly enough as $\theta\to1$, then $\hat\theta_n\to1$ a.s. $P_\theta$ as $n\to\infty$ for every $\theta\in[0,1]$. In fact, the function $\delta(\theta) = (1-\theta)\exp\big(-(1-\theta)^{-4}+1\big)$ works; for this choice of $\delta(\theta)$ it follows that $n^{1/4}(1-M_n)\to_{a.s.}0$, where $M_n = \max\{X_1,\dots,X_n\}$.

The details of this example are written out in Ferguson's A Course in Large Sample Theory, pages 116-117. The hypothesis that is violated in this example is that the family of log-likelihoods fails to have an integrable envelope.

Other examples of inconsistent MLE’s are given by Le Cam (1990) and by Boyles, Marshall, and Proschan (1985). Here is another situation in which the MLE fails, though for somewhat different reasons.

Example 3.7 (Neyman-Scott). Suppose that $(X_i,Y_i)\sim N_2\big((\mu_i,\mu_i),\sigma^2I\big)$ for $i=1,\dots,n$. Consider estimation of $\sigma^2$. Now $Z_i \equiv X_i - Y_i \sim N(0, 2\sigma^2)$, and therefore
\[
\frac{1}{2n}\sum_{i=1}^nZ_i^2 \sim \frac{\sigma^2}{n}\chi^2_n
\]
is an unbiased and consistent estimator of $\sigma^2$. However,
\[
L(\theta\mid\underline X,\underline Y) = (2\pi\sigma^2)^{-n}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n\{(X_i-\mu_i)^2 + (Y_i-\mu_i)^2\}\Big),
\]
and therefore the MLE of $\mu_i$ is $\hat\mu_i = (X_i+Y_i)/2$, and it follows that the MLE of $\sigma^2$ is

\[
\hat\sigma_n^2 = \frac{1}{4n}\sum_{i=1}^nZ_i^2 = \frac{1}{2n}\Big\{\sum_{i=1}^n(X_i-\hat\mu_i)^2 + \sum_{i=1}^n(Y_i-\hat\mu_i)^2\Big\}.
\]
Thus $\hat\sigma_n^2 \to_{a.s.} \sigma^2/2$ as $n\to\infty$. What went wrong? Note that the dimensionality of the parameter space for this model is $n+1$, which increases with $n$, so the theory we have developed so far does not apply. One way out of this difficulty was proposed by Kiefer and Wolfowitz (1956) in their paper "Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters", Ann. Math. Statist. 27, 887-906. For results on asymptotic normality of the MLE's in the semiparametric mixture models proposed by Kiefer and Wolfowitz (1956), see Van der Vaart (1996), "Efficient maximum likelihood estimation in semiparametric mixture models", Ann. Statist. 24, 862-878.
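A simulation sketch of the Neyman-Scott phenomenon (the incidental means, $\sigma^2$, $n$, and seed are arbitrary choices, not from the notes): the MLE tends to $\sigma^2/2$, while the unbiased estimator based on $Z_i = X_i - Y_i$ tends to $\sigma^2$:

```python
import random

# Neyman-Scott: the MLE (1/4n) sum Z_i^2 converges to sigma^2/2, while
# (1/2n) sum Z_i^2 (with Z_i = X_i - Y_i ~ N(0, 2 sigma^2)) is consistent.
rng = random.Random(0)
sigma = 2.0                            # so sigma^2 = 4
n = 100_000
sum_z2 = 0.0
for _ in range(n):
    mu_i = rng.uniform(-10.0, 10.0)    # arbitrary incidental means
    z = rng.gauss(mu_i, sigma) - rng.gauss(mu_i, sigma)
    sum_z2 += z * z
print(sum_z2 / (4 * n), sum_z2 / (2 * n))   # near sigma^2/2 = 2 and sigma^2 = 4
```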

Example 3.8 (Bivariate Poisson model). Suppose that $U\sim\text{Poisson}(\mu)$, $V\sim\text{Poisson}(\lambda)$, and $W\sim\text{Poisson}(\psi)$ are all independent, and let
\[
X \equiv U + W,\qquad Y \equiv V + W.
\]
Then $X\sim\text{Poisson}(\mu+\psi)$, $Y\sim\text{Poisson}(\lambda+\psi)$, and jointly $(X,Y)\sim$ bivariate Poisson$(\mu,\lambda,\psi)$: for nonnegative integers $x$ and $y$,

\[
P(X=x,\ Y=y;\ \mu,\lambda,\psi) = e^{-(\lambda+\mu+\psi)}\sum_{w=0}^{x\wedge y}\frac{\mu^{x-w}\lambda^{y-w}\psi^w}{w!\,(x-w)!\,(y-w)!},
\]
where $\theta = (\mu,\lambda,\psi)$. Thus under $\psi=0$, $X$ and $Y$ are independent ($\psi=0$ implies $W=0$ a.s., and then $X=U$ and $Y=V$ a.s.).

Consider testing $H:\psi=0$ (independence) versus $K:\psi>0$ based on a sample of $n$ i.i.d. pairs from $P_\theta$. Note that maximum likelihood estimation of $\theta=(\mu,\lambda,\psi)$ for general $\theta$ is not at all simple, so both the Wald and LR statistics must be calculated numerically. However, for

$\theta_0\in\Theta_0 \equiv \{\theta\in\Theta : \psi=0\}$, $X$ and $Y$ are just independent Poisson rv's, and some calculation shows that the scores, at any point $\theta_0 = (\mu,\lambda,0)\in\Theta_0$, are given by
\[
\dot l_\mu(X,Y;\theta_0) = -1 + \frac X\mu,\qquad
\dot l_\lambda(X,Y;\theta_0) = -1 + \frac Y\lambda,\qquad
\dot l_\psi(X,Y;\theta_0) = -1 + \frac{XY}{\mu\lambda}.
\]

Hence, under $H$ the MLE of $\theta$ is $\hat\theta_n^0 = (\overline X_n, \overline Y_n, 0)$, and the Rao or score statistic for testing $H$ is

\[
R_n = n\left(\frac{n^{-1}\sum_{i=1}^nX_iY_i}{\overline X_n\overline Y_n} - 1\right)^2\overline X_n\overline Y_n
= n\,\frac{\big(n^{-1}\sum_{i=1}^nX_iY_i - \overline X_n\overline Y_n\big)^2}{\overline X_n\overline Y_n},
\]
since
\[
I(\theta_0) = \begin{pmatrix} 1/\mu & 0 & 1/\mu\\ 0 & 1/\lambda & 1/\lambda\\ 1/\mu & 1/\lambda & 1/\mu + 1/\lambda + 1/(\mu\lambda)\end{pmatrix},
\]
so that

\[
I_{22\cdot1} = I_{22} - I_{21}I_{11}^{-1}I_{12}
= \frac1\mu + \frac1\lambda + \frac{1}{\mu\lambda} - \Big(\frac1\mu,\ \frac1\lambda\Big)\begin{pmatrix}\mu & 0\\ 0 & \lambda\end{pmatrix}\begin{pmatrix}1/\mu\\ 1/\lambda\end{pmatrix}
= \frac{1}{\mu\lambda}.
\]

Note that $R_n$ is a close relative of the classical correlation coefficient $r_n$, and in fact, under $H$ it follows fairly easily that $R_n = nr_n^2 + o_p(1)$.

Note that the parameter space for this model is $\Theta = \{(\mu,\lambda,\psi) : \mu>0,\ \lambda>0,\ \psi\ge0\}$, which is not closed. Moreover, the points in the null hypothesis $\psi=0$ are on a boundary of the parameter space. Thus the theory we have developed does not quite apply. Nevertheless, under $H$ we have $R_n \to_d \chi^2_1$.
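A quick check of the score statistic under $H$ (a simulation sketch; the rates, $n$, number of replications, and seed are arbitrary choices, not from the notes): with $X$ and $Y$ independent Poissons, $R_n$ should behave approximately like $\chi^2_1$, whose mean is 1:

```python
import math
import random

def rpois(rng, lam):
    """Poisson sampler (Knuth's multiplicative method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def rao_stat(pairs):
    """R_n = n (n^{-1} sum X_i Y_i - Xbar Ybar)^2 / (Xbar Ybar)."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    cross = sum(x * y for x, y in pairs) / n
    return n * (cross - xbar * ybar) ** 2 / (xbar * ybar)

rng = random.Random(0)
reps = [rao_stat([(rpois(rng, 2.0), rpois(rng, 3.0)) for _ in range(400)])
        for _ in range(500)]
print(sum(reps) / len(reps))   # should be near 1, the mean of chi^2_1
```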

Example 3.9 (The Multinomial Distribution). Suppose that $X_1,\dots,X_n$ are i.i.d. $\text{Mult}_k(1,p)$. Then
\[
N \equiv \sum_{i=1}^nX_i \sim \text{Mult}_k(n,p).
\]
Then the log likelihood is

\[
l(p\mid\underline X) = \log\Big(\prod_{i=1}^n\frac{1!}{X_{i1}!\cdots X_{ik}!}\,p_1^{X_{i1}}\cdots p_k^{X_{ik}}\Big)
= \sum_{j=1}^kN_j\log p_j + \sum_{i=1}^n\log\Big(\frac{1!}{X_{i1}!\cdots X_{ik}!}\Big),
\]
and the constrained log likelihood (incorporating the constraint $\sum_jp_j=1$ via a Lagrange multiplier $\lambda$) is

\[
l(p,\lambda\mid\underline X) = \log\Big(\prod_{i=1}^n\frac{1!}{X_{i1}!\cdots X_{ik}!}\,p_1^{X_{i1}}\cdots p_k^{X_{ik}}\Big) + \lambda\Big(\sum_{j=1}^kp_j - 1\Big)
= \sum_{j=1}^kN_j\log p_j + \lambda\Big(\sum_{j=1}^kp_j - 1\Big) + \text{constant in } p.
\]
Thus the likelihood equations are

\[
0 = \dot l_{p_j}(p,\lambda\mid\underline X) = \frac{N_j}{p_j} + \lambda,\qquad j=1,\dots,k,
\]
and

\[
0 = \sum_{j=1}^kp_j - 1.
\]
The solution of the first set of $k$ equations yields
\[
p_j = -\frac{N_j}{\lambda},\qquad j=1,\dots,k.
\]

But to satisfy the constraint we must have
\[
1 = \sum_{j=1}^kp_j = -\frac n\lambda,
\]
or $\lambda = -n$, and thus the MLE of $p$ is $\hat p_n = N/n$. The score vector (for $n=1$) becomes
\[
\dot l(p\mid X_1) = \Big(\frac{X_{1j}}{p_j} - 1\Big)_{j=1,\dots,k} = \Big(\frac{X_{1j}-p_j}{p_j}\Big)_{j=1,\dots,k}.
\]
Thus the information matrix is

\[
I(p) = E\{\dot l(p\mid X_1)\dot l(p\mid X_1)^T\} = \operatorname{diag}(1/p_j)\big(\operatorname{diag}(p_j) - pp^T\big)\operatorname{diag}(1/p_j)
= \operatorname{diag}(1/p_j) - \mathbf{1}\mathbf{1}^T.
\]
Since $I(p)p = \mathbf{1} - \mathbf{1} = 0$, this matrix is singular. But note that it has a generalized inverse given by $I^-(p) = \operatorname{diag}(p_j) - pp^T$:

\[
I(p)\,I^-(p)\,I(p) = I(p).
\]

In fact, we know by direct calculations and the multivariate CLT that

\[
\sqrt n(\hat p_n - p) \to_d N_k(0, I^-(p)).
\]
Note that the natural Rao statistic is in fact the usual chi-square statistic:
\[
Z_n^T(p_0)\,I^-(p_0)\,Z_n(p_0) = \sum_{j=1}^k\frac{(N_j - np_{j0})^2}{np_{j0}},
\]
where
\[
Z_n(p_0) = \frac{1}{\sqrt n}\Big(\frac{N_1-np_{10}}{p_{10}},\ \dots,\ \frac{N_k-np_{k0}}{p_{k0}}\Big)^T
\]
and

\[
I^-(p_0) = \operatorname{diag}(p_{j0}) - p_0p_0^T.
\]
The Wald statistic is given by

\[
\sqrt n(\hat p_n - p_0)^T\,I(\hat p_n)\,\sqrt n(\hat p_n - p_0) = \sum_{j=1}^k\frac{(N_j - np_{j0})^2}{n\hat p_j}.
\]
For more on problems involving singular information matrices, see Rotnitzky, Cox, Bottai, and Robins (2000).
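The identity between the Rao statistic and Pearson's chi-square is easy to exercise numerically (a sketch, not from the notes; $p_0$, $n$, the inverse-CDF sampler, and the seed are illustrative choices):

```python
import random

def pearson_chi2(counts, p0):
    """Rao statistic Z_n^T I^-(p0) Z_n = sum_j (N_j - n p_j0)^2 / (n p_j0)."""
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, p0))

rng = random.Random(0)
p0 = [0.2, 0.3, 0.5]
n = 10_000
counts = [0, 0, 0]
for _ in range(n):
    u = rng.random()               # one Mult_3(1, p0) draw by inversion
    counts[0 if u < 0.2 else 1 if u < 0.5 else 2] += 1
stat = pearson_chi2(counts, p0)
print(counts, stat)   # under H the statistic is approximately chi^2_2
```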

4 Consistency of Maximum Likelihood Estimators

Some Uniform Strong Laws of Large Numbers

Suppose that:

A. $X, X_1,\dots,X_n$ are i.i.d. $P$ on the measurable space $(\mathcal{X},\mathcal{A})$.

B. For each $\theta\in\Theta$, $f(x,\theta)$ is a measurable, real-valued function of $x$ with $f(\cdot,\theta)\in L_1(P)$.

Let $\mathcal{F} = \{f(\cdot,\theta) : \theta\in\Theta\}$. Since $f(\cdot,\theta)\in L_1(P)$ for each $\theta$,

\[
g(\theta) \equiv Ef(X,\theta) = \int f(x,\theta)\,dP(x) \equiv Pf(\cdot,\theta)
\]
exists and is finite. Moreover, by the strong law of large numbers,

\[
\mathbb{P}_nf(\cdot,\theta) \equiv \int f(x,\theta)\,d\mathbb{P}_n(x) = \frac1n\sum_{i=1}^nf(X_i,\theta)
\]
\[
(1)\qquad \to_{a.s.} Ef(X,\theta) = Pf(\cdot,\theta) = g(\theta).
\]
It is often useful and important to strengthen (1) to hold uniformly in $\theta\in\Theta$:

\[
(2)\qquad \sup_{\theta\in\Theta}|\mathbb{P}_nf(\cdot,\theta) - Pf(\cdot,\theta)| \to_{a.s.} 0.
\]
Note that the left side in (2) is equal to

\[
\|\mathbb{P}_n - P\|_{\mathcal{F}} \equiv \sup_{f\in\mathcal{F}}|\mathbb{P}_nf - Pf|.
\]

Here is how (2) can be used: suppose that we have a sequence $\hat\theta_n$ of estimators, possibly dependent on $X_1,\dots,X_n$, such that $\hat\theta_n \to_{a.s.} \theta_0$. Suppose that $g(\theta)$ is continuous at $\theta_0$. We would like to conclude that
\[
(3)\qquad \mathbb{P}_nf(\cdot,\hat\theta_n) = \frac1n\sum_{i=1}^nf(X_i,\hat\theta_n) \to_{a.s.} g(\theta_0).
\]
The convergence (3) does not follow from (1); but (3) does follow from (2):

\[
|\mathbb{P}_nf(\cdot,\hat\theta_n) - g(\theta_0)| \le |\mathbb{P}_nf(\cdot,\hat\theta_n) - g(\hat\theta_n)| + |g(\hat\theta_n) - g(\theta_0)|
\le \sup_{\theta\in\Theta}|\mathbb{P}_nf(\cdot,\theta) - g(\theta)| + |g(\hat\theta_n) - g(\theta_0)|
= \|\mathbb{P}_n - P\|_{\mathcal{F}} + |g(\hat\theta_n) - g(\theta_0)| \to_{a.s.} 0 + 0 = 0.
\]
The following theorems, due to Le Cam, give conditions on $f$ and $P$ under which (2) holds. The first theorem is a prototype for what are now known in empirical process theory as "Glivenko-Cantelli theorems".
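For a concrete instance of (2) (an illustration, not from the notes; $P = N(0,1)$ and the sample sizes are arbitrary choices): take $f(x,\theta) = 1\{x \le \theta\}$, so that $\sup_\theta|\mathbb{P}_nf - Pf|$ is the Kolmogorov distance between the empirical and true distribution functions, which the classical Glivenko-Cantelli theorem drives to zero:

```python
import math
import random

def phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def kolmogorov_distance(x):
    """sup_theta |P_n f(., theta) - P f(., theta)| for f(x, theta) = 1{x <= theta}."""
    xs = sorted(x)
    n = len(xs)
    # the supremum over theta is attained at the order statistics;
    # check the gap just above and just below each jump of the empirical cdf
    return max(max((i + 1) / n - phi(t), phi(t) - i / n)
               for i, t in enumerate(xs))

rng = random.Random(0)
dists = {}
for n in (100, 10_000):
    dists[n] = kolmogorov_distance([rng.gauss(0.0, 1.0) for _ in range(n)])
print(dists)   # the sup distance shrinks (at rate 1/sqrt(n), in fact)
```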

Theorem 4.1 Suppose that:
(a) Θ is compact.
(b) f(x, ·) is continuous in θ for all x.
(c) There exists a function F(x) such that E F(X) < ∞ and |f(x, θ)| ≤ F(x) for all x ∈ 𝒳, θ ∈ Θ.
Then (2) holds; i.e.

    sup_{θ∈Θ} | ℙ_n f(·, θ) − P f(·, θ) | →_{a.s.} 0 .

The second theorem is a "one-sided" version of Theorem 4.1 which is useful for the theory of maximum likelihood estimation.

Theorem 4.2 Suppose that:
(a) Θ is compact.
(b) f(x, ·) is upper semicontinuous in θ for all x; i.e. limsup_{n→∞} f(x, θ_n) ≤ f(x, θ) for all θ_n → θ and all θ ∈ Θ.
(c) There exists a function F(x) such that E F(X) < ∞ and f(x, θ) ≤ F(x) for all x ∈ 𝒳, θ ∈ Θ.
(d) For all θ and all sufficiently small ρ > 0,

    sup_{|θ′−θ|<ρ} f(x, θ′)

is measurable in x. Then

    limsup_{n→∞} sup_{θ∈Θ} ℙ_n f(·, θ) ≤_{a.s.} sup_{θ∈Θ} P f(·, θ) = sup_{θ∈Θ} g(θ) .

We proceed by first proving Theorem 4.2; Theorem 4.1 will then follow as a consequence of Theorem 4.2.

Proof. (Theorem 4.2.) Let

    ψ(x, θ, ρ) ≡ sup_{|θ′−θ|<ρ} f(x, θ′) .

Then ψ is measurable (for ρ sufficiently small), bounded above by an integrable function F, and

    ψ(x, θ, ρ) ↓ f(x, θ)  as ρ ↓ 0, by (b).

Thus by the monotone convergence theorem

    ∫ ψ(x, θ, ρ) dP(x) ↓ ∫ f(x, θ) dP(x) = g(θ) .

Let ε > 0. For each θ, find ρ_θ so that

    ∫ ψ(x, θ, ρ_θ) dP(x) < g(θ) + ε .

The balls { θ′ : |θ′ − θ| < ρ_θ } cover the compact set Θ, so there is a finite subcover with centers θ_1, …, θ_m. For any θ ∈ Θ there is some j with |θ − θ_j| < ρ_{θ_j}, so that

    ℙ_n f(·, θ) ≤ ℙ_n ψ(·, θ_j, ρ_{θ_j}) ,

and hence

    sup_{θ∈Θ} ℙ_n f(·, θ) ≤ max_{1≤j≤m} ℙ_n ψ(·, θ_j, ρ_{θ_j})
                           →_{a.s.} max_{1≤j≤m} P ψ(·, θ_j, ρ_{θ_j})
                           ≤ max_{1≤j≤m} g(θ_j) + ε
                           ≤ sup_{θ∈Θ} g(θ) + ε .

We conclude that

    limsup_{n→∞} sup_{θ∈Θ} ℙ_n f(·, θ) ≤_{a.s.} sup_{θ∈Θ} g(θ) + ε .

Letting ε ↓ 0 completes the proof. □

Proof. (Theorem 4.1.) Since f is continuous in θ, condition (d) of Theorem 4.2 is satisfied: for any countable set D dense in { θ′ : |θ′ − θ| < ρ },

    sup_{|θ′−θ|<ρ} f(x, θ′) = sup_{θ′∈D} f(x, θ′) ,

where the right side is measurable since it is a countable supremum of measurable functions. Furthermore, g(θ) is continuous in θ:

    g(θ) = lim_{θ′→θ} g(θ′) = lim_{θ′→θ} ∫ f(x, θ′) dP(x) = ∫ f(x, θ) dP(x)

by the dominated convergence theorem. Now Theorem 4.1 follows from Theorem 4.2 applied to the functions h(x, θ) ≡ f(x, θ) − g(θ) and −h(x, θ): by Theorem 4.2 applied to { h(x, θ) : θ ∈ Θ },

    limsup_{n→∞} sup_{θ∈Θ} ( ℙ_n f(·, θ) − g(θ) ) ≤ 0  a.s.

By Theorem 4.2 applied to { −h(x, θ) : θ ∈ Θ },

    limsup_{n→∞} sup_{θ∈Θ} ( g(θ) − ℙ_n f(·, θ) ) ≤ 0  a.s.

The conclusion of Theorem 4.1 follows since

    0 ≤ sup_{θ∈Θ} | ℙ_n f(·, θ) − g(θ) | = sup_{θ∈Θ} ( ℙ_n f(·, θ) − g(θ) ) ∨ sup_{θ∈Θ} ( g(θ) − ℙ_n f(·, θ) ) .  □

For our application of Theorem 4.2 to consistency of maximum likelihood, the following lemma will be useful.

Lemma 4.1 If the conditions of Theorem 4.2 hold, then g(θ) is upper semicontinuous; i.e.

    limsup_{θ′→θ} g(θ′) ≤ g(θ) .

Proof. Since f(x, θ) is upper semicontinuous in θ,

    limsup_{θ′→θ} f(x, θ′) ≤ f(x, θ)  for all x ;

i.e.

    liminf_{θ′→θ} [ f(x, θ) − f(x, θ′) ] ≥ 0  for all x .

Hence it follows by Fatou's lemma that

    0 ≤ E liminf_{θ′→θ} [ f(X, θ) − f(X, θ′) ]
      ≤ liminf_{θ′→θ} E [ f(X, θ) − f(X, θ′) ]
      = E f(X, θ) − limsup_{θ′→θ} E f(X, θ′) ;

i.e.

    limsup_{θ′→θ} E f(X, θ′) ≤ E f(X, θ) = g(θ) .  □

Now we are prepared to tackle consistency of maximum likelihood estimates.

Theorem 4.3 (Wald, 1949). Suppose that X, X_1, …, X_n are i.i.d. P_{θ_0}, θ_0 ∈ Θ, with density p(x, θ_0) with respect to the dominating measure ν, and that:
(a) Θ is compact.
(b) p(x, ·) is upper semicontinuous in θ for all x.
(c) There exists a function F(x) such that E F(X) < ∞ and

    f(x, θ) ≡ log p(x, θ) − log p(x, θ_0) ≤ F(x)

for all x ∈ 𝒳, θ ∈ Θ.
(d) For all θ and all sufficiently small ρ > 0,

    sup_{|θ′−θ|<ρ} p(x, θ′)

is measurable in x.
(e) p(x, θ) = p(x, θ_0) a.e. ν implies that θ = θ_0.

Then for any sequence of maximum likelihood estimates θ̂_n of θ_0,

    θ̂_n →_{a.s.} θ_0 .

Proof. Let ρ > 0. The functions { f(x, θ) : θ ∈ Θ } satisfy the conditions of Theorem 4.2, but we will apply Theorem 4.2 with Θ replaced by the subset

    S ≡ { θ : |θ − θ_0| ≥ ρ } ⊂ Θ .

Then S is compact, and by Theorem 4.2

    P_{θ_0}( limsup_{n→∞} sup_{θ∈S} ℙ_n f(·, θ) ≤ sup_{θ∈S} g(θ) ) = 1 ,

where

    g(θ) = E_{θ_0} f(X, θ) = E_{θ_0} log [ p(X, θ) / p(X, θ_0) ] = −K(P_{θ_0}, P_θ) < 0  for θ ∈ S .

Furthermore, by Lemma 4.1, g(θ) is upper semicontinuous and hence achieves its supremum on the compact set S. Let δ = sup_{θ∈S} g(θ). Then by Lemma 1.1 and the identifiability assumption (e), it follows that δ < 0, and we have

    P_{θ_0}( limsup_{n→∞} sup_{θ∈S} ℙ_n f(·, θ) ≤ δ ) = 1 .

    sup_{θ∈S} ℙ_n f(·, θ) ≤ δ/2 < 0 .

But

    ℙ_n f(·, θ̂_n) = sup_{θ∈Θ} ℙ_n f(·, θ) = sup_{θ∈Θ} (1/n) { l_n(θ) − l_n(θ_0) } ≥ 0 .

Hence θ̂_n ∉ S for n > N; that is, |θ̂_n − θ_0| < ρ with probability 1. Since ρ was arbitrary, θ̂_n is a.s. consistent. □

Remark 4.1 Theorem 4.3 is due to Wald (1949). The present version is an adaptation of Chapters 16 and 17 of Ferguson (1996). For further Glivenko-Cantelli theorems, see Chapter 2.4 of Van der Vaart and Wellner (1996).

5 The EM Algorithm

In statistical applications it is a fairly common occurrence that the observations involve "missing data" or "incomplete data", and this results in complicated likelihood functions for which there is no explicit formula for the MLE. Our goal in this section is to introduce one quite general scheme for maximizing likelihoods, the EM algorithm, which is useful for dealing with missing or incomplete data.

Suppose that we observe Y ∼ Q_θ on (𝒴, ℬ) for some θ ∈ Θ; we assume that Q_θ has density q_θ with respect to a dominating measure ν. This is the "observed" or "incomplete" data. On the other hand there is often an X ∼ P_θ on (𝒳, 𝒜) which has a simpler likelihood or for which the MLE's can be calculated explicitly, and which satisfies Y = T(X). Here we will assume that P_θ has density p_θ with respect to µ, and we refer to X as the "unobserved" or "complete" data. Since Y = T(X), it follows that

    Q_θ(B) = Q_θ(Y ∈ B) = P_θ(T(X) ∈ B) = P_θ(X ∈ T⁻¹(B)) ,

or Q_θ = P_θ ∘ T⁻¹. We want to compute

    θ̂^Y = argmax_θ log q_θ(Y) ,

but this is often difficult because q_θ is complicated. We don't get to observe X, but if we did, then computation of

    θ̂^X = argmax_θ log p_θ(X)

is often much easier. How to proceed? Choose an initial estimator θ^{(0)} of θ, and hence an initial estimator P_{θ^{(0)}} of P_θ. Then we would proceed via an

E-step: compute, for θ ∈ Θ,

    φ_0(θ) = φ_0(θ, Y) = E_{P_{θ^{(0)}}} { log p_θ(X) | T(X) = Y } .

This is the "estimated log-likelihood" based on our current guess θ^{(0)} of θ and our observations Y. Then carry out an

M-step: maximize φ_0(θ) = φ_0(θ, Y) as a function of θ to find

    θ^{(1)} = argmax_θ φ_0(θ) .

Now iterate the above two steps. The following examples illustrate this general scheme.

Example 5.1 (Multinomial). Suppose that Y ∼ Mult_4(n = 197, p) where

    p = ( 1/2 + θ/4, (1−θ)/4, (1−θ)/4, θ/4 ) ,    0 < θ < 1 .

Therefore

    q_θ(y) = [ n! / (y_1! y_2! y_3! y_4!) ] (1/2 + θ/4)^{y_1} ((1−θ)/4)^{y_2} ((1−θ)/4)^{y_3} (θ/4)^{y_4} ,

    l(θ|Y) = Y_1 log(1/2 + θ/4) + (Y_2 + Y_3) log(1−θ) + Y_4 log θ + constant ,

    l̇_θ(Y) = Y_1 (1/4)/(1/2 + θ/4) − (Y_2 + Y_3)/(1−θ) + Y_4/θ ,

    l̈_θ(Y) = −Y_1 (1/4)²/(1/2 + θ/4)² − (Y_2 + Y_3)/(1−θ)² − Y_4/θ² ,

and

    I(θ) = (1/4)²/(1/2 + θ/4) + (1/2)/(1−θ) + (1/4)/θ .

If θ_n is a preliminary estimator of θ, then a one-step estimator of θ is given by

    θ̌_n = θ_n + I(θ_n)⁻¹ (1/n) l̇_{θ_n}(Y) .

Thus, if Y = (125, 18, 20, 34) is observed, and we take θ_n = 4Y_4/n = (4·34)/197 = .6904, then the one-step estimator is

    θ̌_n = .6904 + (1/2.07)(1/197)(−27.03) = .6904 − .0663 = .6241 .
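The arithmetic of this one-step (Newton) estimator is easy to reproduce; the following sketch recomputes the numbers quoted above.

```python
# One-step estimator for the multinomial example with Y = (125, 18, 20, 34).
Y1, Y2, Y3, Y4 = 125, 18, 20, 34
n = Y1 + Y2 + Y3 + Y4                       # 197

def score(th):
    """l-dot_theta(Y): the observed-data score."""
    return Y1 * 0.25 / (0.5 + th / 4) - (Y2 + Y3) / (1 - th) + Y4 / th

def info(th):
    """I(theta): the per-observation Fisher information."""
    return 0.25 ** 2 / (0.5 + th / 4) + 0.5 / (1 - th) + 0.25 / th

theta = 4 * Y4 / n                          # preliminary estimator, 0.6904...
theta_check = theta + score(theta) / (n * info(theta))
print(round(theta, 4), round(theta_check, 4))   # 0.6904 0.6241
```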

Note that solving the likelihood equation l̇_θ(Y) = 0 involves solving a cubic equation. Another approach to maximizing l(θ|Y) is via the EM algorithm: suppose the "complete data" is

(1)    X ∼ Mult_5(n, p̄)  with  p̄ = ( 1/2, θ/4, (1−θ)/4, (1−θ)/4, θ/4 ) ,

so that

(2)    p_θ(x) = [ n! / ∏_{i=1}^5 x_i! ] (1/2)^{x_1} (θ/4)^{x_2+x_5} ((1−θ)/4)^{x_3+x_4} ,

and the "incomplete data" Y is given in terms of the "complete data" X by

(3)    Y = (X_1 + X_2, X_3, X_4, X_5) .

Then the “E - step” of the algorithm is to estimate X given Y (and θ):

(4)    E(X|Y) = ( Y_1 (1/2)/(1/2 + θ/4), Y_1 (θ/4)/(1/2 + θ/4), Y_2, Y_3, Y_4 ) ,

so we set

(5)    X̂^{(p)} ≡ ( Y_1 (1/2)/(1/2 + θ^{(p)}/4), Y_1 (θ^{(p)}/4)/(1/2 + θ^{(p)}/4), Y_2, Y_3, Y_4 ) ,

where θ^{(p)} is the estimator of θ at the p-th step of the algorithm.

The "M-step" of the algorithm is to maximize the complete-data likelihood with X replaced by our estimate of E(X|Y): it is easily seen that

    l̇_θ(X) = (X_2 + X_5)/θ − (X_3 + X_4)/(1−θ) = 0

yields

    θ̂_n = (X_2 + X_5) / (X_2 + X_5 + X_3 + X_4)

as an estimator of θ based on the full data. Since we can only observe Y, we take

(6)    θ^{(p+1)} = ( X̂_2^{(p)} + X_5 ) / ( X̂_2^{(p)} + X_5 + X_3 + X_4 ) ,

and alternate between (5) and (6), starting from a "reasonable" initial guess θ^{(0)}; say θ^{(0)} = 1/2. This yields the following table:

Table 4.1: Iterates of E and M steps in the Multinomial example

    p    θ^{(p)}        θ^{(p)} − θ̂_n    (θ^{(p+1)} − θ̂_n)/(θ^{(p)} − θ̂_n)
    0    .500000000    −.126821498       .1465
    1    .608247423    −.018574075       .1346
    2    .624321051    −.002500447       .1330
    3    .626488879    −.000332619       .1328
    4    .626777323    −.000044176       .1328
    5    .626815632    −.000005866       .1328
    6    .626820719    −.000000779
    7    .626821395    −.000000104
    8    .626821484    −.000000014

The exact root of the likelihood equation which maximizes the likelihood is θ̂_n = .62682149… .
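The iteration (5)-(6) is short enough to run directly; the sketch below reproduces the iterates of Table 4.1 (the iteration count of 20 is just a convenient cap, well past numerical convergence at the linear rate ≈ .1328 per step).

```python
# EM iterations (5)-(6) for the multinomial example, starting from theta = 1/2.
Y1, Y2, Y3, Y4 = 125, 18, 20, 34

theta, iterates = 0.5, []
for p in range(20):
    X2 = Y1 * (theta / 4) / (0.5 + theta / 4)   # E-step (5): estimate X_2 given Y
    theta = (X2 + Y4) / (X2 + Y4 + Y2 + Y3)     # M-step (6)
    iterates.append(theta)
print(iterates[0], theta)   # 0.6082474226...  0.62682149...
```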

Example 5.2 (Exponential mixture model). Suppose that Y ∼ Q_θ on R⁺ where Q_θ has density

    q_θ(y) = { p λ e^{−λy} + (1−p) µ e^{−µy} } 1_{(0,∞)}(y) ,

and θ = (p, λ, µ) ∈ (0, 1) × R⁺². Consider estimation of θ based on Y_1, …, Y_n i.i.d. q_θ(y). The scores for θ based on Y are

    l̇_p(Y) = ( λ e^{−λY} − µ e^{−µY} ) / q_θ(Y) ,

    l̇_λ(Y) = p e^{−λY} (1 − λY) / q_θ(Y) ,  and

    l̇_µ(Y) = (1−p) e^{−µY} (1 − µY) / q_θ(Y) .

It is easily seen that E_θ l̇_p²(Y) < ∞, E_θ l̇_λ²(Y) < ∞, and E_θ l̇_µ²(Y) < ∞, and, moreover, that the information matrix is nonsingular if λ ≠ µ. However, the likelihood equations are complicated and do not have "closed form" solutions. Thus we will take an approach based on the EM algorithm. The natural "complete data" for this problem is X = (Y, Δ) ∼ p_θ(x) where

    p_θ(x) = p_θ(y, δ) = ( p λ e^{−λy} )^δ ( (1−p) µ e^{−µy} )^{1−δ} ,    y > 0, δ ∈ {0, 1} .

Thus the "incomplete data" Y is just the first coordinate of X. Furthermore, maximum likelihood estimation of θ in the complete data problem is easy, as can be seen by the following calculations. The log of the density p_θ(x) is

    l(θ|X) = δ log p + (1−δ) log(1−p) + δ(log λ − λY) + (1−δ)(log µ − µY) ,

so that

    l̇_p(X) = Δ/p − (1−Δ)/(1−p) ,         p̂ = (1/n) Σ_{i=1}^n Δ_i ,

    l̇_λ(X) = Δ (1/λ − Y) ,               1/λ̂ = Σ_{i=1}^n Δ_i Y_i / Σ_{i=1}^n Δ_i ,

    l̇_µ(X) = (1−Δ)(1/µ − Y) ,            1/µ̂ = Σ_{i=1}^n (1−Δ_i) Y_i / Σ_{i=1}^n (1−Δ_i) .

This gives the "M-step" of the EM algorithm. To find the E-step, we compute

    E(Δ|Y) = p λ e^{−λY} / ( p λ e^{−λY} + (1−p) µ e^{−µY} ) ≡ Δ̂(Y) ≡ p(Y; θ)

since (Δ|Y) ∼ Bernoulli(p(Y; θ)). Thus the EM algorithm becomes: θ^{(m+1)} = ( p̂^{(m+1)}, λ̂^{(m+1)}, µ̂^{(m+1)} ) where

    p̂^{(m+1)} = (1/n) Σ_{i=1}^n Δ̂_i^{(m)} ,    Δ̂_i^{(m)} ≡ Δ̂(Y_i; θ^{(m)}) ,

    1/λ̂^{(m+1)} = Σ_{i=1}^n Y_i Δ̂_i^{(m)} / Σ_{i=1}^n Δ̂_i^{(m)} ,

    1/µ̂^{(m+1)} = Σ_{i=1}^n Y_i (1 − Δ̂_i^{(m)}) / Σ_{i=1}^n (1 − Δ̂_i^{(m)}) .
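The E- and M-steps above translate directly into code. The sketch below runs them on simulated data; the sample size, true parameter values, starting values, and iteration count are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
p_true, lam_true, mu_true = 0.3, 2.0, 0.5
delta = rng.uniform(size=n) < p_true        # latent component labels
Y = np.where(delta,
             rng.exponential(1 / lam_true, n),   # scale = 1/rate
             rng.exponential(1 / mu_true, n))

p, lam, mu = 0.5, 1.0, 0.2                  # initial guesses
for _ in range(200):
    # E-step: Delta-hat(Y_i) = posterior probability of the lambda component
    a = p * lam * np.exp(-lam * Y)
    b = (1 - p) * mu * np.exp(-mu * Y)
    d = a / (a + b)
    # M-step: weighted complete-data MLEs
    p = d.mean()
    lam = d.sum() / (d * Y).sum()
    mu = (1 - d).sum() / ((1 - d) * Y).sum()
print(p, lam, mu)
```

With this sample size the iterates settle near the true values (0.3, 2.0, 0.5), though mixture parameters are estimated with sizeable sampling error.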

6 Nonparametric Maximum Likelihood Estimation

The maximum likelihood method has also been used successfully in a variety of nonparametric problems. As in Section 5, we will begin with several examples.

Example 6.1 Let X_1, …, X_n be i.i.d. 𝒳-valued random variables with common probability distribution (measure) P on (𝒳, 𝒜). For an arbitrary probability distribution (measure) Q on (𝒳, 𝒜), let Q({X_i}) ≡ q_i. If there are no ties in the X_i's, then

(1)    L(Q|X) = ∏_{i=1}^n q_i ≡ L(q|X) .

Consider maximizing this likelihood as a function of q. By Jensen's inequality and concavity of log x,

    (1/n) Σ_{i=1}^n log q_i ≤ log( q̄ ) ,

or

    ( ∏_{i=1}^n q_i )^{1/n} ≤ q̄ ,

with equality if and only if q_1 = ⋯ = q_n = q̄. Since Σ q_i ≤ 1, q̄ ≤ 1/n. Thus

    L(q|X) = ∏_{i=1}^n q_i ≤ q̄^n ≤ (1/n)^n

with equality if and only if Σ q_i = 1 and q_1 = ⋯ = q_n = 1/n. Thus the MLE of P is ℙ_n with ℙ_n({X_i}) = 1/n, or

    ℙ_n(A) = (1/n) #{ i ≤ n : X_i ∈ A } = (1/n) Σ_{i=1}^n δ_{X_i}(A) .

This extends easily to the case of ties (homework!). Thus we have proved:

Theorem 6.1 The nonparametric maximum likelihood estimator of a completely arbitrary distribution P on any measurable space based on i.i.d. data is the empirical measure ℙ_n = n⁻¹ Σ_{i=1}^n δ_{X_i}.

In particular, if 𝒳 = R, the empirical distribution function 𝔽_n(x) ≡ n⁻¹ Σ_{i=1}^n 1_{[X_i ≤ x]} is the MLE of F. Recall from Chapter 2 that Donsker's theorem yields

    √n( 𝔽_n − F ) = U_n(F) ⇒ U(F)

where U_n is the empirical process of i.i.d. Uniform(0, 1) random variables and U is a Brownian bridge process.
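In code, the NPMLE of Theorem 6.1 is just the empirical measure; a minimal sketch of the corresponding empirical distribution function (with made-up data):

```python
import numpy as np

def ecdf(data):
    """Return the empirical distribution function F_n of the sample."""
    x = np.sort(data)
    def F_n(t):
        # fraction of observations <= t
        return np.searchsorted(x, t, side="right") / len(x)
    return F_n

F_n = ecdf(np.array([0.2, 1.5, 0.7, 0.7, 3.0]))
print(F_n(0.7), F_n(2.0))   # 0.6 0.8
```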

Example 6.2 (Censored data). Suppose that X_1, …, X_n are i.i.d. F and Y_1, …, Y_n are i.i.d. G. Think of the X's as survival times and the Y's as censoring times. Suppose that we can observe only the i.i.d. pairs (Z_1, Δ_1), …, (Z_n, Δ_n) where Z_i ≡ X_i ∧ Y_i and Δ_i ≡ 1{X_i ≤ Y_i}. Thus the joint distribution of (Z, Δ) is given by

    H^{uc}(z) ≡ P(Z ≤ z, Δ = 1) = ∫_{[0,z]} (1 − G(x−)) dF(x) ,

where G(x−) ≡ lim_{y↑x} G(y), and

    H^c(z) ≡ P(Z ≤ z, Δ = 0) = ∫_{[0,z]} (1 − F(y)) dG(y) .

Furthermore, the survival function 1 − H(z) = P(Z > z) is given by

    1 − H(z) = P(Z > z) = P(X > z, Y > z) = (1 − F(z))(1 − G(z)) .

Now suppose for simplicity that Z_{n:1} ≤ ⋯ ≤ Z_{n:n} are all distinct. Let Δ_{n:1}, …, Δ_{n:n} be the corresponding Δ's. If we let p_i ≡ F({Z_{n:i}}) = F(Z_{n:i}) − F(Z_{n:i}−) = ΔF(Z_{n:i}) and q_i ≡ G({Z_{n:i}}) = G(Z_{n:i}) − G(Z_{n:i}−) = ΔG(Z_{n:i}), i = 1, …, n, with p_{n+1} = 1 − F(Z_{n:n}) = 1 − Σ_{j=1}^n p_j and q_{n+1} = 1 − G(Z_{n:n}) = 1 − Σ_{j=1}^n q_j, then a nonparametric likelihood for the censored data problem is

(2)    ∏_{i=1}^n p_i^{Δ_{n:i}} ( Σ_{j=i}^{n+1} q_j )^{Δ_{n:i}} q_i^{1−Δ_{n:i}} ( Σ_{j=i+1}^{n+1} p_j )^{1−Δ_{n:i}} = [ ∏_{i=1}^n p_i^{Δ_{n:i}} ( Σ_{j=i+1}^{n+1} p_j )^{1−Δ_{n:i}} ] × B ,

where B depends only on the q_i's, and hence only on G. Thus we can find the nonparametric maximum likelihood estimator of F by maximizing the first term over the p_i's. This is easy once the log-likelihood is re-written in terms of λ_i ≡ p_i / Σ_{j=i}^{n+1} p_j, i = 1, …, n+1.

We will first take a different approach by using Example 6.1 as follows: if F has density f, then the hazard function λ is given by λ(t) = f(t)/(1 − F(t)), and the cumulative hazard function Λ is

    Λ(t) = ∫_0^t λ(s) ds = ∫_0^t f(s)/(1 − F(s)) ds = ∫_0^t 1/(1 − F(s)) dF(s) .

For an arbitrary distribution function F, it turns out that the "right" way to define Λ, the cumulative hazard function corresponding to F, is:

(3)    Λ(t) = ∫_{[0,t]} 1/(1 − F(s−)) dF(s) .

Note that we can write Λ as

    Λ(t) = ∫_{[0,t]} (1 − G(s−)) / [ (1 − G(s−))(1 − F(s−)) ] dF(s) = ∫_{[0,t]} 1/(1 − H(s−)) dH^{uc}(s) .

Moreover, we can estimate both H^{uc} and H by their natural nonparametric estimators (from Example 6.1):

    H_n^{uc}(z) = (1/n) Σ_{i=1}^n 1_{[Z_i ≤ z, Δ_i = 1]} ,    H_n(z) = (1/n) Σ_{i=1}^n 1_{[Z_i ≤ z]} .

Thus a natural "nonparametric maximum likelihood" estimator of Λ is Λ̂_n given by

(4)    Λ̂_n(t) = ∫_0^t 1/(1 − H_n(s−)) dH_n^{uc}(s) .

It remains only to invert the relationship (3) to obtain an estimator of F. To do this we need one more piece of notation: for any nondecreasing, right-continuous function A, we define the continuous part A_c of A by

    A_c(t) ≡ A(t) − Σ_{s≤t} ΔA(s) ,    ΔA(s) ≡ A(s) − A(s−) .

Proposition 6.1 Suppose that Λ is the cumulative hazard function corresponding to an arbitrary distribution function F as defined by (3). Then

(5)    1 − F(t) = exp( −Λ_c(t) ) ∏_{s≤t} (1 − ΔΛ(s)) ≡ ∏_{s≤t} (1 − dΛ(s)) .

Proof. In the case of a continuous distribution function F, Λ is also continuous, Λ = Λ_c, ΔΛ = 0 identically, and we calculate Λ(t) = −log(1 − F(t)), so that (5) holds.

In the case of a purely discrete distribution function F, the cumulative hazard function Λ is also discrete, so that Λ_c ≡ 0 and

    1 − ΔΛ(s) = 1 − ΔF(s)/(1 − F(s−)) = (1 − F(s))/(1 − F(s−)) .

Thus

    ∏_{s≤t} (1 − ΔΛ(s)) = [ (1 − F(s_1))/(1 − F(s_1−)) ] × [ (1 − F(s_2))/(1 − F(s_2−)) ] × ⋯ × [ (1 − F(s_k))/(1 − F(s_k−)) ]
                        = [ (1 − F(s_1))/1 ] × [ (1 − F(s_2))/(1 − F(s_1)) ] × ⋯ × [ (1 − F(t))/(1 − F(s_{k−1})) ]
                        = 1 − F(t) ,

where s_1, …, s_k are the points of jump of F which are less than or equal to t. Hence (5) also holds in this case.

For a complete proof of the general case, which relies on rewriting (3) as

(a)    F(t) = ∫_0^t (1 − F(s−)) dΛ(s) ,

or equivalently

(b)    1 − F(t) = 1 − ∫_0^t (1 − F(s−)) dΛ(s) ,

see e.g. Liptser and Shiryayev (1978), Lemma 18.8, page 255. For a still more general (Doléans-Dade) formula which is valid for martingales, see Shorack and Wellner (1986), page 897. □

Now we return to the likelihood in (2) with the goal of maximizing it directly. We first use the discrete form of the identities above linking F and Λ to re-write (2) in terms of λ_i ≡ p_i / Σ_{j=i}^{n+1} p_j: note that

    ∏_{i=1}^n p_i^{Δ_{n:i}} ( Σ_{j=i+1}^{n+1} p_j )^{1−Δ_{n:i}}
        = ∏_{i=1}^n [ p_i / Σ_{j=i}^{n+1} p_j ]^{Δ_{n:i}} [ Σ_{j=i+1}^{n+1} p_j / Σ_{j=i}^{n+1} p_j ]^{−Δ_{n:i}} ( Σ_{j=i+1}^{n+1} p_j )

        = ∏_{i=1}^n λ_i^{Δ_{n:i}} (1 − λ_i)^{−Δ_{n:i}} ( Σ_{j=i+1}^{n+1} p_j )

              since 1 − λ_i = 1 − p_i / Σ_{j=i}^{n+1} p_j = Σ_{j=i+1}^{n+1} p_j / Σ_{j=i}^{n+1} p_j

        = ∏_{i=1}^n λ_i^{Δ_{n:i}} (1 − λ_i)^{−Δ_{n:i}} · ∏_{i=1}^n ∏_{j=1}^i (1 − λ_j)

              since ∏_{j=1}^i (1 − λ_j) = Σ_{j=i+1}^{n+1} p_j

        = ∏_{i=1}^n λ_i^{Δ_{n:i}} (1 − λ_i)^{−Δ_{n:i}} ∏_{j=1}^n (1 − λ_j)^{n−j+1}

        = ∏_{i=1}^n λ_i^{Δ_{n:i}} (1 − λ_i)^{n − Δ_{n:i} − i + 1} .

Since each term of this likelihood has the same form as a Binomial likelihood, it is clear that it is maximized by

    λ̂_i = Δ_{n:i} / (n − i + 1) ,    i = 1, …, n .

Note that this agrees exactly with our estimator Λ̂_n derived above (in the case of no ties): ΔΛ̂_n(Z_{n:i}) = Δ_{n:i}/(n − i + 1).

The right side of (5) is called the product integral; see Gill and Johansen (1990) for a survey. It follows from Proposition 6.1 that the nonparametric maximum likelihood estimator of F in the case of censored data is the product limit estimator F̂_n given by

(6)    1 − F̂_n(t) = ∏_{s≤t} (1 − ΔΛ̂_n(s)) = ∏_{i: Z_{n:i} ≤ t} ( 1 − Δ_{n:i}/(n − i + 1) )  if there are no ties.

This estimator was found by Kaplan and Meier (1958). Breslow and Crowley (1974) proved, using empirical process theory, that

(7)    √n( Λ̂_n − Λ ) ⇒ B(C)  in D[0, τ], τ < τ_H ;

and hence that

(8)    √n( F̂_n − F ) ⇒ (1 − F) B(C) =_d [ (1 − F)/(1 − K) ] U(K)  in D[0, τ]

for τ < τ_H, where B denotes standard Brownian motion, U denotes standard Brownian bridge, and

    C(t) ≡ ∫_0^t 1/(1 − H_−)² dH^{uc} ,    K(t) ≡ C(t)/(1 + C(t)) .

Note that when there is no censoring and F is continuous, K = F and the limit process in (8) becomes just U(F), the limit process of the usual empirical process. Martingale methods for proving the convergences (7) are due to Gill (1980), (1983); see Shorack and Wellner (1986), Chapter 7.
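Both (4) and (6) are easy to compute from the ordered data; the following sketch (with a small made-up sample and no ties) returns the Nelson-Aalen estimator Λ̂_n and the product-limit (Kaplan-Meier) estimator 1 − F̂_n at the ordered observation times.

```python
import numpy as np

def km_and_na(Z, D):
    """Z: observed times Z_i = X_i ^ Y_i; D: Delta_i (1 = death, 0 = censored).
    Returns ordered times, Lambda-hat (4), and the product-limit survival (6)."""
    order = np.argsort(Z)
    Z, D = Z[order], D[order]
    n = len(Z)
    at_risk = n - np.arange(n)            # n - i + 1 subjects at risk at Z_{n:i}
    jumps = D / at_risk                   # Delta Lambda_n(Z_{n:i}) = D_{n:i}/(n-i+1)
    Lam = np.cumsum(jumps)                # Nelson-Aalen estimator (4)
    surv = np.cumprod(1 - jumps)          # Kaplan-Meier estimator (6)
    return Z, Lam, surv

Z = np.array([2.0, 5.0, 3.0, 8.0, 4.0])   # made-up observed times
D = np.array([1,   0,   1,   1,   0  ])   # made-up censoring indicators
times, Lam, S = km_and_na(Z, D)
print(S)
```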

Example 6.3 (Cox's proportional hazards model and profile likelihood). Suppose that T is a survival time and Z is a covariate vector with values in R^k. Further, suppose that (T|Z) has conditional hazard function

    λ(t|z) = e^{θ^T z} λ(t) .

Here θ ∈ R^k and λ is an (unknown) baseline hazard function. Thus

    Λ(t|z) = exp(θ^T z) Λ(t) ,

and, assuming that F is continuous, with F̄ ≡ 1 − F,

    1 − F(t|z) = F̄(t|z) = F̄(t)^{exp(θ^T z)} ,

or,

    f(t|z) = exp(θ^T z) F̄(t)^{exp(θ^T z)} λ(t) .

If we assume that Z has density h, then

    p(t, z; θ, λ, h) = p(t, z) = exp(θ^T z) F̄(t)^{exp(θ^T z)} λ(t) h(z) .

Hence

    log p(T, Z; θ, λ, h) = θ^T Z − exp(θ^T Z) Λ(T) + log λ(T) + log h(Z) .

Suppose that (T_1, Z_1), …, (T_n, Z_n) are i.i.d. with density p. Assume that 0 < T_{(1)} < ⋯ < T_{(n)} denote the ordered survival times, with Z_{(i)} the covariate corresponding to T_{(i)}.

Thus the profile log-likelihood for θ is given by

(9)    l^{prof}(θ|X) = log [ ∏_{i=1}^n ( exp(θ^T Z_{(i)}) / Σ_{j≥i} exp(θ^T Z_{(j)}) ) · (1/(ne)^n) ] .

The first factor here is (the log of) Cox's partial likelihood for θ; Cox (1972) derived this by other means. Maximizing it over θ yields Cox's partial likelihood estimator of θ, which is in fact the maximum (nonparametric or semiparametric) profile likelihood estimator. Let

    θ̂_n ≡ argmax_θ l^{prof}(θ|X) .

It turns out that this estimator is (asymptotically) efficient; this was proved by Efron (1977) and Begun, Hall, Huang, and Wellner (1983). Furthermore, the natural cumulative hazard function estimator is just

    Λ̂_n(t) = Σ_{T_{(i)} ≤ t} 1 / Σ_{j≥i} exp(θ̂^T Z_{(j)}) = ∫_0^t 1/Y_n(s, θ̂) dH_n(s) ,

where

    H_n(t) ≡ n⁻¹ Σ_{i=1}^n 1_{[T_i ≤ t]} ,    Y_n(t, θ) ≡ n⁻¹ Σ_{i=1}^n 1_{[T_i ≥ t]} exp(θ^T Z_i) .

This estimator was derived by Breslow (1972), (1974), and is now commonly called the Breslow estimator of Λ. It is also asymptotically efficient; see Begun, Hall, Huang, and Wellner (1983) and Bickel, Klaassen, Ritov, and Wellner (1993). Although our treatment here has not included right censoring, this can easily be incorporated in this model, and this was one of the key contributions of Cox (1972).
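Cox's partial likelihood (the first factor in (9)) can be maximized by Newton's method; the sketch below does this for a one-dimensional covariate on simulated data with no censoring. All constants, including the true θ = 0.7, are made up for illustration.

```python
import numpy as np

def cox_newton(T, Z, steps=25):
    """Newton iterations for the (concave) log partial likelihood, k = 1."""
    order = np.argsort(T)                 # sort so that the risk set of i is {j >= i}
    Z = Z[order]
    theta = 0.0
    for _ in range(steps):
        w = np.exp(theta * Z)
        # reversed cumulative sums over risk sets: sum_{j>=i} of w, w*Z, w*Z^2
        S0 = np.cumsum(w[::-1])[::-1]
        S1 = np.cumsum((w * Z)[::-1])[::-1]
        S2 = np.cumsum((w * Z ** 2)[::-1])[::-1]
        score = (Z - S1 / S0).sum()       # derivative of log partial likelihood
        info = (S2 / S0 - (S1 / S0) ** 2).sum()   # minus second derivative
        theta += score / info
    return theta

rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=n)
T = rng.exponential(1.0, n) / np.exp(0.7 * Z)   # hazard exp(0.7 z) * 1
theta_hat = cox_newton(T, Z)
print(theta_hat)
```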

Example 6.4 (Estimation of a concave distribution function and monotone decreasing density). Suppose that the model 𝒫 is all probability distributions P on R⁺ = [0, ∞) with corresponding distribution functions F which are concave. It follows that the distribution function F corresponding to P ∈ 𝒫 has a density f and that f is nonincreasing. It was shown by Grenander (1956) that if X_1, …, X_n are i.i.d. P ∈ 𝒫 with distribution function F, then the MLE of F over 𝒫 is the least concave majorant F̂_n of 𝔽_n; and thus the MLE f̂_n of f is given by the slope of F̂_n. See Barlow, Bartholomew, Bremner, and Brunk (1972) for this and related results. It was shown by Kiefer and Wolfowitz (1976) that

    √n( F̂_n − F ) ⇒ U(F) ,

and this phenomenon of no improvement or reduction in asymptotic variance, even though the model 𝒫 is a proper subset of ℳ ≡ { all P on R⁺ }, is explained by Millar (1979). Prakasa Rao (1969) showed that if f(t) > 0, then

    n^{1/3} ( f̂_n(t) − f(t) ) →_d | f(t) f′(t)/2 |^{1/3} (2Z) ,

where Z is the location of the maximum of the process { B(t) − t² : t ∈ R } and B is standard Brownian motion starting from 0; his proof has been greatly simplified and the limit distribution examined in detail by Groeneboom (1984), (1989). These results have been extended to estimation of a monotone density with right-censored data by Huang and Zhang (1994) and Huang and Wellner (1995).

Example 6.5 (Interval censored or "current status" data). Suppose, as in Example 6.2, that X_1, …, X_n are i.i.d. F and Y_1, …, Y_n are i.i.d. G, but now suppose that we only observe (Y_i, Δ_i) where Δ_i = 1_{[X_i ≤ Y_i]}. Again the goal is to estimate F, even though we never observe an X directly, but only the indicators Δ. If G has density g, then the density p(y, δ) of the i.i.d. pairs (Y, Δ) is

    p(y, δ) = F(y)^δ (1 − F(y))^{1−δ} g(y) .

If we suppose that there are no ties in the Y's, let Y_{n:1} < ⋯ < Y_{n:n} denote the ordered observation times, with Δ_{n:1}, …, Δ_{n:n} the corresponding indicators. The MLE F̂_n of F can then be computed as follows:

(i) Plot the points { (0, 0), (i, Σ_{j≤i} Δ_{n:j}), i = 1, …, n }. This is called the cumulative sum diagram.
(ii) Form H*(t), the greatest convex minorant of the cumulative sum diagram in (i).
(iii) Let P̂_i ≡ the left derivative of H* at i, i = 1, …, n.

Then P̂ = (P̂_1, …, P̂_n) is the unique vector maximizing l(P, q). We define F̂_n to be the piecewise constant function which equals P̂_i on the interval (Y_{n:i}, Y_{n:i+1}]. Groeneboom and Wellner (1992) show that if f(t), g(t) > 0, then

    n^{1/3} ( F̂_n(t) − F(t) ) →_d [ F(t)(1 − F(t)) f(t) / (2 g(t)) ]^{1/3} (2Z) ,

where Z is the location of the maximum of the process { B(t) − t² : t ∈ R } as in Example 6.4.

For further discussion of the definition of nonparametric maximum likelihood estimators see Kiefer and Wolfowitz (1956) and Scholz (1980). For applications to a mixture model, see Jewell (1982). Other applications of nonparametric maximum likelihood include the work of Vardi (1985) and Gill, Vardi, and Wellner (1988) on biased sampling models. Nonparametric maximum likelihood estimators may be inconsistent; see e.g. Boyles, Marshall, and Proschan (1985), and Barlow, Bartholomew, Bremner, and Brunk (1972), pages 257-258. Some progress on the general theory of nonparametric maximum likelihood estimators has been made by Gill (1989), Gill and van der Vaart (1993), and van der Vaart (1995).
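Steps (i)-(iii) can be computed with the pool-adjacent-violators algorithm, which produces exactly the left derivatives of the greatest convex minorant of the cumulative sum diagram. A minimal sketch, with made-up data and no ties among the Y's:

```python
import numpy as np

def current_status_mle(Y, Delta):
    """Left derivatives of the greatest convex minorant of the cumulative sum
    diagram {(i, sum_{j<=i} Delta_{n:j})}, via pool-adjacent-violators."""
    order = np.argsort(Y)
    d = Delta[order].astype(float)
    blocks = []                           # each block: [weight, pooled mean]
    for x in d:
        blocks.append([1.0, x])
        # pool adjacent blocks while they violate the nondecreasing constraint
        while len(blocks) > 1 and blocks[-2][1] >= blocks[-1][1]:
            w2, v2 = blocks.pop()
            w1, v1 = blocks.pop()
            blocks.append([w1 + w2, (w1 * v1 + w2 * v2) / (w1 + w2)])
    return np.concatenate([[v] * int(w) for w, v in blocks])

Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up observation times
Delta = np.array([0, 1, 0, 1, 1])         # made-up indicators 1{X_i <= Y_i}
Fhat = current_status_mle(Y, Delta)
print(Fhat)   # [0.  0.5 0.5 1.  1. ]
```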

7 Limit Theory for the Statistical Agnostic

In the preceding sections we studied the limit behavior of the MLE θ̂_n (or ELE θ̃_n) under the assumption that the model 𝒫 is true; i.e. assuming that the data X_1, …, X_n were governed by a probability distribution P = P_θ ∈ 𝒫. Frequently, however, we are in the position of not being at all sure that the true P is an element of the model 𝒫, and it is natural to ask about the asymptotic behavior of θ̂_n (or θ̃_n) when, in fact, P ∉ 𝒫. This point of view is implicit in the robustness literature, and especially in the work of Huber (1964), (1967), and White (1982).

We begin here with a heuristic and rather informal treatment which will then be made rigorous using additional (convexity) hypotheses. For related results, see Pollard (1985), Pakes and Pollard (1989), and Bickel, Klaassen, Ritov and Wellner (1993), Appendix A.10 and Sections 7.2-7.4.

Heuristics for Maximum Likelihood

Suppose (temporarily) that X, X_1, …, X_n are i.i.d. P on (𝒳, 𝒜), and that

    ρ(x; θ) ≡ log p(x; θ) ,    x ∈ 𝒳, θ ∈ Θ ⊂ R^d ,

is twice continuously differentiable in θ for P-a.e. x. We do not assume that P ∈ 𝒫 = { P_θ : dP_θ/dµ = p_θ, θ ∈ Θ }. Let

    ψ(x; θ) ≡ ∇_θ ρ(x; θ) ,

and suppose that E|ρ(X_1; θ)| < ∞ and E|ψ(X_1; θ)|² < ∞ for all θ ∈ Θ. Suppose that P has density p with respect to a measure µ which also dominates all P_θ, θ ∈ Θ. Then the maximum likelihood estimator maximizes (1/n) Σ_{i=1}^n ρ(X_i; θ), and

    (1/n) Σ_{i=1}^n ρ(X_i; θ) →_{a.s.} E_P ρ(X_1; θ) = E_P log p(X_1; θ)
        = E_P log p(X_1) − E_P log [ p(X_1)/p(X_1; θ) ]
        = E_P log p(X_1) − K(P, P_θ) .

Since K(P, P_θ) ≥ 0, the last quantity is maximized by choosing θ to make K(P, P_θ) as small as possible:

    sup_θ { E_P log p(X_1) − K(P, P_θ) } = E_P log p(X_1) − inf_θ K(P, P_θ)
                                        = E_P log p(X_1) − K(P, P_{θ_0})

if we suppose that the infimum is achieved at θ_0 ≡ θ_0(P). Thus it is natural to expect that (under reasonable additional conditions)

    θ̂_n = argmax_θ { (1/n) Σ_{i=1}^n ρ(X_i; θ) } →_p θ_0(P) = argmin_{θ∈Θ} K(P, P_θ) .

What about a central limit theorem? First note that since

    θ_0(P) maximizes E_P ρ(X_1; θ) = P ρ(·; θ)

and

    θ̂_n maximizes (1/n) Σ_{i=1}^n ρ(X_i; θ) = ℙ_n ρ(·; θ) ,

we expect that

    0 = ∇_θ E_P ρ(X_1; θ) |_{θ=θ_0} = E_P ψ(X_1; θ_0)

and

    0 = ∇_θ ℙ_n ρ(·; θ) |_{θ=θ̂_n} = ℙ_n ψ(·; θ̂_n) .

Therefore, by Taylor expansion of ℙ_n ψ(·; θ) about θ_0, it follows that

    0 = Ψ_n(θ̂_n) ≡ ℙ_n ψ(·; θ̂_n) = Ψ_n(θ_0) + Ψ̇_n(θ_n*)(θ̂_n − θ_0)

where

(1)    √n Ψ_n(θ_0) = √n ℙ_n ψ(·; θ_0)

(2)                = (1/√n) Σ_{i=1}^n ψ(X_i; θ_0) →_d Z ∼ N_d(0, K)

with

    K ≡ E_P ψ(X_1; θ_0) ψ^T(X_1; θ_0) .

We also have

    Ψ̇_n(θ_0) = ℙ_n ψ̇(·; θ_0) →_{a.s.} E_P ψ̇(X; θ_0) ,

and hence we also expect to be able to show that

    Ψ̇_n(θ_n*) →_p E_P ψ̇(X_1; θ_0) ≡ J  (d × d) .

Therefore, if J is nonsingular we conclude from (1) that

    √n( θ̂_n − θ_0 ) →_d −J⁻¹ Z ∼ N_d(0, J⁻¹ K (J⁻¹)^T) .

Note that if, in fact, P ∈ 𝒫, then K = −J = I_{θ_0}, and the asymptotic variance-covariance matrix J⁻¹ K (J⁻¹)^T reduces to just the classical and familiar inverse information matrix.
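As a concrete sketch of the "sandwich" matrix J⁻¹K(J⁻¹)^T, the code below fits a (misspecified) normal model to t-distributed data and estimates both pieces; here ψ is the normal score vector and J is E_P ψ̇, computed analytically at the fitted parameters, and all constants (degrees of freedom, sample size) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_t(df=5, size=20000)      # true P: t(5), not in the normal model

# MLE of theta = (mu, sigma^2) in the misspecified normal model
mu, s2 = x.mean(), x.var()

# psi(x; theta-hat): normal-model scores evaluated at the fitted parameters
psi = np.stack([(x - mu) / s2,
                -0.5 / s2 + (x - mu) ** 2 / (2 * s2 ** 2)])
K = psi @ psi.T / len(x)                  # estimates E_P psi psi^T

# J = E_P psi-dot for the normal model, evaluated at the fitted parameters
J = np.array([[-1 / s2, 0.0],
              [0.0, -0.5 / s2 ** 2]])

Jinv = np.linalg.inv(J)
sandwich = Jinv @ K @ Jinv.T              # J^{-1} K (J^{-1})^T
print(sandwich[0, 0], sandwich[1, 1], 2 * s2 ** 2)
```

For the mean component the sandwich variance coincides with the sample variance, but for the variance component the sandwich entry is much larger than the model-based value 2σ⁴, reflecting the heavy tails of the t(5) distribution: this is exactly the gap between J⁻¹K(J⁻¹)^T and the inverse information matrix when P ∉ 𝒫.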