Chapter 4 Efficient Likelihood Estimation and Related Tests
1. Maximum likelihood and efficient likelihood estimation
2. Likelihood ratio, Wald, and Rao (or score) tests
3. Examples
4. Consistency of Maximum Likelihood Estimates
5. The EM algorithm and related methods
6. Nonparametric MLE
7. Limit theory for the statistical agnostic: $P \notin \mathcal{P}$

1 Maximum likelihood and efficient likelihood estimation

We begin with a brief discussion of Kullback-Leibler information.

Definition 1.1 Let $P$ be a probability measure, and let $Q$ be a sub-probability measure on $(\mathcal{X}, \mathcal{A})$ with densities $p$ and $q$ with respect to a sigma-finite measure $\mu$ ($\mu = P + Q$ always works). Thus $P(\mathcal{X}) = 1$ and $Q(\mathcal{X}) \le 1$. Then the Kullback-Leibler information $K(P, Q)$ is
\[
(1) \qquad K(P, Q) \equiv E_P\left[ \log \frac{p(X)}{q(X)} \right].
\]

Lemma 1.1 For a probability measure $P$ and a (sub-)probability measure $Q$, the Kullback-Leibler information $K(P, Q)$ is always well-defined, and
\[
K(P, Q) \;
\begin{cases}
\in [0, \infty] & \text{always,} \\
= 0 & \text{if and only if } Q = P.
\end{cases}
\]

Proof. Now
\[
K(P, Q) =
\begin{cases}
\log 1 = 0 & \text{if } P = Q, \\
\log M > 0 & \text{if } P = MQ, \ M > 1.
\end{cases}
\]
If $P \ne MQ$, then Jensen's inequality is strict and yields
\[
K(P, Q) = E_P\left[ -\log \frac{q(X)}{p(X)} \right]
> -\log E_P\left[ \frac{q(X)}{p(X)} \right]
= -\log E_Q 1_{[p(X) > 0]}
\ge -\log 1 = 0. \qquad \Box
\]

Now we need some assumptions and notation. Suppose that the model $\mathcal{P}$ is given by
\[
\mathcal{P} = \{ P_\theta : \theta \in \Theta \}.
\]
We will impose the following hypotheses about $\mathcal{P}$:

Assumptions:
A0. $\theta \ne \theta^*$ implies $P_\theta \ne P_{\theta^*}$.
A1. $A \equiv \{ x : p_\theta(x) > 0 \}$ does not depend on $\theta$.
A2. $P_\theta$ has density $p_\theta$ with respect to the $\sigma$-finite measure $\mu$, and $X_1, \ldots, X_n$ are i.i.d. $P_{\theta_0} \equiv P_0$.

Notation:
\[
L(\theta) \equiv L_n(\theta) \equiv L(\theta \mid X) \equiv \prod_{i=1}^n p_\theta(X_i),
\]
\[
l(\theta) = l(\theta \mid X) \equiv l_n(\theta) \equiv \log L_n(\theta) = \sum_{i=1}^n \log p_\theta(X_i),
\]
\[
l(B) \equiv l(B \mid X) \equiv l_n(B) = \sup_{\theta \in B} l(\theta \mid X).
\]

Here is a preliminary result which motivates our definition of the maximum likelihood estimator.

Theorem 1.1 If A0 - A2 hold, then for $\theta \ne \theta_0$,
\[
\frac{1}{n} \log \frac{L_n(\theta_0)}{L_n(\theta)}
= \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta_0}(X_i)}{p_\theta(X_i)}
\rightarrow_{a.s.} K(P_{\theta_0}, P_\theta) > 0,
\]
and hence $P_{\theta_0}\big( L_n(\theta_0 \mid X) > L_n(\theta \mid X) \big) \to 1$ as $n \to \infty$.

Proof. The first assertion is just the strong law of large numbers; note that
\[
E_{\theta_0} \log \frac{p_{\theta_0}(X)}{p_\theta(X)} = K(P_{\theta_0}, P_\theta) > 0
\]
by Lemma 1.1 and A0. The second assertion is an immediate consequence of the first. $\Box$

Theorem 1.1 motivates the following definition.

Definition 1.2 The value $\hat\theta = \hat\theta_n$ of $\theta$ which maximizes the likelihood $L(\theta \mid X)$, if it exists and is unique, is the maximum likelihood estimator (MLE) of $\theta$. Thus $L(\hat\theta) = L(\Theta)$ or $l(\hat\theta_n) = l(\Theta)$.

Cautions:
• $\hat\theta_n$ may not exist.
• $\hat\theta_n$ may exist, but may not be unique.
• Note that the definition depends on the version of the density $p_\theta$ which is selected; since this is not unique, different versions of $p_\theta$ lead to different MLE's.

When $\Theta \subset \mathbb{R}^d$, the usual approach to finding $\hat\theta_n$ is to solve the likelihood (or score) equations
\[
(2) \qquad \dot{l}(\theta \mid X) \equiv \dot{l}_n(\theta) = 0;
\]
i.e. $\dot{l}_{\theta_j}(\theta \mid X) = 0$ for $j = 1, \ldots, d$. The solution, $\tilde\theta_n$ say, may not be the MLE, but may yield simply a local maximum of $l(\theta)$.

The likelihood ratio statistic for testing $H : \theta = \theta_0$ versus $K : \theta \ne \theta_0$ is
\[
\lambda_n = \frac{L(\Theta)}{L(\theta_0)} = \frac{\sup_{\theta \in \Theta} L(\theta \mid X)}{L(\theta_0 \mid X)} = \frac{L(\hat\theta_n)}{L(\theta_0)},
\qquad
\tilde\lambda_n = \frac{L(\tilde\theta_n)}{L(\theta_0)}.
\]
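To make these definitions concrete, the following is a minimal numerical sketch in which the likelihood is maximized by a general-purpose optimizer and the likelihood ratio statistic $2 \log \lambda_n$ is evaluated. The Gamma model, the sample size, the null value $\theta_0$, and all function names are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0 / 1.5, size=200)   # simulated data; true theta = (shape, rate) = (2.0, 1.5)

def negloglik(theta, x):
    # -l_n(theta) = -sum_i log p_theta(X_i) for the Gamma(shape, rate) density
    shape, rate = theta
    if shape <= 0 or rate <= 0:
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=1.0 / rate))

theta0 = np.array([1.0, 1.0])                          # null hypothesis H: theta = theta0
fit = optimize.minimize(negloglik, x0=theta0, args=(x,), method="Nelder-Mead")
theta_hat = fit.x                                      # numerical maximizer of the likelihood

# Likelihood ratio statistic 2 log lambda_n = 2 [ l_n(theta_hat) - l_n(theta0) ];
# by Theorem 1.2 (iii) below it is approximately chi^2 with d = 2 degrees of freedom under H.
lr_stat = 2.0 * (negloglik(theta0, x) - negloglik(theta_hat, x))
print(theta_hat, lr_stat, stats.chi2.sf(lr_stat, df=2))
```

Since the optimizer only finds a local maximum, this computes a root of the likelihood equations in the sense of the caution above rather than a certified global MLE.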
Write $P_0$, $E_0$ for $P_{\theta_0}$, $E_{\theta_0}$. Here are some more assumptions about the model $\mathcal{P}$ which we will use to treat these estimators and test statistics.

Assumptions, continued:
A3. $\Theta$ contains an open neighborhood $\Theta_0 \subset \mathbb{R}^d$ of $\theta_0$ for which:
(i) For $\mu$ a.e. $x$, $l(\theta \mid x) \equiv \log p_\theta(x)$ is twice continuously differentiable in $\theta$.
(ii) For a.e. $x$, the third order derivatives exist and $l_{jkl}(\theta \mid x)$ satisfy $|l_{jkl}(\theta \mid x)| \le M_{jkl}(x)$ for $\theta \in \Theta_0$ for all $1 \le j, k, l \le d$ with $E_0 M_{jkl}(X) < \infty$.
A4. (i) $E_0\{ \dot{l}_j(\theta_0 \mid X) \} = 0$ for $j = 1, \ldots, d$.
(ii) $E_0\{ \dot{l}_j^2(\theta_0 \mid X) \} < \infty$ for $j = 1, \ldots, d$.
(iii) $I(\theta_0) = \big( -E_0\{ \ddot{l}_{jk}(\theta_0 \mid X) \} \big)$ is positive definite.

Let
\[
Z_n \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot{l}(\theta_0 \mid X_i)
\qquad \text{and} \qquad
\tilde{l}(\theta_0 \mid X) = I^{-1}(\theta_0) \dot{l}(\theta_0 \mid X),
\]
so that
\[
I^{-1}(\theta_0) Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{l}(\theta_0 \mid X_i).
\]

Theorem 1.2 Suppose that $X_1, \ldots, X_n$ are i.i.d. $P_{\theta_0} \in \mathcal{P}$ with density $p_{\theta_0}$, where $\mathcal{P}$ satisfies A0 - A4. Then:
(i) With probability converging to 1 there exist solutions $\tilde\theta_n$ of the likelihood equations such that $\tilde\theta_n \rightarrow_p \theta_0$ when $P_0 = P_{\theta_0}$ is true.
(ii) $\tilde\theta_n$ is asymptotically linear with influence function $\tilde{l}(\theta_0 \mid x)$. That is,
\[
\sqrt{n}(\tilde\theta_n - \theta_0) = I^{-1}(\theta_0) Z_n + o_p(1)
= \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{l}(\theta_0 \mid X_i) + o_p(1)
\rightarrow_d I^{-1}(\theta_0) Z \equiv D \sim N_d(0, I^{-1}(\theta_0)).
\]
(iii)
\[
2 \log \tilde\lambda_n \rightarrow_d Z^T I^{-1}(\theta_0) Z = D^T I(\theta_0) D \sim \chi^2_d.
\]
(iv)
\[
W_n \equiv \sqrt{n}(\tilde\theta_n - \theta_0)^T I_n(\tilde\theta_n) \sqrt{n}(\tilde\theta_n - \theta_0)
\rightarrow_d D^T I(\theta_0) D = Z^T I^{-1}(\theta_0) Z \sim \chi^2_d,
\]
where
\[
I_n(\tilde\theta_n) =
\begin{cases}
I(\tilde\theta_n), & \text{or} \\
n^{-1} \sum_{i=1}^n \dot{l}(\tilde\theta_n \mid X_i) \dot{l}(\tilde\theta_n \mid X_i)^T, & \text{or} \\
- n^{-1} \sum_{i=1}^n \ddot{l}(\tilde\theta_n \mid X_i).
\end{cases}
\]
(v)
\[
R_n \equiv Z_n^T I^{-1}(\theta_0) Z_n \rightarrow_d Z^T I^{-1}(\theta_0) Z \sim \chi^2_d.
\]
Here we could replace $I(\theta_0)$ by any of the possibilities for $I_n(\tilde\theta_n)$ given in (iv) and the conclusion continues to hold.
(vi) The model $\mathcal{P}$ satisfies the LAN condition at $\theta_0$:
\[
l(\theta_0 + n^{-1/2} t) - l(\theta_0) = t^T Z_n - \frac{1}{2} t^T I(\theta_0) t + o_{P_0}(1)
\rightarrow_d t^T Z - \frac{1}{2} t^T I(\theta_0) t \sim N\big(-(1/2)\sigma_0^2, \sigma_0^2\big),
\]
where $\sigma_0^2 \equiv t^T I(\theta_0) t$. Note that
\[
\sqrt{n}(\hat\theta_n - \theta_0) = \hat{t}_n = \arg\max_t \{ l_n(\theta_0 + n^{-1/2} t) - l_n(\theta_0) \}
\rightarrow_d \arg\max_t \{ t^T Z - (1/2) t^T I(\theta_0) t \} = I^{-1}(\theta_0) Z \sim N_d(0, I^{-1}(\theta_0)).
\]

Remark 1.1 Note that the asymptotic form of the log-likelihood given in part (vi) of Theorem 1.2 is exactly the log-likelihood ratio for a normal mean model $N_d(I(\theta_0) t, I(\theta_0))$. Also note that
\[
t^T Z - \frac{1}{2} t^T I(\theta_0) t
= \frac{1}{2} Z^T I^{-1}(\theta_0) Z - \frac{1}{2} \big(t - I^{-1}(\theta_0) Z\big)^T I(\theta_0) \big(t - I^{-1}(\theta_0) Z\big),
\]
which is maximized as a function of $t$ by $t = I^{-1}(\theta_0) Z$ with maximum value $Z^T I^{-1}(\theta_0) Z / 2$.

Corollary 1 Suppose that A0 - A4 hold and that $\nu \equiv \nu(P_\theta) = q(\theta)$ is differentiable at $\theta_0 \in \Theta$. Then $\tilde\nu_n \equiv q(\tilde\theta_n)$ satisfies
\[
\sqrt{n}(\tilde\nu_n - \nu_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{l}_\nu(\theta_0 \mid X_i) + o_p(1)
\rightarrow_d N\big(0, \dot{q}^T(\theta_0) I^{-1}(\theta_0) \dot{q}(\theta_0)\big),
\]
where $\tilde{l}_\nu(\theta_0 \mid X_i) = \dot{q}^T(\theta_0) I^{-1}(\theta_0) \dot{l}(\theta_0 \mid X_i)$ and $\nu_0 \equiv q(\theta_0)$.

If the likelihood equations (2) are difficult to solve or have multiple roots, then it is possible to use a one-step approximation. Suppose that $\bar\theta_n$ is a preliminary estimator of $\theta$ and set
\[
(3) \qquad \check\theta_n \equiv \bar\theta_n + I_n^{-1}(\bar\theta_n)\big( n^{-1} \dot{l}(\bar\theta_n \mid X) \big).
\]
The estimator $\check\theta_n$ is sometimes called a one-step estimator.

Theorem 1.3 Suppose that A0 - A4 hold, and that $\bar\theta_n$ satisfies $n^{1/4}(\bar\theta_n - \theta_0) = o_p(1)$; note that the latter holds if $\sqrt{n}(\bar\theta_n - \theta_0) = O_p(1)$. Then
\[
\sqrt{n}(\check\theta_n - \theta_0) = I^{-1}(\theta_0) Z_n + o_p(1) \rightarrow_d N_d(0, I^{-1}(\theta_0)),
\]
where $Z_n \equiv n^{-1/2} \sum_{i=1}^n \dot{l}(\theta_0 \mid X_i)$.
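As a concrete illustration of the one-step construction (3) and Theorem 1.3, here is a minimal sketch assuming a Cauchy location model with the sample median as the preliminary $\sqrt{n}$-consistent estimator $\bar\theta_n$; the model, the use of the exact Fisher information $I(\theta) = 1/2$ in place of $I_n(\bar\theta_n)$, and the function names are illustrative choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
x = theta_true + rng.standard_cauchy(500)   # Cauchy location sample
n = x.size

def score(theta, x):
    # l_dot(theta | X) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2) for the Cauchy location model
    u = x - theta
    return np.sum(2.0 * u / (1.0 + u ** 2))

theta_bar = np.median(x)   # preliminary estimator; the sample median is sqrt(n)-consistent here
I_fisher = 0.5             # Fisher information of the Cauchy location family, used in place of I_n(theta_bar)

# One-step estimator (3): theta_check = theta_bar + I^{-1} * n^{-1} * l_dot(theta_bar | X)
theta_check = theta_bar + (1.0 / I_fisher) * score(theta_bar, x) / n
print(theta_bar, theta_check)
```

The single Newton-type correction upgrades the merely $\sqrt{n}$-consistent median to an estimator with the same limiting distribution as an efficient likelihood-equation root.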
Proof of Theorem 1.2. (i) Existence and consistency. For $a > 0$, let
\[
Q_a \equiv \{ \theta \in \Theta : |\theta - \theta_0| = a \}.
\]
We will show that
\[
(a) \qquad P_0\{ l(\theta) < l(\theta_0) \ \text{for all} \ \theta \in Q_a \} \to 1 \quad \text{as} \ n \to \infty.
\]
This implies that $L$ has a local maximum inside $Q_a$. Since the likelihood equations must be satisfied at a local maximum, it will follow that for any $a > 0$, with probability converging to 1, the likelihood equations have a solution $\tilde\theta_n(a)$ within $Q_a$; taking the root closest to $\theta_0$ completes the proof.

To prove (a), write
\[
\frac{1}{n}\big(l(\theta) - l(\theta_0)\big)
= \frac{1}{n} (\theta - \theta_0)^T \dot{l}(\theta_0)
- \frac{1}{2} (\theta - \theta_0)^T \left( -\frac{1}{n} \ddot{l}(\theta_0) \right) (\theta - \theta_0)
+ \frac{1}{6n} \sum_{j=1}^d \sum_{k=1}^d \sum_{l=1}^d (\theta_j - \theta_{j0})(\theta_k - \theta_{k0})(\theta_l - \theta_{l0}) \sum_{i=1}^n \gamma_{jkl}(X_i) M_{jkl}(X_i)
\]
\[
(b) \qquad = S_1 + S_2 + S_3,
\]
where, by A3(ii), $0 \le |\gamma_{jkl}(x)| \le 1$.
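To see parts (iii)-(v) of Theorem 1.2 numerically, here is a small Monte Carlo sketch for the one-dimensional exponential model with density $\theta e^{-\theta x}$, where the MLE, the score, and the information are available in closed form; the model and all simulation settings are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta0, n, B = 1.0, 200, 2000

def loglik(th, x):
    # l_n(theta) = sum_i log(theta * exp(-theta * x_i)) = n log(theta) - theta * sum_i x_i
    return x.size * np.log(th) - th * x.sum()

lr, wald, rao = np.empty(B), np.empty(B), np.empty(B)
for b in range(B):
    x = rng.exponential(scale=1.0 / theta0, size=n)           # data generated under H: theta = theta0
    theta_hat = 1.0 / x.mean()                                 # MLE: unique root of the likelihood equation
    lr[b] = 2.0 * (loglik(theta_hat, x) - loglik(theta0, x))   # 2 log lambda_n
    wald[b] = n * (theta_hat - theta0) ** 2 / theta_hat ** 2   # W_n with I_n(theta_hat) = 1 / theta_hat^2
    Zn = (n / theta0 - x.sum()) / np.sqrt(n)                   # Z_n = n^{-1/2} sum_i l_dot(theta0 | X_i)
    rao[b] = theta0 ** 2 * Zn ** 2                             # R_n = Z_n^T I^{-1}(theta0) Z_n, I(theta0) = 1/theta0^2

# All three statistics should be approximately chi^2_1 under H.
print(np.quantile(lr, 0.95), np.quantile(wald, 0.95),
      np.quantile(rao, 0.95), stats.chi2.ppf(0.95, df=1))
```

Under $H : \theta = \theta_0$ the three empirical 95% quantiles should all be close to the $\chi^2_1$ quantile, reflecting the asymptotic equivalence of the likelihood ratio, Wald, and Rao statistics.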