Lecture Stat 461-561 Wald, Rao and Likelihood Ratio Tests

AD

February 2008

Outline: Introduction, Wald test, Rao test, Likelihood ratio test

Introduction

We want to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ using the log-likelihood.

We denote by $l(\theta)$ the log-likelihood and by $\hat\theta_n$ the consistent root of the likelihood equation.

Intuitively, the farther $\hat\theta_n$ is from $\theta_0$, the stronger the evidence against the null hypothesis. How far is "far enough"?

Wald Test

We use the fact that, under regularity assumptions, we have under $H_0$

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ I(\theta_0)^{-1}\right)$$
where
$$I(\theta_0) = -\mathbb{E}_{\theta_0}\!\left[\frac{\partial^2 \log f(X\mid\theta)}{\partial\theta^2}\right].$$
This suggests defining the following Wald statistic

$$W_n = \sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta_0)$$
or $W_n = \sqrt{n\,\hat I(\theta_0)}\,(\hat\theta_n - \theta_0)$, where $\hat I(\theta_0)$ is a consistent estimate of $I(\theta_0)$, e.g. $I(\hat\theta_n)$.

Under $H_0$, we have

$$W_n = \sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}(0, 1).$$
A Wald test is any test that rejects $H_0: \theta = \theta_0$ in favor of $H_1: \theta \neq \theta_0$ when $|W_n| \geq z_{\alpha/2}$, where $z_{\alpha/2}$ satisfies $P(Z \geq z_{\alpha/2}) = \alpha/2$.

It follows that, by construction, the Type I error probability is

$$P_{\theta_0}(|W_n| \geq z_{\alpha/2}) \to P_{\theta_0}(|Z| \geq z_{\alpha/2}) = \alpha,$$
so this is an asymptotically size-$\alpha$ test. Now consider $\theta \neq \theta_0$; then

$$W_n = \sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta_0) = \underbrace{\sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta)}_{\xrightarrow{D}\ \mathcal{N}(0,1)} + \underbrace{\sqrt{n\,I(\theta_0)}\,(\theta - \theta_0)}_{\xrightarrow{P}\ \pm\infty}$$
so
$$P_\theta(\text{reject } H_0) \to 1 \quad \text{as } n \to \infty.$$

Note that the (approximate) size-$\alpha$ Wald test rejects $H_0: \theta = \theta_0$ in favor of $H_1: \theta \neq \theta_0$ if and only if $\theta_0 \notin C$, where

$$C = \left(\hat\theta_n - \frac{z_{\alpha/2}}{\sqrt{n\,I(\theta_0)}},\ \hat\theta_n + \frac{z_{\alpha/2}}{\sqrt{n\,I(\theta_0)}}\right).$$
Thus testing the hypothesis is equivalent to checking whether the null value is in the confidence interval.

Similarly, we can test $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$. In this case we use $W_n = \sqrt{n\,I(\theta_0)}\,(\hat\theta_n - \theta_0)$ and reject $H_0$ if $W_n \geq z_\alpha$.

If $\theta = \theta_0$, we have

$$P_\theta(W_n \geq z_\alpha) \to P(Z \geq z_\alpha) = \alpha,$$
whereas if $\theta < \theta_0$, $P_\theta(W_n \geq z_\alpha) \to 0$, and if $\theta > \theta_0$, $P_\theta(W_n \geq z_\alpha) \to 1$.

Example: Assume we have $X_i \overset{\text{i.i.d.}}{\sim}$ Bernoulli$(p)$ and we want to test $H_0: p = p_0$ against $H_1: p \neq p_0$. We have for $S_n = \sum_{i=1}^n x_i$

$$f(x\mid p) = p^x (1-p)^{1-x} \ \Rightarrow\ l(p) = S_n \log p + (n - S_n)\log(1-p),$$
$$l'(p) = \frac{S_n}{p} - \frac{n - S_n}{1-p} \ \Rightarrow\ \hat p_n = \frac{S_n}{n}.$$
It is easy to check that

$$\sqrt{n}\,(\hat p_n - p_0) \xrightarrow{D} \mathcal{N}(0, \sigma_0^2)$$
where $\sigma_0^2 = p_0(1-p_0)$. This quantity can be consistently estimated through $\hat\sigma_n^2 = \hat p_n(1-\hat p_n)$, so the Wald test is based on the statistic
$$W_n = \frac{\hat p_n - p_0}{\sqrt{\hat p_n(1-\hat p_n)/n}}.$$

The Wald test is not limited to the MLE: you just need to know the asymptotic distribution of your test statistic.
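A minimal sketch of this Bernoulli Wald test; the function name and the counts in the usage line are illustrative, not from the lecture:

```python
from statistics import NormalDist

def wald_test_bernoulli(s_n, n, p0, alpha=0.05):
    """Wald test of H0: p = p0 against H1: p != p0.

    Uses the plug-in variance estimate p_hat * (1 - p_hat) / n.
    """
    p_hat = s_n / n                          # MLE of p
    se = (p_hat * (1 - p_hat) / n) ** 0.5    # estimated standard error
    w = (p_hat - p0) / se                    # Wald statistic
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return w, abs(w) >= z

# e.g. 60 successes in 100 trials against p0 = 0.5
w, reject = wald_test_bernoulli(s_n=60, n=100, p0=0.5)
```

Here $W_n \approx 2.04 > z_{0.025} \approx 1.96$, so the test rejects at the 5% level.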

Example: Assume $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ are two independent samples from populations with means $\mu_1$ and $\mu_2$. We write $\delta = \mu_1 - \mu_2$ and we want to test $H_0: \delta = 0$ versus $H_1: \delta \neq 0$. We build
$$W = \frac{\bar X - \bar Y}{\sqrt{\dfrac{S_1^2}{m} + \dfrac{S_2^2}{n}}}$$
where $S_1^2$ and $S_2^2$ are the sample variances. Thanks to the CLT, we have $W \xrightarrow{D} \mathcal{N}(0, 1)$ as $m, n \to \infty$.

Example (Comparing two prediction algorithms): We test one prediction algorithm on a test set of size $m$ and a second prediction algorithm on a second test set of size $n$. Let $X$ be the number of incorrect predictions for algorithm 1 and $Y$ the number of incorrect predictions for algorithm 2. Then $X \sim$ Binomial$(m, p_1)$ and $Y \sim$ Binomial$(n, p_2)$.

To test the null hypothesis $H_0: \delta = p_1 - p_2 = 0$ versus $H_1: \delta \neq 0$, we note that $\hat\delta = \hat p_1 - \hat p_2$ with estimated standard error

$$\sqrt{\frac{\hat p_1(1-\hat p_1)}{m} + \frac{\hat p_2(1-\hat p_2)}{n}}.$$
The (approximate) size-$\alpha$ test is to reject $H_0$ when

$$|W| = \frac{|\hat p_1 - \hat p_2|}{\sqrt{\dfrac{\hat p_1(1-\hat p_1)}{m} + \dfrac{\hat p_2(1-\hat p_2)}{n}}} \geq z_{\alpha/2},$$
as $W \xrightarrow{D} \mathcal{N}(0, 1)$ when $m, n \to \infty$ thanks to the CLT.
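A sketch of this two-proportion Wald test; the error counts in the usage line are made up for illustration:

```python
from statistics import NormalDist

def two_proportion_wald(x, m, y, n, alpha=0.05):
    """Wald test of H0: p1 - p2 = 0 given error counts x ~ Bin(m, p1), y ~ Bin(n, p2)."""
    p1_hat, p2_hat = x / m, y / n
    se = (p1_hat * (1 - p1_hat) / m + p2_hat * (1 - p2_hat) / n) ** 0.5
    w = (p1_hat - p2_hat) / se               # Wald statistic for the difference
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return w, abs(w) >= z

# e.g. algorithm 1 makes 30 errors out of 200, algorithm 2 makes 45 errors out of 200
w, reject = two_proportion_wald(x=30, m=200, y=45, n=200)
```

With these made-up counts $|W| \approx 1.93 < 1.96$, so the observed gap in error rates is not quite significant at the 5% level.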

In practice, we often have misspecified models! That is, there does not exist any $\theta_0$ such that $X_i \sim f(x\mid\theta_0)$; instead $X_i \sim g$, with $g$ obviously unknown. In this case $\theta_0$ is not the 'true' parameter but the parameter maximizing
$$\int \log f(x\mid\theta)\, g(x)\, dx \ \Rightarrow\ \int \left.\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta=\theta_0} g(x)\, dx = 0.$$

The Wald test becomes incorrect as we do NOT have

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ I(\theta_0)^{-1}\right).$$
To correct it, let's go back to the derivation of the asymptotic normality.

$$0 = l'(\hat\theta_n) \approx l'(\theta_0) + l''(\theta_0)(\hat\theta_n - \theta_0)$$
so
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \approx -\sqrt{n}\,\frac{l'(\theta_0)}{l''(\theta_0)} = -\frac{l'(\theta_0)/\sqrt{n}}{l''(\theta_0)/n}.$$

We have by the law of large numbers
$$\frac{l''(\theta)}{n} = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \log f(X_i\mid\theta)}{\partial\theta^2} \xrightarrow{P} \int \frac{\partial^2 \log f(x\mid\theta)}{\partial\theta^2}\, g(x)\, dx.$$
We also have
$$\frac{l'(\theta)}{\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial \log f(X_i\mid\theta)}{\partial\theta}$$
where
$$\mathrm{var}\!\left[\frac{\partial \log f(X\mid\theta)}{\partial\theta}\right] = \int \left(\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right)^2 g(x)\, dx - \left(\int \frac{\partial \log f(x\mid\theta)}{\partial\theta}\, g(x)\, dx\right)^2.$$
So by the CLT

$$\frac{l'(\theta_0)}{\sqrt{n}} \xrightarrow{D} \mathcal{N}\!\left(0,\ \int \left(\left.\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta=\theta_0}\right)^2 g(x)\, dx\right).$$

By Slutsky's lemma,

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ \frac{\int \left(\left.\dfrac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta_0}\right)^2 g(x)\, dx}{\left(\int \left.\dfrac{\partial^2 \log f(x\mid\theta)}{\partial\theta^2}\right|_{\theta_0} g(x)\, dx\right)^2}\right).$$
The asymptotic variance can be estimated consistently from the data. We can derive a Wald test based on this asymptotic variance.
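A sketch of this "sandwich" variance estimate for a scalar parameter, using the exponential working model $f(x\mid\theta) = \theta e^{-\theta x}$ (whose MLE is $1/\bar x$); both the function name and the data in the usage line are made up for illustration:

```python
def sandwich_variance(xs):
    """Estimated variance of theta_hat under possible misspecification.

    Working model: f(x|theta) = theta * exp(-theta * x); score is 1/theta - x,
    second derivative is -1/theta^2 (constant in x for this model).
    """
    n = len(xs)
    xbar = sum(xs) / n
    theta = 1.0 / xbar                               # consistent root of the likelihood equation
    b = sum((1.0 / theta - x) ** 2 for x in xs) / n  # (1/n) * sum of squared scores
    a = -1.0 / theta ** 2                            # (1/n) * sum of second derivatives
    return b / (n * a * a)                           # estimated var of theta_hat: B / (n * A^2)

xs = [0.5, 1.2, 0.3, 2.0, 0.9, 1.5, 0.7, 1.1]       # hypothetical data
var_hat = sandwich_variance(xs)
```

The numerator estimates $\int (\partial \log f/\partial\theta)^2 g\,dx$ and the denominator squares the estimate of $\int (\partial^2 \log f/\partial\theta^2) g\,dx$, matching the displayed asymptotic variance.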

This suggests developing a test for misspecification, as

$$\int \left(\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right)^2 f(x\mid\theta)\, dx + \int \frac{\partial^2 \log f(x\mid\theta)}{\partial\theta^2}\, f(x\mid\theta)\, dx = 0.$$
So we can propose the following test statistic

$$T = \frac{1}{n}\sum_{i=1}^n \left(\left.\frac{\partial \log f(x_i\mid\theta)}{\partial\theta}\right|_{\hat\theta_n}\right)^2 + \frac{1}{n}\sum_{i=1}^n \left.\frac{\partial^2 \log f(x_i\mid\theta)}{\partial\theta^2}\right|_{\hat\theta_n}$$

which, under $H_0$: "the model is correctly specified", is asymptotically Gaussian with zero mean and a variance which can be estimated consistently.
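A sketch of this misspecification statistic $T$, again for the exponential working model $f(x\mid\theta) = \theta e^{-\theta x}$ (data made up; when the model is correctly specified, $T$ should be near 0):

```python
def im_statistic(xs):
    """T = (1/n) * sum of squared scores + (1/n) * sum of second derivatives, at the MLE."""
    n = len(xs)
    xbar = sum(xs) / n
    theta = 1.0 / xbar                                       # MLE of the exponential rate
    score_sq = sum((1.0 / theta - x) ** 2 for x in xs) / n   # first term of T
    hess = -1.0 / theta ** 2                                 # second term (constant in x here)
    return score_sq + hess

xs = [0.5, 1.2, 0.3, 2.0, 0.9, 1.5, 0.7, 1.1]               # hypothetical data
t_stat = im_statistic(xs)
```

For this model $T$ reduces to (biased) sample variance minus $\bar x^2$, which is exactly the identity being tested: under a correct exponential model the variance equals the squared mean.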

Rao Test

We have under $H_0: \theta = \theta_0$
$$\frac{1}{\sqrt{n}}\, l'(\theta_0) \xrightarrow{D} \mathcal{N}(0, I(\theta_0))$$
where $l'(\theta) = \frac{\partial \log L(\theta\mid x)}{\partial\theta}$. To prove this, remember that

$$0 = l'(\hat\theta_n) \approx l'(\theta_0) + l''(\theta_0)(\hat\theta_n - \theta_0)$$
thus
$$\frac{1}{\sqrt{n}}\, l'(\theta_0) \approx -\frac{l''(\theta_0)}{\sqrt{n}}(\hat\theta_n - \theta_0) = -\frac{l''(\theta_0)}{n}\,\sqrt{n}\,(\hat\theta_n - \theta_0)$$
where

$$\frac{l''(\theta_0)}{n} \xrightarrow{P} -I(\theta_0) \quad \text{and} \quad \sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ I(\theta_0)^{-1}\right).$$
The result follows from Slutsky's lemma.

This suggests defining the following Rao statistic

$$R_n = \frac{l'(\theta_0)}{\sqrt{n\,I(\theta_0)}}$$

which converges in distribution to a standard normal under $H_0$. A major advantage of the score statistic is that it does not require computing $\hat\theta_n$. We could also replace $I(\theta_0)$ by a consistent estimate

$\hat I(\theta_0)$. However, using $I(\hat\theta_n)$ would defeat the purpose of avoiding the computation of $\hat\theta_n$.

A score test, also called a Rao score test, is any test that rejects $H_0: \theta = \theta_0$ in favor of $H_1: \theta \neq \theta_0$ when $|R_n| \geq z_{\alpha/2}$.

Example: Assume we have $X_i \overset{\text{i.i.d.}}{\sim}$ Bernoulli$(p)$ and we want to test $H_0: p = p_0$ against $H_1: p \neq p_0$. We have for $S_n = \sum_{i=1}^n x_i$

$$f(x\mid p) = p^x (1-p)^{1-x} \ \Rightarrow\ l(p) = S_n \log p + (n - S_n)\log(1-p),$$
$$l'(p) = \frac{S_n}{p} - \frac{n - S_n}{1-p} \ \Rightarrow\ \hat p_n = \frac{S_n}{n},$$
$$l''(p) = -\frac{S_n}{p^2} - \frac{n - S_n}{(1-p)^2}.$$
It follows that

$$l'(p) = \frac{n(\hat p_n - p)}{p(1-p)}, \qquad I(p) = \frac{1}{p(1-p)},$$
and
$$R_n = \frac{l'(p_0)}{\sqrt{n\,I(p_0)}} = \frac{\hat p_n - p_0}{\sqrt{p_0(1-p_0)/n}}.$$

In practice, for complex models, one would use
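A sketch of this Bernoulli score test; note that the variance is evaluated at the null value $p_0$, not at the MLE as in the Wald statistic (the counts in the usage line are illustrative):

```python
from statistics import NormalDist

def rao_test_bernoulli(s_n, n, p0, alpha=0.05):
    """Rao score test of H0: p = p0 against H1: p != p0."""
    p_hat = s_n / n
    r = (p_hat - p0) / ((p0 * (1 - p0) / n) ** 0.5)  # variance evaluated at p0
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{alpha/2}
    return r, abs(r) >= z

# e.g. 60 successes in 100 trials against p0 = 0.5
r, reject = rao_test_bernoulli(s_n=60, n=100, p0=0.5)
```

Here $R_n = 2.0$, slightly smaller than the Wald value $\approx 2.04$ on the same data, since $p_0(1-p_0) = 0.25 > \hat p_n(1-\hat p_n) = 0.24$; both reject at the 5% level.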

$$R_n = \frac{l'(\theta_0)}{\sqrt{n\,\hat I(\theta_0)}}$$
where
$$\hat I(\theta_0) = -\frac{1}{n}\sum_{i=1}^n \left.\frac{\partial^2 \log f(x_i\mid\theta)}{\partial\theta^2}\right|_{\theta_0} \xrightarrow{P} I(\theta_0),$$
which is consistent; the result then still holds thanks to Slutsky's lemma.

However, once more, one has to be careful with misspecified models.

Indeed, we have

$$\frac{1}{\sqrt{n}}\, l'(\theta_0) \approx -\frac{l''(\theta_0)}{\sqrt{n}}(\hat\theta_n - \theta_0) = -\frac{l''(\theta_0)}{n}\,\sqrt{n}\,(\hat\theta_n - \theta_0)$$
where
$$\frac{l''(\theta_0)}{n} \xrightarrow{P} \int \left.\frac{\partial^2 \log f(x\mid\theta)}{\partial\theta^2}\right|_{\theta=\theta_0} g(x)\, dx$$
and

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ \frac{\int \left(\left.\dfrac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta=\theta_0}\right)^2 g(x)\, dx}{\left(\int \left.\dfrac{\partial^2 \log f(x\mid\theta)}{\partial\theta^2}\right|_{\theta=\theta_0} g(x)\, dx\right)^2}\right)$$
so
$$\frac{1}{\sqrt{n}}\, l'(\theta_0) \xrightarrow{D} \mathcal{N}\!\left(0,\ \int \left(\left.\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta_0}\right)^2 g(x)\, dx\right).$$

Thus, if you use the Rao test for a misspecified model, you have to use the following consistent estimate of the asymptotic variance:

$$\frac{1}{n}\sum_{i=1}^n \left(\left.\frac{\partial \log f(x_i\mid\theta)}{\partial\theta}\right|_{\theta_0}\right)^2 \xrightarrow{P} \int \left(\left.\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right|_{\theta_0}\right)^2 g(x)\, dx.$$

If you use the standard estimate $-\frac{1}{n}\sum_{i=1}^n \left.\frac{\partial^2 \log f(x_i\mid\theta)}{\partial\theta^2}\right|_{\theta_0}$, then you will get a wrong result.

Likelihood Ratio Test

The LR test is based on
$$\Delta_n = l(\hat\theta_n) - l(\theta_0) = \log\left(\frac{\sup_{\theta\in\Theta} L(\theta\mid x)}{L(\theta_0\mid x)}\right) \geq 0.$$
If we perform a Taylor expansion of $l'(\hat\theta_n) = 0$ around $\theta_0$, then
$$l'(\theta_0) = -(\hat\theta_n - \theta_0)\, l''(\theta_0) + o(1). \qquad (1)$$
By performing another Taylor expansion, of $l(\hat\theta_n)$ around $\theta_0$, then
$$\Delta_n = (\hat\theta_n - \theta_0)\, l'(\theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^2\, l''(\theta_0) + o(1). \qquad (2)$$
Hence, by substituting (1) in (2),
$$\Delta_n = n(\hat\theta_n - \theta_0)^2\left(-\frac{l''(\theta_0)}{n} + \frac{l''(\theta_0)}{2n}\right) + o(1) = -\frac{1}{2}\, n(\hat\theta_n - \theta_0)^2\, \frac{l''(\theta_0)}{n} + o(1).$$

By Slutsky's theorem, this implies that under $H_0$

$$2\Delta_n \xrightarrow{D} \chi_1^2.$$
Noting that the $1-\alpha$ quantile of a $\chi_1^2$ distribution is $z_{\alpha/2}^2$, we can now define the LR test.

An LR test is any test that rejects $H_0: \theta = \theta_0$ in favor of $H_1: \theta \neq \theta_0$ when $2\Delta_n \geq z_{\alpha/2}^2$ or, equivalently, when $\sqrt{2\Delta_n} \geq z_{\alpha/2}$; i.e. we reject for small values of the likelihood ratio $L(\theta_0\mid x)/\sup_{\theta} L(\theta\mid x)$.

Example: Assume we have $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{P}(\lambda)$ (Poisson) and we want to test $H_0: \lambda = \lambda_0$ against $H_1: \lambda \neq \lambda_0$. We have for $S_n = \sum_{i=1}^n x_i$

$$f(x\mid\lambda) = \exp(-\lambda)\frac{\lambda^x}{x!} \ \Rightarrow\ l(\lambda) = -n\lambda + S_n \log\lambda - \log\left(\prod_{i=1}^n x_i!\right),$$
$$l'(\lambda) = -n + \frac{S_n}{\lambda}, \qquad l''(\lambda) = -\frac{S_n}{\lambda^2}.$$

It follows that $\hat\lambda_n = \frac{S_n}{n}$ and

$$\Delta_n = n(\lambda_0 - \hat\lambda_n) + S_n \log\frac{\hat\lambda_n}{\lambda_0}.$$
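A sketch of this Poisson LR test, comparing $2\Delta_n$ with the $\chi_1^2$ quantile $z_{\alpha/2}^2$ (the counts in the usage line are made up):

```python
import math
from statistics import NormalDist

def lr_test_poisson(s_n, n, lam0, alpha=0.05):
    """2 * Delta_n for H0: lambda = lam0, with lambda_hat = s_n / n."""
    lam_hat = s_n / n
    two_delta = 2 * (n * (lam0 - lam_hat) + s_n * math.log(lam_hat / lam0))
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi^2_1 quantile = z_{alpha/2}^2
    return two_delta, two_delta >= crit

# e.g. 120 events over 100 observations against lambda0 = 1.0
stat, reject = lr_test_poisson(s_n=120, n=100, lam0=1.0)
```

With these numbers $2\Delta_n \approx 3.76 < 3.84$, so the test narrowly fails to reject at the 5% level.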

Test equivalences

We can show that (when there is no misspecification)

$$R_n - W_n \xrightarrow{P} 0, \qquad W_n^2 - 2\Delta_n \xrightarrow{P} 0.$$
The tests are thus asymptotically equivalent, in the sense that under $H_0$ they reach the same decision with probability $\to 1$ as $n \to \infty$. For a finite sample size $n$, they have some relative advantages and disadvantages with respect to one another.
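As a numerical illustration of this asymptotic equivalence, a sketch using the exponential model $f(x\mid\theta) = \theta e^{-\theta x}$ (which also appears as an example later in these slides). The data are a made-up alternating sequence, and the three statistics use the closed forms $W_n = \frac{\sqrt{n}}{\theta_0}(1/\bar X_n - \theta_0)$, $R_n = \sqrt{n}(1 - \theta_0\bar X_n)$ and $2\Delta_n = 2n(\theta_0\bar X_n - 1 - \log\theta_0\bar X_n)$:

```python
import math

def exp_model_statistics(xs, theta0):
    """Wn, Rn and 2*Delta_n for H0: theta = theta0 in the exponential model."""
    n = len(xs)
    xbar = sum(xs) / n
    w = (n ** 0.5 / theta0) * (1.0 / xbar - theta0)                     # Wald
    r = n ** 0.5 * (1.0 - theta0 * xbar)                                # Rao / score
    two_delta = 2 * n * (theta0 * xbar - 1 - math.log(theta0 * xbar))   # LR
    return w, r, two_delta

# hypothetical data: 100 points alternating 0.9 / 1.1, so xbar = 1.0
xs = [0.9 if i % 2 == 0 else 1.1 for i in range(100)]
w, r, two_delta = exp_model_statistics(xs, theta0=0.95)
# Wn^2, Rn^2 and 2*Delta_n are all close, as the equivalence predicts
```

Here $W_n^2 \approx 0.277$, $R_n^2 = 0.25$ and $2\Delta_n \approx 0.259$: already close at $n = 100$, and the gaps shrink as $n$ grows.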

It is easy to create one-sided Wald and score tests.

The score test does not require computing $\hat\theta_n$, whereas the other two tests do. The Wald test is the most easily interpretable and yields immediate confidence intervals. The score test and LR test are invariant under reparametrization, whereas the Wald test is not.

Example: Assume $X_i \overset{\text{i.i.d.}}{\sim} f(x\mid\theta)$ where
$$f(x\mid\theta) = \theta \exp(-x\theta)\, I\{x > 0\}.$$
Then
$$l(\theta) = n\log\theta - \theta\, n\bar X_n,$$
which yields
$$l'(\theta) = n\left(\frac{1}{\theta} - \bar X_n\right) \quad \text{and} \quad l''(\theta) = -\frac{n}{\theta^2}.$$
We also obtain
$$\hat\theta_n = \frac{1}{\bar X_n}, \qquad I(\theta) = \theta^{-2}.$$
It follows that
$$W_n = \frac{\sqrt{n}}{\theta_0}\left(\frac{1}{\bar X_n} - \theta_0\right),$$
$$R_n = \theta_0\sqrt{n}\left(\frac{1}{\theta_0} - \bar X_n\right) = \sqrt{n}\,(1 - \theta_0\bar X_n) = \theta_0\bar X_n\, W_n,$$
$$\Delta_n = n\left(\theta_0\bar X_n - 1 - \log(\theta_0\bar X_n)\right).$$

Example: Assume $X_i \overset{\text{i.i.d.}}{\sim} f(x\mid\theta)$ where
$$f(x\mid\theta) = \theta c^\theta x^{-(\theta+1)}\, I\{x > c\} \quad \text{(Pareto distribution)}$$
where $c$ is a known constant and $\theta$ is unknown. We have for $S_n = \sum_{i=1}^n \log(x_i)$

$$l(\theta) = n(\log\theta + \theta\log c) - (\theta+1)\, S_n,$$
$$l'(\theta) = n\left(\frac{1}{\theta} + \log c\right) - S_n, \qquad l''(\theta) = -\frac{n}{\theta^2}.$$
Thus we have $\hat\theta_n = \dfrac{n}{S_n - n\log c}$, $I(\theta) = \theta^{-2}$, and it follows that
$$W_n = \frac{\sqrt{n}}{\theta_0}\left(\frac{n}{S_n - n\log c} - \theta_0\right),$$
$$R_n = \sqrt{n}\,\theta_0\left(\frac{1}{\theta_0} + \log c - \frac{S_n}{n}\right),$$
$$\Delta_n = n\log\frac{\hat\theta_n}{\theta_0} + (\hat\theta_n - \theta_0)\, n\log c - (\hat\theta_n - \theta_0)\, S_n.$$

Multivariate Generalizations

When $\theta \in \mathbb{R}^d$, the Wald, Rao and LR tests can be straightforwardly extended:

$$W_n := n\,(\hat\theta_n - \theta_0)^T I(\theta_0)\,(\hat\theta_n - \theta_0) \xrightarrow{D} \chi_d^2,$$
$$R_n := \frac{1}{n}\, \nabla l(\theta_0)^T I(\theta_0)^{-1}\, \nabla l(\theta_0) \xrightarrow{D} \chi_d^2,$$
$$\Delta_n := l(\hat\theta_n) - l(\theta_0), \qquad 2\Delta_n \xrightarrow{D} \chi_d^2.$$
Therefore, if $c_\alpha^d$ denotes the $1-\alpha$ quantile of the $\chi_d^2$ distribution, then we reject $H_0$ when $W_n \geq c_\alpha^d$, $R_n \geq c_\alpha^d$, or $2\Delta_n \geq c_\alpha^d$. As in the scalar case, in the Wald test we can substitute a consistent estimate for $I(\theta_0)$.

Example: Assume we observe a vector $X = (X_1, \dots, X_k)$ where $X_j \in \{0, 1\}$ and $\sum_{j=1}^k X_j = 1$, with
$$f(x\mid p_1, \dots, p_{k-1}) = \left(\prod_{j=1}^{k-1} p_j^{x_j}\right)\left(1 - \sum_{j=1}^{k-1} p_j\right)^{x_k}$$

where $p_j > 0$ and $p_k := 1 - \sum_{j=1}^{k-1} p_j < 1$. We have $\theta = (p_1, \dots, p_{k-1})$. We have for $n$ observations $X^1, X^2, \dots, X^n$

$$l(\theta) = \sum_{j=1}^k t_j \log p_j = \sum_{j=1}^{k-1} t_j \log p_j + t_k \log\left(1 - \sum_{j=1}^{k-1} p_j\right),$$
$$\frac{\partial l(\theta)}{\partial p_j} = \frac{t_j}{p_j} - \frac{t_k}{p_k}, \qquad \frac{\partial^2 l(\theta)}{\partial p_j^2} = -\frac{t_j}{p_j^2} - \frac{t_k}{p_k^2},$$
$$\frac{\partial^2 l(\theta)}{\partial p_j\, \partial p_l} = -\frac{t_k}{p_k^2}, \qquad j \neq l < k,$$
where $t_j = \sum_{i=1}^n x_j^i$.

Recall that $X_j$ has a Bernoulli distribution with mean $p_j$, so
$$I(\theta) = \begin{pmatrix} \frac{1}{p_1} + \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\ \frac{1}{p_k} & \frac{1}{p_2} + \frac{1}{p_k} & & \vdots \\ \vdots & & \ddots & \frac{1}{p_k} \\ \frac{1}{p_k} & \cdots & \frac{1}{p_k} & \frac{1}{p_{k-1}} + \frac{1}{p_k} \end{pmatrix}.$$
It follows that

$$W_n = n\,(\hat\theta_n - \theta_0)^T I(\theta_0)\,(\hat\theta_n - \theta_0), \qquad R_n = \frac{1}{n}\, \nabla l(\theta_0)^T I(\theta_0)^{-1}\, \nabla l(\theta_0).$$
After tedious calculations, it can be shown that
$$W_n = R_n = \sum_{j=1}^k \frac{(t_j - n p_j)^2}{n p_j},$$
which is the usual Pearson chi-square test, whereas
$$\Delta_n = \sum_{j=1}^k t_j \log\frac{\hat p_j}{p_j}.$$
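A sketch computing these two multinomial statistics, each referred to the $\chi_{k-1}^2$ distribution; the counts and null probabilities in the usage line are made up for illustration:

```python
import math

def multinomial_statistics(counts, p0):
    """Pearson statistic (= Wn = Rn here) and 2*Delta_n for H0: p = p0."""
    n = sum(counts)
    pearson = sum((t - n * p) ** 2 / (n * p) for t, p in zip(counts, p0))
    # 2*Delta_n with p_hat_j = t_j / n; terms with t_j = 0 contribute 0
    two_delta = 2 * sum(t * math.log(t / (n * p)) for t, p in zip(counts, p0) if t > 0)
    return pearson, two_delta

# e.g. counts (30, 30, 40) over n = 100 against the uniform null (1/3, 1/3, 1/3)
pearson, two_delta = multinomial_statistics([30, 30, 40], [1/3, 1/3, 1/3])
```

Here the Pearson statistic is $2.0$ and $2\Delta_n \approx 1.94$; as the equivalence slides predict, the two are close, and both fall well below the $\chi_2^2$ quantile $\approx 5.99$.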
