Wald (And Score) Tests
Total Page:16
File Type:pdf, Size:1020Kb
Wald (and Score) Tests 1 / 18 Vector of MLEs is Asymptotically Normal That is, Multivariate Normal This yields I Confidence intervals I Z-tests of H0 : θj = θ0 I Wald tests I Score Tests I Indirectly, the Likelihood Ratio tests 2 / 18 Under Regularity Conditions (Thank you, Mr. Wald) a.s. I θbn → θ √ d −1 I n(θbn − θ) → T ∼ Nk 0, I(θ) 1 −1 I So we say that θbn is asymptotically Nk θ, n I(θ) . I I(θ) is the Fisher Information in one observation. I A k × k matrix ∂2 I(θ) = E[− log f(Y ; θ)] ∂θi∂θj I The Fisher Information in the whole sample is nI(θ) 3 / 18 H0 : Cθ = h Suppose θ = (θ1, . θ7), and the null hypothesis is I θ1 = θ2 I θ6 = θ7 1 1 I 3 (θ1 + θ2 + θ3) = 3 (θ4 + θ5 + θ6) We can write null hypothesis in matrix form as θ1 θ2 1 −1 0 0 0 0 0 θ3 0 0 0 0 0 0 1 −1 θ4 = 0 1 1 1 −1 −1 −1 0 θ5 0 θ6 θ7 4 / 18 p Suppose H0 : Cθ = h is True, and Id(θ)n → I(θ) By Slutsky 6a (Continuous mapping), √ √ d −1 0 n(Cθbn − Cθ) = n(Cθbn − h) → CT ∼ Nk 0, CI(θ) C and −1 p −1 Id(θ)n → I(θ) . Then by Slutsky’s (6c) Stack Theorem, √ ! n(Cθbn − h) d CT → . −1 I(θ)−1 Id(θ)n Finally, by Slutsky 6a again, −1 0 0 −1 Wn = n(Cθb − h) (CId(θ)n C ) (Cθb − h) →d W = (CT − 0)0(CI(θ)−1C0)−1(CT − 0) ∼ χ2(r) 5 / 18 The Wald Test Statistic −1 0 0 −1 Wn = n(Cθbn − h) (CId(θ)n C ) (Cθbn − h) I Again, null hypothesis is H0 : Cθ = h I Matrix C is r × k, r ≤ k, rank r I All we need is a consistent estimator of I(θ) I I(θb) would do I But it’s inconvenient I Need to compute partial derivatives and expected values in ∂2 I(θ) = E[− log f(Y ; θ)] ∂θi∂θj 6 / 18 Observed Fisher Information I To find θbn, minimize the minus log likelihood. I Matrix of mixed partial derivatives of the minus log likelihood is " n # ∂2 ∂2 X − `(θ, Y) = − log f(Y ; θ) ∂θ ∂θ ∂θ ∂θ i i j i j i=1 I So by the Strong Law of Large Numbers, " n # 1 X ∂2 J (θ) = − log f(Y ; θ) n n ∂θ ∂θ i i=1 i j ∂2 a.s.→ E − log f(Y ; θ) = I(θ) ∂θi∂θj 7 / 18 A Consistent Estimator of I(θ) Just substitute θbn for θ " n # 1 X ∂2 J n(θbn) = − log f(Yi; θ) n ∂θi∂θj i=1 θ=θbn ∂2 a.s.→ E − log f(Y ; θ) = I(θ) ∂θi∂θj I Convergence is believable but not trivial. I Now we have a consistent estimator, more convenient than I(θbn): Use Id(θ)n = J n(θbn) 8 / 18 Approximate the Asymptotic Covariance Matrix 1 −1 I Asymptotic covariance matrix of θbn is n I(θ) . I Approximate it with 1 −1 Vb n = J n(θbn) n !−1 1 1 ∂2 = − `(θ, Y) n n ∂θ ∂θ i j θ=θbn !−1 ∂2 = − `(θ, Y) ∂θ ∂θ i j θ=θbn 9 / 18 Compare Hessian and (Estimated) Asymptotic Covariance Matrix −1 h 2 i V = − ∂ `(θ, Y) I b n ∂θi∂θj θ=θbn h 2 i Hessian at MLE is H = − ∂ `(θ, Y) I ∂θi∂θj θ=θbn I So to estimate the asymptotic covariance matrix of θ, just invert the Hessian. I The Hessian is usually available as a by-product of numerical search for the MLE. 10 / 18 Connection to Numerical Optimization I Suppose we are minimizing the minus log likelihood by a direct search. I We have reached a point where the gradient is close to zero. Is this point a minimum? I The Hessian is a matrix of mixed partial derivatives. If all its eigenvalues are positive at a point, the function is concave up there. I Its the multivariable second derivative test. I The Hessian at the MLE is exactly the observed Fisher information matrix. I Partial derivatives are often approximated by the slopes of secant lines – no need to calculate them. 11 / 18 So to find the estimated asymptotic covariance matrix I Minimize the minus log likelihood numerically. I The Hessian at the place where the search stops is exactly the observed Fisher information matrix. I Invert it to get Vb n. I This is so handy that sometimes we do it even when a closed-form expression for the MLE is available. 12 / 18 Estimated Asymptotic Covariance Matrix Vb n is Useful I Asymptotic standard error of θbj is the square root of the jth diagonal element. I Denote the asymptotic standard error of θj by S . b θbj I Thus θbj − θj Z = j S θbj is approximately standard normal. 13 / 18 Confidence Intervals and Z-tests θbj −θj Have Zj = S approximately standard normal, yielding θbj I Confidence intervals: θj ± S z b θbj α/2 I Test H0 : θj = θ0 using θbj − θ0 Z = S θbj 14 / 18 And Wald Tests 1 −1 Recalling Vb n = n J n(θbn) −1 0 0 −1 Wn = n(Cθbn − h) (CId(θ)n C ) (Cθbn − h) 0 −1 0 −1 = n(Cθbn − h) (CJ n(θbn) C ) (Cθbn − h) −1 0 0 = n(Cθbn − h) C(nVb n)C (Cθbn − h) −1 0 1 0 = n(Cθbn − h) CVb nC (Cθbn − h) n −1 0 0 = (Cθbn − h) CVb nC (Cθbn − h) 15 / 18 Score Tests Thank you Mr. Rao I θb is the MLE of θ, size k × 1 I θb0 is the MLE under H0, size k × 1 u(θ) = ( ∂` ,... ∂` )0 is the gradient. I ∂θ1 ∂θk I u(θb) = 0 I If H0 is true, u(θb0) should also be close to zero. I Under H0 for large N, u(θb0) ∼ Nk(0, J (θ)), approximately. I And, 0 −1 2 S = u(θb0) J (θb0) u(θb0) ∼ χ (r) Where r is the number of restrictions imposed by H0 16 / 18 Three Big Tests I Score Tests: Fit just the restricted model I Wald Tests: Fit just the unrestricted model I Likelihood Ratio Tests: Fit Both 17 / 18 Comparing Likelihood Ratio and Wald I Asymptotically equivalent under H0, meaning p (Wn − Gn) → 0 I Under H1, I Both have approximately the same distribution (non-central chi-square) I Both go to infinity as n → ∞ I But values are not necessarily close I Likelihood ratio test tends to get closer to the right Type I error rate for small samples. I Wald can be more convenient when testing lots of hypotheses, because you only need to fit the model once. I Wald can be more convenient if it’s a lot of work to write the restricted likelihood. 18 / 18.