Wald (and Score) Tests
Vector of MLEs is Asymptotically Normal (That is, Multivariate Normal)

This yields

- Confidence intervals
- $Z$-tests of $H_0: \theta_j = \theta_0$
- Wald tests
- Score tests
- Indirectly, the likelihood ratio tests
Under Regularity Conditions (Thank you, Mr. Wald)

- $\hat{\theta}_n \stackrel{a.s.}{\rightarrow} \theta$
- $\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{d}{\rightarrow} T \sim N_k\left(0, I(\theta)^{-1}\right)$
- So we say that $\hat{\theta}_n$ is asymptotically $N_k\left(\theta, \frac{1}{n}I(\theta)^{-1}\right)$.
- $I(\theta)$ is the Fisher information in one observation: a $k \times k$ matrix
$$I(\theta) = \left[E\left(-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y;\theta)\right)\right]$$
- The Fisher information in the whole sample is $nI(\theta)$.
$H_0: C\theta = h$

Suppose $\theta = (\theta_1, \ldots, \theta_7)$, and the null hypothesis is

- $\theta_1 = \theta_2$
- $\theta_6 = \theta_7$
- $\frac{1}{3}(\theta_1 + \theta_2 + \theta_3) = \frac{1}{3}(\theta_4 + \theta_5 + \theta_6)$

We can write the null hypothesis in matrix form as

$$\left(\begin{array}{rrrrrrr}
1 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 \\
1 & 1 & 1 & -1 & -1 & -1 & 0
\end{array}\right)
\left(\begin{array}{c}
\theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7
\end{array}\right)
= \left(\begin{array}{c} 0 \\ 0 \\ 0 \end{array}\right)$$
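As a quick sanity check, the three restrictions above can be encoded as the rows of $C$ and verified numerically. This is a sketch in NumPy; the particular $\theta$ used is just a hypothetical point that satisfies $H_0$ (all components equal).

```python
import numpy as np

# Rows of C encode the three restrictions on theta = (theta_1, ..., theta_7).
C = np.array([
    [1, -1, 0,  0,  0,  0,  0],   # theta_1 - theta_2 = 0
    [0,  0, 0,  0,  0,  1, -1],   # theta_6 - theta_7 = 0
    [1,  1, 1, -1, -1, -1,  0],   # (theta_1+theta_2+theta_3) - (theta_4+theta_5+theta_6) = 0
])
h = np.zeros(3)

# A hypothetical theta satisfying H0: every component equal, so C theta = h.
theta_null = np.ones(7)
residual = C @ theta_null - h     # all zeros under H0
```

Note the third row is the mean restriction multiplied through by 3, exactly as on the slide.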
Suppose $H_0: C\theta = h$ is True, and $\widehat{I(\theta)}_n \stackrel{p}{\rightarrow} I(\theta)$

By Slutsky 6a (Continuous Mapping),

$$\sqrt{n}(C\hat{\theta}_n - C\theta) = \sqrt{n}(C\hat{\theta}_n - h) \stackrel{d}{\rightarrow} CT \sim N_r\left(0, \, CI(\theta)^{-1}C'\right)$$

and $\widehat{I(\theta)}_n^{-1} \stackrel{p}{\rightarrow} I(\theta)^{-1}$. Then by Slutsky's (6c) Stack Theorem,

$$\left(\begin{array}{c}
\sqrt{n}(C\hat{\theta}_n - h) \\
\widehat{I(\theta)}_n^{-1}
\end{array}\right)
\stackrel{d}{\rightarrow}
\left(\begin{array}{c}
CT \\
I(\theta)^{-1}
\end{array}\right)$$

Finally, by Slutsky 6a again,

$$W_n = n(C\hat{\theta}_n - h)'\left(C\widehat{I(\theta)}_n^{-1}C'\right)^{-1}(C\hat{\theta}_n - h)
\stackrel{d}{\rightarrow} W = (CT - 0)'\left(CI(\theta)^{-1}C'\right)^{-1}(CT - 0) \sim \chi^2(r)$$
The Wald Test Statistic

$$W_n = n(C\hat{\theta}_n - h)'\left(C\widehat{I(\theta)}_n^{-1}C'\right)^{-1}(C\hat{\theta}_n - h)$$

- Again, the null hypothesis is $H_0: C\theta = h$.
- The matrix $C$ is $r \times k$, $r \leq k$, rank $r$.
- All we need is a consistent estimator of $I(\theta)$.
- $I(\hat{\theta})$ would do.
- But it's inconvenient: we would need to compute the partial derivatives and expected values in
$$I(\theta) = \left[E\left(-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y;\theta)\right)\right]$$
Observed Fisher Information

- To find $\hat{\theta}_n$, minimize the minus log likelihood.
- The matrix of mixed partial derivatives of the minus log likelihood is
$$\left[-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta,\mathbf{Y})\right]
= \left[\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y_i;\theta)\right]$$
- So by the Strong Law of Large Numbers,
$$J_n(\theta) = \left[\frac{1}{n}\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y_i;\theta)\right]
\stackrel{a.s.}{\rightarrow} \left[E\left(-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y;\theta)\right)\right] = I(\theta)$$
A Consistent Estimator of $I(\theta)$

Just substitute $\hat{\theta}_n$ for $\theta$:

$$J_n(\hat{\theta}_n) = \left[\frac{1}{n}\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y_i;\theta)\Bigg|_{\theta=\hat{\theta}_n}\right]
\stackrel{a.s.}{\rightarrow} \left[E\left(-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log f(Y;\theta)\right)\right] = I(\theta)$$

- The convergence is believable, but not trivial.
- Now we have a consistent estimator that is more convenient than $I(\hat{\theta}_n)$: use $\widehat{I(\theta)}_n = J_n(\hat{\theta}_n)$.
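A minimal one-parameter sketch of this estimator. For a hypothetical exponential model with density $f(y;\lambda) = \lambda e^{-\lambda y}$, each observation contributes $-\frac{\partial^2}{\partial\lambda^2}\log f(y;\lambda) = 1/\lambda^2$, so $J_n(\hat{\lambda}_n)$ has a simple closed form; the data are made up for illustration.

```python
# Exponential(lambda) model: log f(y; lambda) = log(lambda) - lambda * y,
# so the per-observation curvature -d^2/d lambda^2 log f is 1/lambda^2.
y = [0.8, 1.3, 0.4, 2.1, 0.9, 1.6, 0.7, 1.2]   # hypothetical data
n = len(y)

lam_hat = n / sum(y)                            # MLE: 1 / sample mean

# J_n(lam_hat): average of the per-observation second derivatives at the MLE.
J_n = sum(1.0 / lam_hat ** 2 for _ in y) / n

# For this model J_n(lam_hat) equals I(lam_hat) = 1/lam_hat^2 exactly;
# in general the two only agree asymptotically.
```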
Approximate the Asymptotic Covariance Matrix

- The asymptotic covariance matrix of $\hat{\theta}_n$ is $\frac{1}{n}I(\theta)^{-1}$.
- Approximate it with
$$\widehat{V}_n = \frac{1}{n}J_n(\hat{\theta}_n)^{-1}
= \frac{1}{n}\left(\frac{1}{n}\left[-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta,\mathbf{Y})\right]\Bigg|_{\theta=\hat{\theta}_n}\right)^{-1}
= \left(\left[-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta,\mathbf{Y})\right]\Bigg|_{\theta=\hat{\theta}_n}\right)^{-1}$$
Compare Hessian and (Estimated) Asymptotic Covariance Matrix

- $\widehat{V}_n = \left(\left[-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta,\mathbf{Y})\right]\Big|_{\theta=\hat{\theta}_n}\right)^{-1}$
- The Hessian (of the minus log likelihood) at the MLE is $H = \left[-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell(\theta,\mathbf{Y})\right]\Big|_{\theta=\hat{\theta}_n}$
- So to estimate the asymptotic covariance matrix of $\hat{\theta}_n$, just invert the Hessian.
- The Hessian is usually available as a by-product of a numerical search for the MLE.
Connection to Numerical Optimization

- Suppose we are minimizing the minus log likelihood by a direct search.
- We have reached a point where the gradient is close to zero. Is this point a minimum?
- The Hessian is the matrix of mixed partial derivatives. If all its eigenvalues are positive at a point, the function is concave up there.
- It's the multivariable second derivative test.
- The Hessian at the MLE is exactly the observed Fisher information matrix.
- Partial derivatives are often approximated by the slopes of secant lines, so there is no need to calculate them analytically.
So to Find the Estimated Asymptotic Covariance Matrix

- Minimize the minus log likelihood numerically.
- The Hessian at the place where the search stops is exactly the observed Fisher information matrix.
- Invert it to get $\widehat{V}_n$.
- This is so handy that sometimes we do it even when a closed-form expression for the MLE is available.
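The recipe above can be sketched end to end. This is a hypothetical illustration with simulated normal data; the Nelder-Mead search and the central-difference Hessian stand in for whatever optimizer and derivative approximation your software actually uses.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example: simulated N(mu, sigma^2) data with mu = 5, sigma = 2.
rng = np.random.default_rng(2023)
y = rng.normal(5.0, 2.0, size=500)
n = len(y)

def neg_loglik(theta):
    """Minus log likelihood of an i.i.d. normal sample."""
    mu, sigma = theta
    if sigma <= 0:                        # keep the search inside the parameter space
        return np.inf
    return np.sum(0.5 * np.log(2 * np.pi) + np.log(sigma)
                  + (y - mu) ** 2 / (2 * sigma ** 2))

# Step 1: minimize the minus log likelihood by direct search.
res = minimize(neg_loglik, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
theta_hat = res.x

# Step 2: Hessian at the stopping point, approximated by central differences
# (slopes of secant lines, as on the previous slide).
def num_hessian(f, x, eps=1e-4):
    k = len(x)
    H = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            ea = np.zeros(k); ea[a] = eps
            eb = np.zeros(k); eb[b] = eps
            H[a, b] = (f(x + ea + eb) - f(x + ea - eb)
                       - f(x - ea + eb) + f(x - ea - eb)) / (4 * eps ** 2)
    return H

# Step 3: invert the observed Fisher information to get V-hat.
V_hat = np.linalg.inv(num_hessian(neg_loglik, theta_hat))
se_mu = np.sqrt(V_hat[0, 0])              # should be close to sigma_hat / sqrt(n)
```

For this model the closed-form answer is known ($\hat{\mu} = \bar{y}$, with asymptotic standard error $\hat{\sigma}/\sqrt{n}$), which makes it easy to check that the numerical recipe lands in the right place.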
Estimated Asymptotic Covariance Matrix $\widehat{V}_n$ is Useful

- The asymptotic standard error of $\hat{\theta}_j$ is the square root of the $j$th diagonal element of $\widehat{V}_n$.
- Denote the asymptotic standard error of $\hat{\theta}_j$ by $S_{\hat{\theta}_j}$.
- Thus
$$Z_j = \frac{\hat{\theta}_j - \theta_j}{S_{\hat{\theta}_j}}$$
is approximately standard normal.
Confidence Intervals and $Z$-tests

We have $Z_j = \frac{\hat{\theta}_j - \theta_j}{S_{\hat{\theta}_j}}$ approximately standard normal, yielding

- Confidence intervals: $\hat{\theta}_j \pm z_{\alpha/2} \, S_{\hat{\theta}_j}$
- Tests of $H_0: \theta_j = \theta_0$ using
$$Z = \frac{\hat{\theta}_j - \theta_0}{S_{\hat{\theta}_j}}$$
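A minimal numeric sketch of both formulas, with made-up values for $\hat{\theta}_j$ and its standard error.

```python
# Hypothetical output from a fitted model: an estimate and its asymptotic
# standard error (square root of a diagonal element of V-hat).
theta_hat_j = 1.8
se_j = 0.4
z_crit = 1.959963984540054            # z_{alpha/2} for a 95% interval

# 95% confidence interval for theta_j.
ci = (theta_hat_j - z_crit * se_j, theta_hat_j + z_crit * se_j)

# Z statistic for H0: theta_j = 0; compare to +/- z_crit.
Z = (theta_hat_j - 0.0) / se_j
```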
And Wald Tests

Recalling $\widehat{V}_n = \frac{1}{n}J_n(\hat{\theta}_n)^{-1}$,

$$\begin{aligned}
W_n &= n(C\hat{\theta}_n - h)'\left(C\widehat{I(\theta)}_n^{-1}C'\right)^{-1}(C\hat{\theta}_n - h) \\
&= n(C\hat{\theta}_n - h)'\left(CJ_n(\hat{\theta}_n)^{-1}C'\right)^{-1}(C\hat{\theta}_n - h) \\
&= n(C\hat{\theta}_n - h)'\left(C(n\widehat{V}_n)C'\right)^{-1}(C\hat{\theta}_n - h) \\
&= n \, \frac{1}{n}(C\hat{\theta}_n - h)'\left(C\widehat{V}_nC'\right)^{-1}(C\hat{\theta}_n - h) \\
&= (C\hat{\theta}_n - h)'\left(C\widehat{V}_nC'\right)^{-1}(C\hat{\theta}_n - h)
\end{aligned}$$
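The last line of the derivation is easy to compute directly. Here $\hat{\theta}$, $\widehat{V}_n$, $C$, and $h$ are all made-up values for a $k = 3$, $r = 1$ illustration (testing $\theta_1 = \theta_2$); in practice they would come from fitting the unrestricted model.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical estimates for a 3-parameter model.
theta_hat = np.array([1.2, 0.9, 2.5])
V_hat = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.05, 0.00],
                  [0.00, 0.00, 0.03]])

# H0: theta_1 = theta_2, i.e. one restriction (r = 1).
C = np.array([[1.0, -1.0, 0.0]])
h = np.array([0.0])

d = C @ theta_hat - h
W = d @ np.linalg.inv(C @ V_hat @ C.T) @ d   # (C theta-hat - h)' (C V-hat C')^{-1} (C theta-hat - h)
p_value = chi2.sf(W, df=C.shape[0])          # df = r, the number of rows of C
```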
Score Tests (Thank you, Mr. Rao)

- $\hat{\theta}$ is the MLE of $\theta$, size $k \times 1$.
- $\hat{\theta}_0$ is the MLE under $H_0$, size $k \times 1$.
- $u(\theta) = \left(\frac{\partial \ell}{\partial \theta_1}, \ldots, \frac{\partial \ell}{\partial \theta_k}\right)'$ is the gradient.
- $u(\hat{\theta}) = 0$.
- If $H_0$ is true, $u(\hat{\theta}_0)$ should also be close to zero.
- Under $H_0$, for large $n$, $u(\hat{\theta}_0) \sim N_k(0, \mathcal{J}(\theta))$, approximately.
- And
$$S = u(\hat{\theta}_0)'\mathcal{J}(\hat{\theta}_0)^{-1}u(\hat{\theta}_0) \sim \chi^2(r)$$
where $r$ is the number of restrictions imposed by $H_0$.
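A one-parameter sketch (so $k = r = 1$), with $\mathcal{J}$ taken to be the whole-sample observed information. For a hypothetical exponential($\lambda$) sample testing $H_0: \lambda = \lambda_0$, the restricted MLE is $\lambda_0$ itself, the score is $u(\lambda_0) = n/\lambda_0 - \sum y_i$, and the observed information is $n/\lambda_0^2$; the data are made up.

```python
# Score test for an exponential(lambda) sample, H0: lambda = lam0.
# Only the restricted (null) model needs to be "fit" -- here trivially lam0.
y = [0.8, 1.3, 0.4, 2.1, 0.9, 1.6, 0.7, 1.2]   # hypothetical data
n = len(y)
lam0 = 1.0

u = n / lam0 - sum(y)        # score: derivative of the log likelihood at lam0
J = n / lam0 ** 2            # whole-sample observed information at lam0
S = u ** 2 / J               # u' J^{-1} u with k = 1; compare to chi-square(1)
```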
Three Big Tests

- Score tests: fit just the restricted model.
- Wald tests: fit just the unrestricted model.
- Likelihood ratio tests: fit both.
Comparing Likelihood Ratio and Wald

- Asymptotically equivalent under $H_0$, meaning $(W_n - G_n) \stackrel{p}{\rightarrow} 0$.
- Under $H_1$,
  - Both have approximately the same distribution (non-central chi-square).
  - Both go to infinity as $n \rightarrow \infty$.
  - But the values are not necessarily close.
- The likelihood ratio test tends to come closer to the right Type I error rate in small samples.
- Wald can be more convenient when testing lots of hypotheses, because you only need to fit the model once.
- Wald can be more convenient if it's a lot of work to write the restricted likelihood.