Stat 8931 (Aster Models) Lecture Slides Deck 4: Large Sample Theory and Estimating Population Growth Rate
Charles J. Geyer
School of Statistics
University of Minnesota
October 8, 2018

R and License

The version of R used to make these slides is 3.5.1. The version of R package aster used to make these slides is 1.0.2.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).

The Delta Method

The delta method is a method (duh!) of deriving the approximate distribution of a nonlinear function of an estimator from the approximate distribution of the estimator itself. What it does is linearize the nonlinear function.

If $g$ is a nonlinear, differentiable vector-to-vector function, the best linear approximation, which is the Taylor series up through linear terms, is
$$g(y) - g(x) \approx \nabla g(x) (y - x),$$
where $\nabla g(x)$ is the matrix of partial derivatives, sometimes called the Jacobian matrix. If $g_i(x)$ denotes the $i$-th component of the vector $g(x)$, then the $(i, j)$-th component of the Jacobian matrix is $\partial g_i(x) / \partial x_j$.

The Delta Method (cont.)

The delta method is particularly useful when $\hat{\theta}$ is an estimator and $\theta$ is the unknown true (vector) parameter value it estimates, and the delta method says
$$g(\hat{\theta}) - g(\theta) \approx \nabla g(\theta) (\hat{\theta} - \theta).$$
It is not necessary that $\theta$ and $g(\theta)$ be vectors of the same dimension. Hence it is not necessary that $\nabla g(\theta)$ be a square matrix.

The Delta Method (cont.)

The delta method gives good or bad approximations depending on whether the spread of the distribution of $\hat{\theta} - \theta$ is small or large compared to the nonlinearity of the function $g$ in the neighborhood of $\theta$.

The Taylor series approximation the delta method uses is a good approximation for sufficiently small values of $\hat{\theta} - \theta$ and a bad approximation for sufficiently large values of $\hat{\theta} - \theta$. So the overall method is good if those "sufficiently large" values have small probability. And bad otherwise.

The Delta Method (cont.)

As with nearly every application of approximation in statistics, we rarely (if ever) do the (very difficult) analysis to know whether the approximation is good or bad. We just use the delta method and hope it gives good results.

If we are really worried, we can check it using simulation (also called the parametric bootstrap). This method will be illustrated in Deck 7 of the course slides.

The Delta Method (cont.)

The delta method is particularly easy to use when the distribution of $\hat{\theta} - \theta$ is multivariate normal, exactly or approximately. If it is only approximately normal, then this is another approximation in addition to the Taylor series approximation.

The reason this is easy is that normal distributions are determined by their mean vector and variance matrix, and there is a theorem which gives the mean vector and variance matrix of a linear function of a random vector.

The Delta Method (cont.)

Theorem. Suppose $X$ is a random vector, $a$ is a nonrandom vector, and $B$ is a nonrandom matrix such that $a + BX$ makes sense (because $a$, $B$, and $X$ have dimensions such that the indicated vector addition and matrix-vector multiplication are defined). Then
$$E(a + BX) = a + B E(X)$$
$$\operatorname{var}(a + BX) = B \operatorname{var}(X) B^T$$
A proof is given on slides 64-67 of deck 2 of my Stat 5101 course slides.

Another way to say this is that if $E(X) = \mu$ and $\operatorname{var}(X) = V$, then
$$E(a + BX) = a + B \mu$$
$$\operatorname{var}(a + BX) = B V B^T$$
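To make the linearization and the variance formula $B V B^T$ concrete, here is a minimal base-R sketch. The function g, the estimate theta.hat, and the variance matrix V are all made-up placeholders for illustration, not quantities from these slides; the confidence intervals at the end anticipate the normal approximation developed on the next slides.

```r
## minimal delta method sketch (hypothetical numbers, not from the slides)
theta.hat <- c(2.0, 0.5)                      # hypothetical estimate of theta
V <- matrix(c(0.04, 0.01, 0.01, 0.09), 2, 2)  # hypothetical variance matrix of theta.hat

## a nonlinear vector-to-vector function of the parameter
g <- function(theta) c(theta[1] * theta[2], theta[1] / theta[2])

## its Jacobian matrix B = grad g(theta), worked out by hand for this g
jac.g <- function(theta) rbind(
    c(theta[2],     theta[1]),
    c(1 / theta[2], - theta[1] / theta[2]^2)
)

B <- jac.g(theta.hat)
g.hat <- g(theta.hat)
V.g <- B %*% V %*% t(B)   # delta method variance B V B^T
se <- sqrt(diag(V.g))

## approximate 95% confidence intervals for the components of g(theta)
cbind(estimate = g.hat,
      lower = g.hat - qnorm(0.975) * se,
      upper = g.hat + qnorm(0.975) * se)
```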
The Delta Method (cont.)

So suppose $\hat{\theta}$ is normal with mean vector $\theta$ and variance matrix $V$, and write $B = \nabla g(\theta)$. Then $\hat{\theta} - \theta$ has mean vector $0$ and variance matrix $V$, and
$$E\{g(\hat{\theta}) - g(\theta)\} \approx 0$$
$$\operatorname{var}\{g(\hat{\theta}) - g(\theta)\} \approx B V B^T$$

The Delta Method (cont.)

The Delta Method for Approximately Normal Estimators. Suppose $\hat{\theta}$ is approximately normal with mean vector $\theta$ and variance matrix $V(\theta)$. Suppose $g$ is a vector-to-vector function with derivative $\nabla g(\theta) = B(\theta)$. Then $g(\hat{\theta})$ is approximately normal with mean vector $g(\theta)$ and variance matrix $B(\theta) V(\theta) B(\theta)^T$.

The Delta Method (cont.)

An approximate confidence region for $g(\theta)$ is centered at $g(\hat{\theta})$ and has extent determined by $B(\theta) V(\theta) B(\theta)^T$. But we do not know that because we do not know $\theta$ (the true unknown parameter value).

Thus we make a last approximation and plug in $\hat{\theta}$ for $\theta$ in the variance and use $B(\hat{\theta}) V(\hat{\theta}) B(\hat{\theta})^T$. This is known as the plug-in principle. (For the statisticians in the audience, it is an application of Slutsky's theorem.)

The Delta Method (cont.)

Recall from deck 2 of these slides that the maximum likelihood estimator in an unconditional canonical affine submodel of an aster model can be written
$$\hat{\beta} = h^{-1}(M^T y)$$
where $h$ is the transformation from canonical to mean value parameters given by
$$h(\beta) = \nabla c_{\text{sub}}(\beta) = M^T \nabla c(a + M \beta)$$
and has derivative
$$\nabla h(\beta) = \nabla^2 c_{\text{sub}}(\beta) = M^T \nabla^2 c(a + M \beta) M$$

The Delta Method (cont.)

And by the inverse function theorem of real analysis, the derivative of the inverse function is the (matrix) inverse of the derivative of the forward function
$$\nabla h^{-1}(\tau) = \nabla h(\beta)^{-1}, \qquad \text{when } \tau = h(\beta) \text{ and } \beta = h^{-1}(\tau).$$

Fisher Information

The matrix that appeared in the derivative of the canonical-to-mean-value parameter map plays a very important role in likelihood inference.

The observed Fisher information matrix is minus the second derivative matrix of the log likelihood. The expected Fisher information matrix is the expectation of the observed Fisher information matrix.

Fisher Information (cont.)

What Fisher information is depends on what the parameter is (what you are differentiating with respect to). It also depends on what the model is (what the log likelihood is).

Thus, to be pedantically correct, we need decoration to indicate observed or expected, the model, and the parameter. Sometimes we are not so fussy and let the context indicate what we mean.

Fisher Information (cont.)

For log likelihood $l$ for parameter $\varphi$, observed Fisher information (for this model and parameter) is
$$I_{\text{obs}}(\varphi) = -\nabla^2 l(\varphi)$$
and expected Fisher information (for this model and parameter) is
$$I_{\text{exp}}(\varphi) = E_\varphi\{I_{\text{obs}}(\varphi)\} = E_\varphi\{-\nabla^2 l(\varphi)\}$$

Fisher Information (cont.)

If this is the log likelihood for a regular full exponential family
$$l(\varphi) = \langle y, \varphi \rangle - c(\varphi),$$
then
$$I_{\text{obs}}(\varphi) = -\nabla^2 l(\varphi) = \nabla^2 c(\varphi)$$
and since this is a nonrandom quantity, it is its own expectation (expectation of a constant is that constant), so
$$I_{\text{exp}}(\varphi) = \nabla^2 c(\varphi)$$
too.

Fisher Information (cont.)

Thus for a regular full exponential family, in general, and for saturated aster models and their unconditional canonical affine submodels, in particular, there is no difference between observed and expected Fisher information for the unconditional canonical parameter, and we can just write
$$I(\varphi) = \nabla^2 c(\varphi)$$

Fisher Information (cont.)

But even restricting to Fisher information for the unconditional canonical parameter, we distinguish Fisher information for saturated models and canonical affine submodels
$$I_{\text{sat}}(\varphi) = \nabla^2 c(\varphi)$$
$$I_{\text{sub}}(\beta) = \nabla^2 c_{\text{sub}}(\beta) = M^T \nabla^2 c(a + M \beta) M$$
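To make the formula $I_{\text{sub}}(\beta) = M^T \nabla^2 c(a + M \beta) M$ concrete, here is a minimal base-R sketch with made-up placeholders: M stands in for a model matrix and A for the saturated-model second derivative matrix $\nabla^2 c(a + M \beta)$, neither of which comes from an actual aster fit. Inverting I.sub anticipates the "usual" asymptotics on the slides that follow, where inverse Fisher information is the approximate variance of $\hat{\beta}$.

```r
## minimal sketch: submodel Fisher information from saturated Fisher information
## (toy placeholder matrices, not output of any particular aster fit)
set.seed(42)
n <- 5   # number of nodes in the (toy) saturated model
p <- 2   # number of submodel coefficients

M <- matrix(rnorm(n * p), n, p)             # hypothetical model matrix
A <- crossprod(matrix(rnorm(n * n), n, n))  # hypothetical positive definite nabla^2 c(a + M beta)

I.sub <- t(M) %*% A %*% M   # I_sub(beta) = M^T nabla^2 c(a + M beta) M
V.beta <- solve(I.sub)      # inverse Fisher information, the approximate variance of beta hat
sqrt(diag(V.beta))          # approximate standard errors of the components of beta hat
```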
Fisher Information (cont.)

To figure out Fisher information for other parameters, there are two ways to go: (1) write the log likelihood in terms of the new parameter, differentiate it twice, negate it, and take an expectation, if expected Fisher information is wanted; or (2) prove a theorem about how Fisher information transforms under change-of-parameter.

(The latter is just the former done abstractly and once and for all, rather than concretely and repeated for each problem.)

Fisher Information Transforms by Covariance

If $\psi$ is another parameter, then
$$\frac{\partial l(\psi)}{\partial \psi_i} = \sum_k \frac{\partial l(\varphi)}{\partial \varphi_k} \frac{\partial \varphi_k}{\partial \psi_i}$$
(the multivariable chain rule), and
$$\frac{\partial^2 l(\psi)}{\partial \psi_i \, \partial \psi_j} = \sum_k \sum_l \frac{\partial^2 l(\varphi)}{\partial \varphi_k \, \partial \varphi_l} \frac{\partial \varphi_k}{\partial \psi_i} \frac{\partial \varphi_l}{\partial \psi_j} + \sum_k \frac{\partial l(\varphi)}{\partial \varphi_k} \frac{\partial^2 \varphi_k}{\partial \psi_i \, \partial \psi_j}$$
This is somewhat ugly. But if we plug in the MLE for $\varphi$, the second term is zero because $\nabla l(\hat{\varphi}) = 0$ (the first derivative is zero at the maximum). The second term also goes away for expected Fisher information because $E_\varphi\{\nabla l(\varphi)\} = 0$ by a differentiation under the integral sign argument proved in theoretical statistics courses (slides 33-35 and 86 of my 5102 course slides).

Fisher Information Transforms by Covariance (cont.)

This gives the transformation rules
$$I_{\text{exp}, \psi}(\psi) = B(\psi)^T I_{\text{exp}, \varphi}(\varphi) B(\psi)$$
where
$$\varphi = h(\psi)$$
$$B(\psi) = \nabla h(\psi)$$
and
$$I_{\text{obs}, \psi}(\hat{\psi}) = B(\hat{\psi})^T I_{\text{obs}, \varphi}(\hat{\varphi}) B(\hat{\psi})$$
with the same conditions and $\hat{\varphi} = h(\hat{\psi})$.

Fisher Information and MLE

The so-called "usual" asymptotics of maximum likelihood says the asymptotic (large sample, approximate) distribution of the MLE is normal with mean vector the true unknown parameter value and variance inverse Fisher information (either observed or expected, but for that particular model and parameter).

For regular full exponential families, this is an application of the delta method.

Fisher Information and MLE (cont.)

Recall again (from just before we started talking about Fisher information) that for an unconditional canonical affine submodel of an aster model
$$\hat{\beta} = h^{-1}(M^T y)$$
where
$$h(\beta) = \nabla c_{\text{sub}}(\beta) = M^T \nabla c(a + M \beta)$$
$$\nabla h(\beta) = \nabla^2 c_{\text{sub}}(\beta) = M^T \nabla^2 c(a + M \beta) M$$
and
$$\nabla h^{-1}(\tau) = \nabla h(\beta)^{-1}, \qquad \text{when } \tau = h(\beta) \text{ and } \beta = h^{-1}(\tau).$$

Fisher Information and MLE (cont.)

The mean vector and variance matrix of the submodel canonical statistic are
$$E\{M^T y\} = M^T \mu$$
$$\operatorname{var}\{M^T y\} = M^T \nabla^2 c(a + M \beta) M = I(\beta)$$
(the latter is the submodel Fisher information matrix for $\beta$). Assume (more on this later) that the distribution is approximately multivariate normal with this mean vector and variance matrix.
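Here is a minimal base-R sketch of the change-of-parameter rule $I_{\text{exp}, \psi}(\psi) = B(\psi)^T I_{\text{exp}, \varphi}(\varphi) B(\psi)$ derived above. The reparameterization h, its Jacobian jac.h, the parameter value psi, and the Fisher information matrix I.phi are all hypothetical placeholders chosen only for illustration, not quantities from an aster model.

```r
## minimal sketch of the change-of-parameter rule for Fisher information
## (toy reparameterization, not tied to a particular aster fit)
h <- function(psi) c(exp(psi[1]), psi[1] + psi[2])   # hypothetical map phi = h(psi)
jac.h <- function(psi) rbind(c(exp(psi[1]), 0),      # B(psi) = grad h(psi),
                             c(1,           1))      # worked out by hand for this h

psi <- c(0.3, -1.2)                           # hypothetical parameter value
I.phi <- matrix(c(2.0, 0.5, 0.5, 1.5), 2, 2)  # hypothetical Fisher info for phi at phi = h(psi)

B <- jac.h(psi)
I.psi <- t(B) %*% I.phi %*% B   # Fisher information for psi: B(psi)^T I(phi) B(psi)
solve(I.psi)                    # inverse Fisher info, the approximate variance of psi hat
```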