Score Function and Fisher Information

2.2 Score Function and Fisher Information

The MLE of $\theta$ is obtained by maximising the (relative) likelihood function,

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} L(\theta) = \arg\max_{\theta \in \Theta} \tilde{L}(\theta).$$

For numerical reasons, it is often easier to maximise the log-likelihood $l(\theta) = \log L(\theta)$ or the relative log-likelihood $\tilde{l}(\theta) = l(\theta) - l(\hat{\theta}_{ML})$ (cf. Sect. 2.1), which yields the same result since

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} l(\theta) = \arg\max_{\theta \in \Theta} \tilde{l}(\theta).$$

However, the log-likelihood function $l(\theta)$ is of much greater importance than merely simplifying the computation of the MLE. In particular, its first and second derivatives play a central role and have their own names, which are introduced in the following. For simplicity, we assume that $\theta$ is a scalar.

Definition 2.6 (Score function) The first derivative of the log-likelihood function,

$$S(\theta) = \frac{dl(\theta)}{d\theta},$$

is called the score function.

Computation of the MLE is typically done by solving the score equation $S(\theta) = 0$. The second derivative, the curvature, of the log-likelihood function is also of central importance and has its own name.

Definition 2.7 (Fisher information) The negative second derivative of the log-likelihood function,

$$I(\theta) = -\frac{d^2 l(\theta)}{d\theta^2} = -\frac{dS(\theta)}{d\theta},$$

is called the Fisher information. The value of the Fisher information at the MLE $\hat{\theta}_{ML}$, i.e. $I(\hat{\theta}_{ML})$, is the observed Fisher information.

Note that the MLE $\hat{\theta}_{ML}$ is a function of the observed data, which explains the terminology "observed" Fisher information for $I(\hat{\theta}_{ML})$.

Example 2.9 (Normal model) Suppose we have realisations $x_{1:n}$ of a random sample from a normal distribution $N(\mu, \sigma^2)$ with unknown mean $\mu$ and known variance $\sigma^2$. The log-likelihood kernel and score function are then

$$l(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \quad\text{and}\quad S(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu),$$

respectively. The solution of the score equation $S(\mu) = 0$ is the MLE $\hat{\mu}_{ML} = \bar{x}$. Taking another derivative gives the Fisher information

$$I(\mu) = \frac{n}{\sigma^2},$$

which does not depend on $\mu$ and so is equal to the observed Fisher information $I(\hat{\mu}_{ML})$, no matter what the actual value of $\hat{\mu}_{ML}$ is.

Suppose we switch the roles of the two parameters and treat $\mu$ as known and $\sigma^2$ as unknown. We now obtain

$$\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

with Fisher information

$$I(\sigma^2) = \frac{1}{\sigma^6} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2\sigma^4}.$$

The Fisher information of $\sigma^2$ now really depends on its argument $\sigma^2$. The observed Fisher information turns out to be

$$I(\hat{\sigma}^2_{ML}) = \frac{n}{2\hat{\sigma}^4_{ML}}.$$

It is instructive at this stage to adopt a frequentist point of view and to consider the MLE $\hat{\mu}_{ML} = \bar{x}$ from Example 2.9 as a random variable, i.e. $\hat{\mu}_{ML} = \bar{X}$ is now a function of the random sample $X_{1:n}$. We can then easily compute $\operatorname{Var}(\hat{\mu}_{ML}) = \operatorname{Var}(\bar{X}) = \sigma^2/n$ and note that

$$\operatorname{Var}(\hat{\mu}_{ML}) = \frac{1}{I(\hat{\mu}_{ML})}$$

holds. In Sect. 4.2.3 we will see that this equality is approximately valid for other statistical models. Indeed, under certain regularity conditions, the variance $\operatorname{Var}(\hat{\theta}_{ML})$ of the MLE turns out to be approximately equal to the inverse observed Fisher information $1/I(\hat{\theta}_{ML})$, and the accuracy of this approximation improves with increasing sample size $n$. Example 2.9 is a special case where this equality holds exactly for any sample size.
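The identity above can be checked numerically. The following R code is a minimal sketch, not part of the original text: the simulated data, the value of $\sigma^2$ and the helper `loglik` are illustrative assumptions. It evaluates the log-likelihood kernel of Example 2.9, approximates its curvature at the MLE with `optimHess`, and compares the inverse observed Fisher information with $\sigma^2/n$.

```r
set.seed(1)
sigma2 <- 4                                      # known variance (assumed value)
n      <- 50
x      <- rnorm(n, mean = 2, sd = sqrt(sigma2))  # simulated data for illustration

## log-likelihood kernel of mu for known sigma^2 (Example 2.9)
loglik <- function(mu) -sum((x - mu)^2) / (2 * sigma2)

mu_ml <- mean(x)                                 # MLE of mu

## observed Fisher information: negative (numerical) Hessian at the MLE
obs_fisher <- -optimHess(mu_ml, loglik)

obs_fisher            # numerically equal to n / sigma2 = 12.5
1 / obs_fisher        # matches Var(mu_ml) = sigma2 / n = 0.08
```

Because the log-likelihood is exactly quadratic in $\mu$, the numerical curvature agrees with the analytic value $n/\sigma^2$ up to rounding.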
Example 2.10 (Binomial model) The score function of a binomial observation $X = x$ with $X \sim \operatorname{Bin}(n, \pi)$ is

$$S(\pi) = \frac{dl(\pi)}{d\pi} = \frac{x}{\pi} - \frac{n - x}{1 - \pi}$$

and has been derived already in Example 2.1. Taking the derivative of $S(\pi)$ gives the Fisher information

$$I(\pi) = -\frac{d^2 l(\pi)}{d\pi^2} = -\frac{dS(\pi)}{d\pi} = \frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2} = n \left\{ \frac{x/n}{\pi^2} + \frac{(n - x)/n}{(1 - \pi)^2} \right\}.$$

Plugging in the MLE $\hat{\pi}_{ML} = x/n$, we finally obtain the observed Fisher information

$$I(\hat{\pi}_{ML}) = \frac{n}{\hat{\pi}_{ML}(1 - \hat{\pi}_{ML})}.$$

This result is plausible if we take a frequentist point of view and consider the MLE as a random variable. Then

$$\operatorname{Var}(\hat{\pi}_{ML}) = \operatorname{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2} \operatorname{Var}(X) = \frac{1}{n^2}\, n\pi(1 - \pi) = \frac{\pi(1 - \pi)}{n},$$

so the variance of $\hat{\pi}_{ML}$ has the same form as the inverse observed Fisher information; the only difference is that the MLE $\hat{\pi}_{ML}$ is replaced by the true (and unknown) parameter $\pi$. The inverse observed Fisher information is hence an estimate of the variance of the MLE.

How does the observed Fisher information change if we reparametrise our statistical model? Here is the answer to this question.

Result 2.1 (Observed Fisher information after reparametrisation) Let $I_\theta(\hat{\theta}_{ML})$ denote the observed Fisher information of a scalar parameter $\theta$ and suppose that $\phi = h(\theta)$ is a one-to-one transformation of $\theta$. The observed Fisher information $I_\phi(\hat{\phi}_{ML})$ of $\phi$ is then

$$I_\phi(\hat{\phi}_{ML}) = I_\theta(\hat{\theta}_{ML}) \left\{ \frac{dh^{-1}(\hat{\phi}_{ML})}{d\phi} \right\}^2 = I_\theta(\hat{\theta}_{ML}) \left\{ \frac{dh(\hat{\theta}_{ML})}{d\theta} \right\}^{-2}. \qquad (2.3)$$

Proof The transformation $h$ is assumed to be one-to-one, so $\theta = h^{-1}(\phi)$ and $l_\phi(\phi) = l_\theta\{h^{-1}(\phi)\}$. Application of the chain rule gives

$$S_\phi(\phi) = \frac{dl_\phi(\phi)}{d\phi} = \frac{dl_\theta\{h^{-1}(\phi)\}}{d\phi} = \frac{dl_\theta(\theta)}{d\theta} \cdot \frac{dh^{-1}(\phi)}{d\phi} = S_\theta(\theta) \cdot \frac{dh^{-1}(\phi)}{d\phi}. \qquad (2.4)$$

The second derivative of $l_\phi(\phi)$ can be computed using the product and chain rules:

$$I_\phi(\phi) = -\frac{dS_\phi(\phi)}{d\phi} = -\frac{d}{d\phi}\left\{ S_\theta(\theta) \cdot \frac{dh^{-1}(\phi)}{d\phi} \right\}
= -\frac{dS_\theta(\theta)}{d\phi} \cdot \frac{dh^{-1}(\phi)}{d\phi} - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}
= -\frac{dS_\theta(\theta)}{d\theta} \cdot \left\{ \frac{dh^{-1}(\phi)}{d\phi} \right\}^2 - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}
= I_\theta(\theta) \left\{ \frac{dh^{-1}(\phi)}{d\phi} \right\}^2 - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}.$$

Evaluating $I_\phi(\phi)$ at the MLE $\phi = \hat{\phi}_{ML}$ (so $\theta = \hat{\theta}_{ML}$) leads to the first equation in (2.3) (note that $S_\theta(\hat{\theta}_{ML}) = 0$). The second equation follows with

$$\frac{dh^{-1}(\phi)}{d\phi} = \left\{ \frac{dh(\theta)}{d\theta} \right\}^{-1} \quad\text{for}\quad \frac{dh(\theta)}{d\theta} \neq 0. \qquad (2.5)$$

Example 2.11 (Binomial model) In Example 2.6 we saw that the MLE of the odds $\omega = \pi/(1 - \pi)$ is $\hat{\omega}_{ML} = x/(n - x)$. What is the corresponding observed Fisher information? First, we compute the derivative of $h(\pi) = \pi/(1 - \pi)$, which is

$$\frac{dh(\pi)}{d\pi} = \frac{1}{(1 - \pi)^2}.$$

Using the observed Fisher information of $\pi$ derived in Example 2.10, we obtain

$$I_\omega(\hat{\omega}_{ML}) = I_\pi(\hat{\pi}_{ML}) \left\{ \frac{dh(\hat{\pi}_{ML})}{d\pi} \right\}^{-2} = \frac{n}{\hat{\pi}_{ML}(1 - \hat{\pi}_{ML})} \cdot (1 - \hat{\pi}_{ML})^4 = n\, \frac{(1 - \hat{\pi}_{ML})^3}{\hat{\pi}_{ML}} = \frac{(n - x)^3}{nx}.$$

As a function of $x$ for fixed $n$, the observed Fisher information $I_\omega(\hat{\omega}_{ML})$ is monotonically decreasing (the numerator is monotonically decreasing, and the denominator is monotonically increasing). In other words, the observed Fisher information increases with decreasing MLE $\hat{\omega}_{ML}$. The observed Fisher information of the log odds $\phi = \log(\omega)$ can be similarly computed, and we obtain

$$I_\phi(\hat{\phi}_{ML}) = I_\omega(\hat{\omega}_{ML}) \left( \frac{1}{\hat{\omega}_{ML}} \right)^{-2} = \frac{(n - x)^3}{nx} \cdot \frac{x^2}{(n - x)^2} = \frac{x(n - x)}{n}.$$

Note that $I_\phi(\hat{\phi}_{ML})$ does not change if we redefine successes as failures and vice versa. This is also the case for the observed Fisher information $I_\pi(\hat{\pi}_{ML})$ but not for $I_\omega(\hat{\omega}_{ML})$.
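Result 2.1 can be verified numerically for Example 2.11. The following R sketch is illustrative only and not from the original text: the data $x = 7$, $n = 20$ are invented. It transforms the observed Fisher information of $\pi$ to the odds scale via (2.3) and compares the result with both the closed form $(n-x)^3/(nx)$ and a direct numerical curvature of the log-likelihood written in terms of $\omega$.

```r
x <- 7; n <- 20                         # invented data for illustration

pi_ml    <- x / n                       # MLE of pi
omega_ml <- x / (n - x)                 # MLE of the odds omega = pi / (1 - pi)

## observed Fisher information of pi (Example 2.10)
I_pi <- n / (pi_ml * (1 - pi_ml))

## transform to the odds scale via (2.3): I_omega = I_pi * {dh(pi_ml)/dpi}^(-2)
dh_dpi  <- 1 / (1 - pi_ml)^2
I_omega <- I_pi / dh_dpi^2

## direct check: negative numerical curvature of the log-likelihood in omega,
## using l(omega) = x * log(omega) - n * log(1 + omega)
loglik_omega <- function(omega) x * log(omega) - n * log(1 + omega)
I_omega_num  <- -optimHess(omega_ml, loglik_omega)

c(closed_form = (n - x)^3 / (n * x), via_2.3 = I_omega, numerical = I_omega_num)
```

All three numbers agree (about 15.69 for these invented data), illustrating that (2.3) simply rescales the curvature by the squared derivative of the transformation.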
2.3 Numerical Computation of the Maximum Likelihood Estimate

Explicit formulas for the MLE and the observed Fisher information can typically only be derived in simple models. In more complex models, numerical techniques have to be applied to compute the maximum and the curvature of the log-likelihood function. We first describe the application of general-purpose optimisation algorithms to this setting and will discuss the Expectation-Maximisation (EM) algorithm in Sect. 2.3.2.

2.3.1 Numerical Optimisation

Application of the Newton–Raphson algorithm (cf. Appendix C.1.3) requires the first two derivatives of the function to be maximised, so for maximising the log-likelihood function we need the score function and the Fisher information. Iterative application of the equation

$$\theta^{(t+1)} = \theta^{(t)} + \frac{S(\theta^{(t)})}{I(\theta^{(t)})}$$

gives, after convergence (i.e. $\theta^{(t+1)} = \theta^{(t)}$), the MLE $\hat{\theta}_{ML}$. As a by-product, the observed Fisher information $I(\hat{\theta}_{ML})$ can also be extracted.

To apply the Newton–Raphson algorithm in R, the function optim can conveniently be used; see Appendix C.1.3 for details. We need to pass the log-likelihood function as an argument to optim. Explicitly passing the score function to optim typically accelerates convergence. If the derivative is not available, it can sometimes be computed symbolically using the R function deriv. Generally, no derivatives need to be passed to optim because it can approximate them numerically. Particularly useful is the option hessian = TRUE, in which case optim will also return the negative observed Fisher information.
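The following R sketch illustrates both approaches for the binomial log-likelihood of Example 2.10. It is not the book's own implementation: the data $x = 7$, $n = 20$, the starting value and the box constraints are invented for illustration.

```r
x <- 7; n <- 20                          # invented data for illustration

loglik <- function(pi) x * log(pi) + (n - x) * log(1 - pi)
score  <- function(pi) x / pi - (n - x) / (1 - pi)
fisher <- function(pi) x / pi^2 + (n - x) / (1 - pi)^2

## Newton-Raphson: theta^(t+1) = theta^(t) + S(theta^(t)) / I(theta^(t))
pi_t <- 0.5                              # starting value
for (iter in 1:25) {
  pi_new <- pi_t + score(pi_t) / fisher(pi_t)
  if (abs(pi_new - pi_t) < 1e-10) break
  pi_t <- pi_new
}
pi_t                                     # converges to the MLE x / n = 0.35

## The same with optim(); fnscale = -1 requests maximisation of the log-likelihood
fit <- optim(par = 0.5, fn = loglik, gr = score,
             method = "L-BFGS-B", lower = 1e-3, upper = 1 - 1e-3,
             control = list(fnscale = -1), hessian = TRUE)
fit$par                                  # MLE
-fit$hessian                             # observed Fisher information
fisher(fit$par)                          # analytic value n / (pi_ml * (1 - pi_ml)) at the MLE
```

For this example the Newton–Raphson iteration converges after a couple of steps, and the negative Hessian returned by optim agrees with the analytic observed Fisher information from Example 2.10.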