Score Function and Fisher Information

2.2 Score Function and Fisher Information

The MLE of $\theta$ is obtained by maximising the (relative) likelihood function,

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} L(\theta) = \arg\max_{\theta \in \Theta} \tilde{L}(\theta).$$

For numerical reasons, it is often easier to maximise the log-likelihood $l(\theta) = \log L(\theta)$ or the relative log-likelihood $\tilde{l}(\theta) = l(\theta) - l(\hat{\theta}_{ML})$ (cf. Sect. 2.1), which yields the same result since

$$\hat{\theta}_{ML} = \arg\max_{\theta \in \Theta} l(\theta) = \arg\max_{\theta \in \Theta} \tilde{l}(\theta).$$

However, the log-likelihood function $l(\theta)$ is of much greater importance than merely simplifying the computation of the MLE. In particular, its first and second derivatives play a central role and have their own names, which are introduced in the following. For simplicity, we assume that $\theta$ is a scalar.

Definition 2.6 (Score function) The first derivative of the log-likelihood function,

$$S(\theta) = \frac{dl(\theta)}{d\theta},$$

is called the score function.

Computation of the MLE is typically done by solving the score equation $S(\theta) = 0$. The second derivative, the curvature, of the log-likelihood function is also of central importance and has its own name.

Definition 2.7 (Fisher information) The negative second derivative of the log-likelihood function,

$$I(\theta) = -\frac{d^2 l(\theta)}{d\theta^2} = -\frac{dS(\theta)}{d\theta},$$

is called the Fisher information. The value of the Fisher information at the MLE $\hat{\theta}_{ML}$, i.e. $I(\hat{\theta}_{ML})$, is the observed Fisher information.

Note that the MLE $\hat{\theta}_{ML}$ is a function of the observed data, which explains the terminology "observed" Fisher information for $I(\hat{\theta}_{ML})$.

Example 2.9 (Normal model) Suppose we have realisations $x_{1:n}$ of a random sample from a normal distribution $N(\mu, \sigma^2)$ with unknown mean $\mu$ and known variance $\sigma^2$. The log-likelihood kernel and score function are then

$$l(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \quad\text{and}\quad S(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu),$$

respectively. The solution of the score equation $S(\mu) = 0$ is the MLE $\hat{\mu}_{ML} = \bar{x}$. Taking another derivative gives the Fisher information

$$I(\mu) = \frac{n}{\sigma^2},$$

which does not depend on $\mu$ and so is equal to the observed Fisher information $I(\hat{\mu}_{ML})$, no matter what the actual value of $\hat{\mu}_{ML}$ is.

Suppose we switch the roles of the two parameters and treat $\mu$ as known and $\sigma^2$ as unknown. We now obtain

$$\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

with Fisher information

$$I(\sigma^2) = \frac{1}{\sigma^6} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2\sigma^4}.$$

The Fisher information of $\sigma^2$ now really depends on its argument $\sigma^2$. The observed Fisher information turns out to be

$$I(\hat{\sigma}^2_{ML}) = \frac{n}{2\hat{\sigma}^4_{ML}}.$$

It is instructive at this stage to adopt a frequentist point of view and to consider the MLE $\hat{\mu}_{ML} = \bar{x}$ from Example 2.9 as a random variable, i.e. $\hat{\mu}_{ML} = \bar{X}$ is now a function of the random sample $X_{1:n}$. We can then easily compute $\operatorname{Var}(\hat{\mu}_{ML}) = \operatorname{Var}(\bar{X}) = \sigma^2/n$ and note that

$$\operatorname{Var}(\hat{\mu}_{ML}) = \frac{1}{I(\hat{\mu}_{ML})}$$

holds. In Sect. 4.2.3 we will see that this equality is approximately valid for other statistical models. Indeed, under certain regularity conditions, the variance $\operatorname{Var}(\hat{\theta}_{ML})$ of the MLE turns out to be approximately equal to the inverse observed Fisher information $1/I(\hat{\theta}_{ML})$, and the accuracy of this approximation improves with increasing sample size $n$. Example 2.9 is a special case where this equality holds exactly for any sample size.
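The identity above can be checked numerically. The following R code is a minimal sketch, not part of the original text: the simulated data, the value of $\sigma^2$ and the helper `loglik` are illustrative assumptions. It evaluates the log-likelihood kernel of Example 2.9, approximates its curvature at the MLE with `optimHess`, and compares the inverse observed Fisher information with $\sigma^2/n$.

```r
set.seed(1)
sigma2 <- 4                                      # known variance (assumed value)
n      <- 50
x      <- rnorm(n, mean = 2, sd = sqrt(sigma2))  # simulated data for illustration

## log-likelihood kernel of mu for known sigma^2 (Example 2.9)
loglik <- function(mu) -sum((x - mu)^2) / (2 * sigma2)

mu_ml <- mean(x)                                 # MLE of mu

## observed Fisher information: negative (numerical) Hessian at the MLE
obs_fisher <- -optimHess(mu_ml, loglik)

obs_fisher            # numerically equal to n / sigma2 = 12.5
1 / obs_fisher        # matches Var(mu_ml) = sigma2 / n = 0.08
```

Because the log-likelihood is exactly quadratic in $\mu$, the numerical curvature agrees with the analytic value $n/\sigma^2$ up to rounding.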
Example 2.10 (Binomial model) The score function of a binomial observation $X = x$ with $X \sim \operatorname{Bin}(n, \pi)$ is

$$S(\pi) = \frac{dl(\pi)}{d\pi} = \frac{x}{\pi} - \frac{n - x}{1 - \pi}$$

and has been derived already in Example 2.1. Taking the derivative of $S(\pi)$ gives the Fisher information

$$I(\pi) = -\frac{d^2 l(\pi)}{d\pi^2} = -\frac{dS(\pi)}{d\pi} = \frac{x}{\pi^2} + \frac{n - x}{(1 - \pi)^2} = n \left\{ \frac{x/n}{\pi^2} + \frac{(n - x)/n}{(1 - \pi)^2} \right\}.$$

Plugging in the MLE $\hat{\pi}_{ML} = x/n$, we finally obtain the observed Fisher information

$$I(\hat{\pi}_{ML}) = \frac{n}{\hat{\pi}_{ML}(1 - \hat{\pi}_{ML})}.$$

This result is plausible if we take a frequentist point of view and consider the MLE as a random variable. Then

$$\operatorname{Var}(\hat{\pi}_{ML}) = \operatorname{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2} \operatorname{Var}(X) = \frac{1}{n^2}\, n\pi(1 - \pi) = \frac{\pi(1 - \pi)}{n},$$

so the variance of $\hat{\pi}_{ML}$ has the same form as the inverse observed Fisher information; the only difference is that the MLE $\hat{\pi}_{ML}$ is replaced by the true (and unknown) parameter $\pi$. The inverse observed Fisher information is hence an estimate of the variance of the MLE.

How does the observed Fisher information change if we reparametrise our statistical model? Here is the answer to this question.

Result 2.1 (Observed Fisher information after reparametrisation) Let $I_\theta(\hat{\theta}_{ML})$ denote the observed Fisher information of a scalar parameter $\theta$ and suppose that $\phi = h(\theta)$ is a one-to-one transformation of $\theta$. The observed Fisher information $I_\phi(\hat{\phi}_{ML})$ of $\phi$ is then

$$I_\phi(\hat{\phi}_{ML}) = I_\theta(\hat{\theta}_{ML}) \left\{ \frac{dh^{-1}(\hat{\phi}_{ML})}{d\phi} \right\}^2 = I_\theta(\hat{\theta}_{ML}) \left\{ \frac{dh(\hat{\theta}_{ML})}{d\theta} \right\}^{-2}. \qquad (2.3)$$

Proof The transformation $h$ is assumed to be one-to-one, so $\theta = h^{-1}(\phi)$ and $l_\phi(\phi) = l_\theta\{h^{-1}(\phi)\}$. Application of the chain rule gives

$$S_\phi(\phi) = \frac{dl_\phi(\phi)}{d\phi} = \frac{dl_\theta\{h^{-1}(\phi)\}}{d\phi} = \frac{dl_\theta(\theta)}{d\theta} \cdot \frac{dh^{-1}(\phi)}{d\phi} = S_\theta(\theta) \cdot \frac{dh^{-1}(\phi)}{d\phi}. \qquad (2.4)$$

The second derivative of $l_\phi(\phi)$ can be computed using the product and chain rules:

$$I_\phi(\phi) = -\frac{dS_\phi(\phi)}{d\phi} = -\frac{d}{d\phi}\left\{ S_\theta(\theta) \cdot \frac{dh^{-1}(\phi)}{d\phi} \right\}
= -\frac{dS_\theta(\theta)}{d\phi} \cdot \frac{dh^{-1}(\phi)}{d\phi} - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}
= -\frac{dS_\theta(\theta)}{d\theta} \cdot \left\{ \frac{dh^{-1}(\phi)}{d\phi} \right\}^2 - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}
= I_\theta(\theta) \left\{ \frac{dh^{-1}(\phi)}{d\phi} \right\}^2 - S_\theta(\theta) \cdot \frac{d^2 h^{-1}(\phi)}{d\phi^2}.$$

Evaluating $I_\phi(\phi)$ at the MLE $\phi = \hat{\phi}_{ML}$ (so $\theta = \hat{\theta}_{ML}$) leads to the first equation in (2.3) (note that $S_\theta(\hat{\theta}_{ML}) = 0$). The second equation follows with

$$\frac{dh^{-1}(\phi)}{d\phi} = \left\{ \frac{dh(\theta)}{d\theta} \right\}^{-1} \quad\text{for}\quad \frac{dh(\theta)}{d\theta} \neq 0. \qquad (2.5)$$

Example 2.11 (Binomial model) In Example 2.6 we saw that the MLE of the odds $\omega = \pi/(1 - \pi)$ is $\hat{\omega}_{ML} = x/(n - x)$. What is the corresponding observed Fisher information? First, we compute the derivative of $h(\pi) = \pi/(1 - \pi)$, which is

$$\frac{dh(\pi)}{d\pi} = \frac{1}{(1 - \pi)^2}.$$

Using the observed Fisher information of $\pi$ derived in Example 2.10, we obtain

$$I_\omega(\hat{\omega}_{ML}) = I_\pi(\hat{\pi}_{ML}) \left\{ \frac{dh(\hat{\pi}_{ML})}{d\pi} \right\}^{-2} = \frac{n}{\hat{\pi}_{ML}(1 - \hat{\pi}_{ML})} \cdot (1 - \hat{\pi}_{ML})^4 = n\, \frac{(1 - \hat{\pi}_{ML})^3}{\hat{\pi}_{ML}} = \frac{(n - x)^3}{nx}.$$

As a function of $x$ for fixed $n$, the observed Fisher information $I_\omega(\hat{\omega}_{ML})$ is monotonically decreasing (the numerator is monotonically decreasing, and the denominator is monotonically increasing). In other words, the observed Fisher information increases with decreasing MLE $\hat{\omega}_{ML}$. The observed Fisher information of the log odds $\phi = \log(\omega)$ can be similarly computed, and we obtain

$$I_\phi(\hat{\phi}_{ML}) = I_\omega(\hat{\omega}_{ML}) \left( \frac{1}{\hat{\omega}_{ML}} \right)^{-2} = \frac{(n - x)^3}{nx} \cdot \frac{x^2}{(n - x)^2} = \frac{x(n - x)}{n}.$$

Note that $I_\phi(\hat{\phi}_{ML})$ does not change if we redefine successes as failures and vice versa. This is also the case for the observed Fisher information $I_\pi(\hat{\pi}_{ML})$ but not for $I_\omega(\hat{\omega}_{ML})$.
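Result 2.1 can be verified numerically for Example 2.11. The following R sketch is illustrative only and not from the original text: the data $x = 7$, $n = 20$ are invented. It transforms the observed Fisher information of $\pi$ to the odds scale via (2.3) and compares the result with both the closed form $(n-x)^3/(nx)$ and a direct numerical curvature of the log-likelihood written in terms of $\omega$.

```r
x <- 7; n <- 20                         # invented data for illustration

pi_ml    <- x / n                       # MLE of pi
omega_ml <- x / (n - x)                 # MLE of the odds omega = pi / (1 - pi)

## observed Fisher information of pi (Example 2.10)
I_pi <- n / (pi_ml * (1 - pi_ml))

## transform to the odds scale via (2.3): I_omega = I_pi * {dh(pi_ml)/dpi}^(-2)
dh_dpi  <- 1 / (1 - pi_ml)^2
I_omega <- I_pi / dh_dpi^2

## direct check: negative numerical curvature of the log-likelihood in omega,
## using l(omega) = x * log(omega) - n * log(1 + omega)
loglik_omega <- function(omega) x * log(omega) - n * log(1 + omega)
I_omega_num  <- -optimHess(omega_ml, loglik_omega)

c(closed_form = (n - x)^3 / (n * x), via_2.3 = I_omega, numerical = I_omega_num)
```

All three numbers agree (about 15.69 for these invented data), illustrating that (2.3) simply rescales the curvature by the squared derivative of the transformation.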
2.3 Numerical Computation of the Maximum Likelihood Estimate

Explicit formulas for the MLE and the observed Fisher information can typically only be derived in simple models. In more complex models, numerical techniques have to be applied to compute the maximum and the curvature of the log-likelihood function. We first describe the application of general-purpose optimisation algorithms to this setting and will discuss the Expectation-Maximisation (EM) algorithm in Sect. 2.3.2.

2.3.1 Numerical Optimisation

Application of the Newton–Raphson algorithm (cf. Appendix C.1.3) requires the first two derivatives of the function to be maximised, so for maximising the log-likelihood function we need the score function and the Fisher information. Iterative application of the equation

$$\theta^{(t+1)} = \theta^{(t)} + \frac{S(\theta^{(t)})}{I(\theta^{(t)})}$$

gives, after convergence (i.e. $\theta^{(t+1)} = \theta^{(t)}$), the MLE $\hat{\theta}_{ML}$. As a by-product, the observed Fisher information $I(\hat{\theta}_{ML})$ can also be extracted.

To apply the Newton–Raphson algorithm in R, the function optim can conveniently be used; see Appendix C.1.3 for details. We need to pass the log-likelihood function as an argument to optim. Explicitly passing the score function to optim typically accelerates convergence. If the derivative is not available, it can sometimes be computed symbolically using the R function deriv. Generally, no derivatives need to be passed to optim because it can approximate them numerically. Particularly useful is the option hessian = TRUE, in which case optim will also return the negative observed Fisher information.
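The following R sketch illustrates both approaches for the binomial log-likelihood of Example 2.10. It is not the book's own implementation: the data $x = 7$, $n = 20$, the starting value and the box constraints are invented for illustration.

```r
x <- 7; n <- 20                          # invented data for illustration

loglik <- function(pi) x * log(pi) + (n - x) * log(1 - pi)
score  <- function(pi) x / pi - (n - x) / (1 - pi)
fisher <- function(pi) x / pi^2 + (n - x) / (1 - pi)^2

## Newton-Raphson: theta^(t+1) = theta^(t) + S(theta^(t)) / I(theta^(t))
pi_t <- 0.5                              # starting value
for (iter in 1:25) {
  pi_new <- pi_t + score(pi_t) / fisher(pi_t)
  if (abs(pi_new - pi_t) < 1e-10) break
  pi_t <- pi_new
}
pi_t                                     # converges to the MLE x / n = 0.35

## The same with optim(); fnscale = -1 requests maximisation of the log-likelihood
fit <- optim(par = 0.5, fn = loglik, gr = score,
             method = "L-BFGS-B", lower = 1e-3, upper = 1 - 1e-3,
             control = list(fnscale = -1), hessian = TRUE)
fit$par                                  # MLE
-fit$hessian                             # observed Fisher information
fisher(fit$par)                          # analytic value n / (pi_ml * (1 - pi_ml)) at the MLE
```

For this example the Newton–Raphson iteration converges after a couple of steps, and the negative Hessian returned by optim agrees with the analytic observed Fisher information from Example 2.10.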