Fisher Information and Cramér-Rao Bound

Math 541: Statistical Theory II
Instructor: Songfeng Zheng

In parameter estimation problems, we obtain information about the parameter from a sample of data coming from the underlying probability distribution. A natural question is: how much information can a sample of data provide about the unknown parameter? This section introduces such a measure of information. We will also see that this measure can be used to find bounds on the variance of estimators, to approximate the sampling distribution of an estimator obtained from a large sample, and further to obtain an approximate confidence interval in the large-sample case.

Throughout this section, we consider a random variable $X$ whose pdf or pmf is $f(x|\theta)$, where $\theta$ is an unknown parameter and $\theta \in \Theta$, with $\Theta$ the parameter space.

1 Fisher Information

Motivation: Intuitively, if an event has small probability, then the occurrence of this event brings us much information. For a random variable $X \sim f(x|\theta)$, if $\theta$ were the true value of the parameter, the likelihood function should take a large value, or equivalently, the derivative of the log-likelihood function should be close to zero; this is the basic principle of maximum likelihood estimation. We define $l(x|\theta) = \log f(x|\theta)$ as the log-likelihood function, and

$$ l'(x|\theta) = \frac{\partial}{\partial\theta} \log f(x|\theta) = \frac{f'(x|\theta)}{f(x|\theta)}, $$

where $f'(x|\theta)$ is the derivative of $f(x|\theta)$ with respect to $\theta$. Similarly, we denote the second-order derivative of $f(x|\theta)$ with respect to $\theta$ by $f''(x|\theta)$.

According to the above analysis, if $l'(X|\theta)$ is close to zero, this is what we expect, so the observation does not provide much information about $\theta$; on the other hand, if $|l'(X|\theta)|$, or equivalently $[l'(X|\theta)]^2$, is large, the observation provides much information about $\theta$. Thus, we can use $[l'(X|\theta)]^2$ to measure the amount of information provided by $X$. However, since $X$ is a random variable, we should consider the average case. We therefore introduce the following definition: the Fisher information (for $\theta$) contained in the random variable $X$ is defined as

$$ I(\theta) = E_\theta\left\{[l'(X|\theta)]^2\right\} = \int [l'(x|\theta)]^2 f(x|\theta)\,dx. \qquad (1) $$

We assume that we can exchange the order of differentiation and integration; then

$$ \int f'(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx = 0. $$

Similarly,

$$ \int f''(x|\theta)\,dx = \frac{\partial^2}{\partial\theta^2}\int f(x|\theta)\,dx = 0. $$

It is easy to see that

$$ E_\theta[l'(X|\theta)] = \int l'(x|\theta)\,f(x|\theta)\,dx = \int \frac{f'(x|\theta)}{f(x|\theta)}\,f(x|\theta)\,dx = \int f'(x|\theta)\,dx = 0. $$

Therefore, the definition of Fisher information in (1) can be rewritten as

$$ I(\theta) = \mathrm{Var}_\theta[l'(X|\theta)]. \qquad (2) $$

Also, notice that

$$ l''(x|\theta) = \frac{\partial}{\partial\theta}\left[\frac{f'(x|\theta)}{f(x|\theta)}\right] = \frac{f''(x|\theta)f(x|\theta) - [f'(x|\theta)]^2}{[f(x|\theta)]^2} = \frac{f''(x|\theta)}{f(x|\theta)} - [l'(x|\theta)]^2. $$

Therefore,

$$ E_\theta[l''(X|\theta)] = \int \left\{\frac{f''(x|\theta)}{f(x|\theta)} - [l'(x|\theta)]^2\right\} f(x|\theta)\,dx = \int f''(x|\theta)\,dx - E_\theta\left\{[l'(X|\theta)]^2\right\} = -I(\theta). $$

Finally, we have another formula for the Fisher information:

$$ I(\theta) = -E_\theta[l''(X|\theta)] = -\int \left[\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right] f(x|\theta)\,dx. \qquad (3) $$

To summarize, we have three methods to calculate Fisher information: equations (1), (2), and (3). In many problems, using (3) is the most convenient choice.
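As an illustrative aside (an addition, not part of the original notes), the short Python sketch below estimates the Fisher information of a Bernoulli($\theta$) variable by Monte Carlo, using both the squared-score form (1) and the negative-second-derivative form (3). The two estimates should agree with each other and with the value $1/[\theta(1-\theta)]$ derived analytically in Example 1 below; the chosen value $\theta = 0.3$ is arbitrary.

import numpy as np

# Monte Carlo sketch: estimate I(theta) for Bernoulli(theta) via formulas (1) and (3).
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)          # X_1, ..., X_N ~ Bernoulli(theta)

# Score l'(x|theta) = x/theta - (1-x)/(1-theta)
score = x / theta - (1 - x) / (1 - theta)
I_from_eq1 = np.mean(score ** 2)                  # formula (1): E[(l')^2]

# Second derivative l''(x|theta) = -x/theta^2 - (1-x)/(1-theta)^2
second = -x / theta**2 - (1 - x) / (1 - theta) ** 2
I_from_eq3 = -np.mean(second)                     # formula (3): -E[l'']

print(I_from_eq1, I_from_eq3, 1 / (theta * (1 - theta)))  # all approximately 4.76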
Example 1: Suppose the random variable $X$ has a Bernoulli distribution for which the parameter $\theta$ is unknown ($0 < \theta < 1$). We shall determine the Fisher information $I(\theta)$ in $X$.

The point mass function of $X$ is

$$ f(x|\theta) = \theta^x (1-\theta)^{1-x} \quad \text{for } x = 0 \text{ or } x = 1. $$

Therefore

$$ l(x|\theta) = \log f(x|\theta) = x\log\theta + (1-x)\log(1-\theta), $$

and

$$ l'(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta} \qquad \text{and} \qquad l''(x|\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}. $$

Since $E(X) = \theta$, the Fisher information is

$$ I(\theta) = -E[l''(X|\theta)] = \frac{E(X)}{\theta^2} + \frac{1-E(X)}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}. $$

Example 2: Suppose that $X \sim N(\mu, \sigma^2)$, where $\mu$ is unknown but the value of $\sigma^2$ is given. Find the Fisher information $I(\mu)$ in $X$.

For $-\infty < x < \infty$, we have

$$ l(x|\mu) = \log f(x|\mu) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}. $$

Hence,

$$ l'(x|\mu) = \frac{x-\mu}{\sigma^2} \qquad \text{and} \qquad l''(x|\mu) = -\frac{1}{\sigma^2}. $$

It follows that the Fisher information is

$$ I(\mu) = -E[l''(X|\mu)] = \frac{1}{\sigma^2}. $$

If we make a transformation of the parameter, we obtain a different expression for the Fisher information under the new parameterization. More specifically, let $X$ be a random variable for which the pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown but must lie in a space $\Theta$. Let $I_0(\theta)$ denote the Fisher information in $X$. Suppose now the parameter $\theta$ is replaced by a new parameter $\mu$, where $\theta = \phi(\mu)$ and $\phi$ is a differentiable function. Let $I_1(\mu)$ denote the Fisher information in $X$ when the parameter is regarded as $\mu$. Then we have

$$ I_1(\mu) = [\phi'(\mu)]^2 I_0[\phi(\mu)]. $$

Proof: Let $g(x|\mu)$ be the pdf or pmf of $X$ when $\mu$ is regarded as the parameter. Then $g(x|\mu) = f[x|\phi(\mu)]$. Therefore,

$$ \log g(x|\mu) = \log f[x|\phi(\mu)] = l[x|\phi(\mu)], $$

and

$$ \frac{\partial}{\partial\mu}\log g(x|\mu) = l'[x|\phi(\mu)]\,\phi'(\mu). $$

It follows that

$$ I_1(\mu) = E\left\{\left[\frac{\partial}{\partial\mu}\log g(X|\mu)\right]^2\right\} = [\phi'(\mu)]^2\, E\left(\left\{l'[X|\phi(\mu)]\right\}^2\right) = [\phi'(\mu)]^2 I_0[\phi(\mu)]. $$

This will be verified in the exercise problems.

Suppose now that we have a random sample $X_1, \dots, X_n$ coming from a distribution for which the pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. Let us calculate the amount of information the random sample $X_1, \dots, X_n$ provides for $\theta$. Denote the joint pdf of $X_1, \dots, X_n$ by

$$ f_n(x|\theta) = \prod_{i=1}^n f(x_i|\theta); $$

then

$$ l_n(x|\theta) = \log f_n(x|\theta) = \sum_{i=1}^n \log f(x_i|\theta) = \sum_{i=1}^n l(x_i|\theta), $$

and

$$ l_n'(x|\theta) = \frac{f_n'(x|\theta)}{f_n(x|\theta)}. \qquad (4) $$

We define the Fisher information $I_n(\theta)$ in the random sample $X_1, \dots, X_n$ as

$$ I_n(\theta) = E_\theta\left\{[l_n'(X|\theta)]^2\right\} = \int\cdots\int [l_n'(x|\theta)]^2 f_n(x|\theta)\,dx_1\cdots dx_n, $$

which is an $n$-dimensional integral. We again assume that we can exchange the order of differentiation and integration; then we have

$$ \int f_n'(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f_n(x|\theta)\,dx = 0 $$

and

$$ \int f_n''(x|\theta)\,dx = \frac{\partial^2}{\partial\theta^2}\int f_n(x|\theta)\,dx = 0. $$

It is easy to see that

$$ E_\theta[l_n'(X|\theta)] = \int l_n'(x|\theta)\,f_n(x|\theta)\,dx = \int \frac{f_n'(x|\theta)}{f_n(x|\theta)}\,f_n(x|\theta)\,dx = \int f_n'(x|\theta)\,dx = 0. \qquad (5) $$

Therefore, the definition of Fisher information for the sample $X_1, \dots, X_n$ can be rewritten as

$$ I_n(\theta) = \mathrm{Var}_\theta[l_n'(X|\theta)]. $$

It can be shown in a similar way that the Fisher information can also be calculated as

$$ I_n(\theta) = -E_\theta[l_n''(X|\theta)]. $$

From the definition of $l_n(x|\theta)$, it follows that

$$ l_n''(x|\theta) = \sum_{i=1}^n l''(x_i|\theta). $$

Therefore, the Fisher information is

$$ I_n(\theta) = -E_\theta[l_n''(X|\theta)] = -E_\theta\left[\sum_{i=1}^n l''(X_i|\theta)\right] = -\sum_{i=1}^n E_\theta[l''(X_i|\theta)] = nI(\theta). $$

In other words, the Fisher information in a random sample of size $n$ is simply $n$ times the Fisher information in a single observation.

Example 3: Suppose $X_1, \dots, X_n$ form a random sample from a Bernoulli distribution for which the parameter $\theta$ is unknown ($0 < \theta < 1$). Then the Fisher information $I_n(\theta)$ in this sample is

$$ I_n(\theta) = nI(\theta) = \frac{n}{\theta(1-\theta)}. $$

Example 4: Let $X_1, \dots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where $\mu$ is unknown but the value of $\sigma^2$ is given. Then the Fisher information $I_n(\mu)$ in this sample is

$$ I_n(\mu) = nI(\mu) = \frac{n}{\sigma^2}. $$
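As a quick numerical sanity check (again, an addition rather than part of the original notes), the sketch below estimates $I_n(\theta) = \mathrm{Var}_\theta[l_n'(X|\theta)]$ for Bernoulli samples of size $n$ by simulating many samples and compares the result with $n/[\theta(1-\theta)]$ from Example 3; the values of $\theta$, $n$, and the number of replications are arbitrary choices.

import numpy as np

# Monte Carlo sketch: check I_n(theta) = n * I(theta) for Example 3 by estimating
# the variance of the sample score over many independent samples of size n.
rng = np.random.default_rng(1)
theta, n, reps = 0.3, 25, 100_000
x = rng.binomial(1, theta, size=(reps, n))           # reps samples, each of size n

# Sample score l_n'(x|theta) = sum_i [x_i/theta - (1 - x_i)/(1 - theta)]
scores = (x / theta - (1 - x) / (1 - theta)).sum(axis=1)

print(scores.var())                  # approximately n / (theta * (1 - theta))
print(n / (theta * (1 - theta)))     # = 119.05 for theta = 0.3, n = 25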
2 Cramér-Rao Lower Bound and Asymptotic Distribution of Maximum Likelihood Estimators

Suppose that we have a random sample $X_1, \dots, X_n$ coming from a distribution for which the pdf or pmf is $f(x|\theta)$, where the value of the parameter $\theta$ is unknown. We will show how to use Fisher information to determine a lower bound for the variance of an estimator of the parameter $\theta$.

Let $\hat\theta = r(X_1, \dots, X_n) = r(X)$ be an arbitrary estimator of $\theta$. Assume $E_\theta(\hat\theta) = m(\theta)$ and that the variance of $\hat\theta$ is finite. Consider the random variable $l_n'(X|\theta)$ defined in (4); it was shown in (5) that $E_\theta[l_n'(X|\theta)] = 0$. Therefore, the covariance between $\hat\theta$ and $l_n'(X|\theta)$ is

$$ \begin{aligned} \mathrm{Cov}_\theta[\hat\theta, l_n'(X|\theta)] &= E_\theta\left\{[\hat\theta - E_\theta(\hat\theta)][l_n'(X|\theta) - E_\theta(l_n'(X|\theta))]\right\} = E_\theta\left\{[r(X) - m(\theta)]\,l_n'(X|\theta)\right\} \\ &= E_\theta[r(X)\,l_n'(X|\theta)] - m(\theta)\,E_\theta[l_n'(X|\theta)] = E_\theta[r(X)\,l_n'(X|\theta)] \\ &= \int\cdots\int r(x)\,l_n'(x|\theta)\,f_n(x|\theta)\,dx_1\cdots dx_n \\ &= \int\cdots\int r(x)\,f_n'(x|\theta)\,dx_1\cdots dx_n \qquad \text{(using Equation (4))} \\ &= \frac{\partial}{\partial\theta}\int\cdots\int r(x)\,f_n(x|\theta)\,dx_1\cdots dx_n \\ &= \frac{\partial}{\partial\theta}E_\theta[\hat\theta] = m'(\theta). \end{aligned} \qquad (6) $$

By the Cauchy-Schwarz inequality and the definition of $I_n(\theta)$,

$$ \left\{\mathrm{Cov}_\theta[\hat\theta, l_n'(X|\theta)]\right\}^2 \le \mathrm{Var}_\theta[\hat\theta]\,\mathrm{Var}_\theta[l_n'(X|\theta)] = \mathrm{Var}_\theta[\hat\theta]\,I_n(\theta), $$

i.e.,

$$ [m'(\theta)]^2 \le \mathrm{Var}_\theta[\hat\theta]\,I_n(\theta) = nI(\theta)\,\mathrm{Var}_\theta[\hat\theta]. $$

Finally, we obtain the lower bound on the variance of an arbitrary estimator $\hat\theta$:

$$ \mathrm{Var}_\theta[\hat\theta] \ge \frac{[m'(\theta)]^2}{nI(\theta)}. \qquad (7) $$

The inequality (7) is called the information inequality, and is also known as the Cramér-Rao inequality in honor of the Swedish statistician H. Cramér.
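To make the bound concrete, here is a small simulation sketch (an addition, not from the original notes). For an unbiased estimator we have $m(\theta) = \theta$ and $m'(\theta) = 1$, so (7) reads $\mathrm{Var}_\theta[\hat\theta] \ge 1/[nI(\theta)]$. For Bernoulli data the sample mean attains this bound, since $\mathrm{Var}(\bar X) = \theta(1-\theta)/n = 1/[nI(\theta)]$; the parameter values below are arbitrary.

import numpy as np

# Monte Carlo sketch: the sample mean of Bernoulli data is unbiased and its
# simulated variance matches the Cramer-Rao lower bound 1 / (n * I(theta)).
rng = np.random.default_rng(2)
theta, n, reps = 0.3, 25, 200_000
x = rng.binomial(1, theta, size=(reps, n))

theta_hat = x.mean(axis=1)            # unbiased estimator r(X) = sample mean
print(theta_hat.var())                # simulated variance of the estimator
print(theta * (1 - theta) / n)        # lower bound 1/(n*I(theta)) = 0.0084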