Fisher for Gaussian and categorical distributions

Jakub M. Tomczak November 28, 2012

1 Notations

Let x be a random variable. Consider a parametric distribution of x with parameters θ, p(x|θ). A continuous random variable x ∈ R can be modelled by the normal (Gaussian) distribution:

    p(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} = \mathcal{N}(x|\mu, \sigma^2),    (1)

where θ = [µ σ²]ᵀ. A discrete (categorical) variable x ∈ X, where X is a finite set of K values, can be modelled by the categorical distribution:¹

    p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} = \mathrm{Cat}(x|\theta),    (2)

where 0 ≤ θ_k ≤ 1 and \sum_k \theta_k = 1. For X = {0, 1} we get a special case of the categorical distribution, the Bernoulli distribution,

    p(x|\theta) = \theta^x (1 - \theta)^{1-x} = \mathrm{Bern}(x|\theta).    (3)

2 Fisher information matrix

2.1 Definition

The Fisher score is defined as follows [1]:

    g(\theta, x) = \nabla_\theta \ln p(x|\theta).    (4)

The Fisher information matrix is defined as follows [1]:

    F = \mathbb{E}_x\left[ g(\theta, x)\, g(\theta, x)^T \right].    (5)

¹ We use the 1-of-K encoding [1].

2.2 Example 1: Bernoulli distribution

Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we need to take the logarithm:

    \ln \mathrm{Bern}(x|\theta) = x \ln \theta + (1 - x) \ln(1 - \theta).    (6)

Second, we need to calculate the derivative:

    \frac{d}{d\theta} \ln \mathrm{Bern}(x|\theta) = \frac{x}{\theta} - \frac{1 - x}{1 - \theta} = \frac{x - \theta}{\theta(1 - \theta)}.    (7)

Hence, we get the following Fisher score for the Bernoulli distribution:

    g(\theta, x) = \frac{x - \theta}{\theta(1 - \theta)}.    (8)

The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is as follows:

    F = \mathbb{E}_x[ g(\theta, x)\, g(\theta, x) ]
      = \mathbb{E}_x\left[ \frac{(x - \theta)^2}{(\theta(1 - \theta))^2} \right]
      = \frac{1}{(\theta(1 - \theta))^2} \mathbb{E}_x\left[ x^2 - 2x\theta + \theta^2 \right]
      = \frac{1}{(\theta(1 - \theta))^2} \left( \mathbb{E}_x[x^2] - 2\theta\, \mathbb{E}_x[x] + \theta^2 \right)
      = \frac{1}{(\theta(1 - \theta))^2} \left( \theta - 2\theta^2 + \theta^2 \right)
      = \frac{\theta(1 - \theta)}{(\theta(1 - \theta))^2}
      = \frac{1}{\theta(1 - \theta)}.    (9)
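The closed form (9) can be checked numerically. The sketch below (a minimal illustration assuming NumPy; the value of θ, the sample size, and the seed are arbitrary choices) estimates F as the Monte Carlo average of the squared Fisher score (8):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)  # samples from Bern(theta)

# Fisher score g(theta, x) = (x - theta) / (theta (1 - theta)), Eq. (8)
g = (x - theta) / (theta * (1 - theta))

# Monte Carlo estimate of F = E[g^2] vs. the closed form 1 / (theta (1 - theta))
F_mc = np.mean(g**2)
F_closed = 1.0 / (theta * (1 - theta))
print(F_mc, F_closed)
```

The two printed values should agree up to Monte Carlo error.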

2.3 Example 2: Categorical distribution

Let us calculate the Fisher information matrix for the categorical distribution (2). First, we need to take the logarithm:

    \ln \mathrm{Cat}(x|\theta) = \sum_{k=1}^{K} x_k \ln \theta_k.    (10)

Second, we need to calculate the partial derivatives:

    \frac{\partial}{\partial \theta_k} \ln \mathrm{Cat}(x|\theta) = \frac{x_k}{\theta_k}.    (11)

Hence, we get the following Fisher score for the categorical distribution:

    g(\theta, x) = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}.    (12)

Now, let us calculate the product of the Fisher score and its transpose:

    g(\theta, x)\, g(\theta, x)^T
    = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}
      \begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix}
    = \begin{bmatrix}
        \frac{x_1^2}{\theta_1^2} & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\
        \vdots & \vdots & \ddots & \vdots \\
        \frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \frac{x_K^2}{\theta_K^2}
      \end{bmatrix}
    = \begin{bmatrix}
        g_{11} & g_{12} & \cdots & g_{1K} \\
        \vdots & \vdots & \ddots & \vdots \\
        g_{K1} & g_{K2} & \cdots & g_{KK}
      \end{bmatrix}.    (13)

Therefore, for g_{kk} we have (note that in the 1-of-K encoding x_k ∈ {0, 1}, so x_k^2 = x_k and \mathbb{E}_x[x_k^2] = \theta_k):

    \mathbb{E}_x[g_{kk}] = \mathbb{E}_x\left[ \frac{x_k^2}{\theta_k^2} \right] = \frac{1}{\theta_k^2} \mathbb{E}_x[x_k^2] = \frac{1}{\theta_k},    (14)

and for g_{ij}, i ≠ j (exactly one component of x equals 1, hence x_i x_j = 0 whenever i ≠ j):

    \mathbb{E}_x[g_{ij}] = \mathbb{E}_x\left[ \frac{x_i x_j}{\theta_i \theta_j} \right] = \frac{1}{\theta_i \theta_j} \mathbb{E}_x[x_i x_j] = 0.    (15)

Finally, we get:

    F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}.    (16)
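Result (16) can likewise be verified by simulation. The following sketch (assuming NumPy; the probability vector, sample size, and seed are arbitrary choices) averages the outer products of the score vectors (12) over 1-of-K samples:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])
n = 200_000

# Draw one-hot (1-of-K) samples x ~ Cat(theta); shape (n, K)
x = rng.multinomial(1, theta, size=n)

# Fisher score g(theta, x)_k = x_k / theta_k, Eq. (12); shape (n, K)
g = x / theta

# Monte Carlo estimate of F = E[g g^T] vs. diag(1/theta_1, ..., 1/theta_K)
F_mc = g.T @ g / n
F_closed = np.diag(1.0 / theta)
print(np.round(F_mc, 2))
print(F_closed)
```

The off-diagonal entries of the estimate are exactly zero, since x_i x_j = 0 for i ≠ j in every one-hot sample.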

2.4 Example 3: Normal distribution

Let us calculate the Fisher information matrix for the univariate normal distribution (1). First, we need to take the logarithm:

    \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (x - \mu)^2.    (17)

Second, we need to calculate the partial derivatives:

    \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sigma^2} (x - \mu),    (18)

    \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2.    (19)

Hence, we get the following Fisher score for the normal distribution:

    g(\theta, x)
    = \begin{bmatrix} \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) \end{bmatrix}
    = \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}.    (20)

Now, let us calculate the product of the Fisher score and its transpose:

    g(\theta, x)\, g(\theta, x)^T
    = \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}
      \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) & -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}
    = \begin{bmatrix}
        \frac{1}{\sigma^4} (x - \mu)^2 & -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \\
        -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4
      \end{bmatrix}
    = \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix},    (21)

where g_{12} = g_{21}. In order to calculate the Fisher information matrix, we need to determine the expectation of all g_{ij}.² Hence, for g_{11}:

    \mathbb{E}_x[g_{11}] = \mathbb{E}_x\left[ \frac{1}{\sigma^4} (x - \mu)^2 \right]
    = \frac{1}{\sigma^4} \left( \mathbb{E}_x[x^2] - 2\mu\, \mathbb{E}_x[x] + \mu^2 \right)
    = \frac{1}{\sigma^4} \left( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \right)
    = \frac{1}{\sigma^2},    (22)

and for g_{12}:

    \mathbb{E}_x[g_{12}] = \mathbb{E}_x\left[ -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \right]
    = -\frac{1}{2\sigma^4} \left( \mathbb{E}_x[x] - \mu \right) + \frac{1}{2\sigma^6} \mathbb{E}_x\left[ x^3 - 3x^2\mu + 3x\mu^2 - \mu^3 \right]
    = \frac{1}{2\sigma^6} \left( \mathbb{E}_x[x^3] - 3\mu\, \mathbb{E}_x[x^2] + 3\mu^2\, \mathbb{E}_x[x] - \mu^3 \right)
    = \frac{1}{2\sigma^6} \left( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \right)
    = 0,    (23)

and for g_{22}:

    \mathbb{E}_x[g_{22}] = \mathbb{E}_x\left[ \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4 \right]
    = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \mathbb{E}_x\left[ x^2 - 2x\mu + \mu^2 \right] + \frac{1}{4\sigma^8} \mathbb{E}_x\left[ x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4 \right]
    = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \sigma^2 + \frac{1}{4\sigma^8}\, 3\sigma^4
    = \frac{1}{2\sigma^4}.    (24)

Finally, we get:

    F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.    (25)

² See Section 3 for the raw moments of the univariate normal distribution.
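As a numerical sanity check of (25), the sketch below (assuming NumPy; µ, σ², sample size, and seed are arbitrary choices) averages the outer products of the score vectors (20) over Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.5, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=500_000)

# Fisher score, Eq. (20): gradients of ln N(x | mu, sigma^2) w.r.t. mu and sigma^2
g = np.stack([(x - mu) / sigma2,
              -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2**2)])

# Monte Carlo estimate of F = E[g g^T] vs. diag(1/sigma^2, 1/(2 sigma^4))
F_mc = g @ g.T / x.size
F_closed = np.diag([1.0 / sigma2, 1.0 / (2 * sigma2**2)])
print(np.round(F_mc, 3))
print(F_closed)
```

The off-diagonal estimates hover around zero, matching (23).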

2.5 Summary

The Fisher information matrix for each distribution considered above:

• Bernoulli distribution:

    F = \frac{1}{\theta(1 - \theta)},

• Categorical distribution:

    F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\},

• Normal distribution:

    F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

    Order | Expression        | Raw moment
    ------|-------------------|---------------------------
    1     | \mathbb{E}_x[x]   | \mu
    2     | \mathbb{E}_x[x^2] | \mu^2 + \sigma^2
    3     | \mathbb{E}_x[x^3] | \mu^3 + 3\mu\sigma^2
    4     | \mathbb{E}_x[x^4] | \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
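The rows of Table 1 can be confirmed by simulation. The sketch below (assuming NumPy; µ, σ², sample size, and seed are arbitrary choices) compares sample averages of x^k with the closed-form raw moments:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.7, 1.3
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Closed-form raw moments from Table 1, orders k = 1, ..., 4
closed = [mu,
          mu**2 + sigma2,
          mu**3 + 3 * mu * sigma2,
          mu**4 + 6 * mu**2 * sigma2 + 3 * sigma2**2]

# Monte Carlo estimates of E_x[x^k]
mc = [np.mean(x**k) for k in range(1, 5)]
print(np.round(mc, 3))
print(np.round(closed, 3))
```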

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
