
Fisher information matrix for Gaussian and categorical distributions

Jakub M. Tomczak

November 28, 2012

1 Notations

Let $x$ be a random variable. Consider a parametric distribution of $x$ with parameters $\theta$, $p(x|\theta)$. A continuous random variable $x \in \mathbb{R}$ can be modelled by the normal (Gaussian) distribution:

\[ p(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} = \mathcal{N}(x|\mu, \sigma^2), \tag{1} \]

where $\theta = [\mu\ \ \sigma^2]^T$. A discrete (categorical) variable $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite set of $K$ values, can be modelled by the categorical distribution:¹

\[ p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} = \mathrm{Cat}(x|\theta), \tag{2} \]

where $0 \le \theta_k \le 1$ and $\sum_k \theta_k = 1$. For $\mathcal{X} = \{0, 1\}$ we get a special case of the categorical distribution, the Bernoulli distribution:

\[ p(x|\theta) = \theta^x (1-\theta)^{1-x} = \mathrm{Bern}(x|\theta). \tag{3} \]

¹We use the 1-of-K encoding [1].

2 Fisher information matrix

2.1 Definition

The Fisher score is defined as follows [1]:

\[ g(\theta, x) = \nabla_\theta \ln p(x|\theta). \tag{4} \]

The Fisher information matrix is defined as follows [1]:

\[ F = \mathbb{E}_x\left[ g(\theta, x)\, g(\theta, x)^T \right]. \tag{5} \]

2.2 Example 1: Bernoulli distribution

Let us calculate the Fisher matrix for the Bernoulli distribution (3). First, we take the logarithm:

\[ \ln \mathrm{Bern}(x|\theta) = x \ln \theta + (1-x) \ln(1-\theta). \tag{6} \]

Second, we calculate the derivative:

\[ \frac{d}{d\theta} \ln \mathrm{Bern}(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x-\theta}{\theta(1-\theta)}. \tag{7} \]

Hence, we get the following Fisher score for the Bernoulli distribution:

\[ g(\theta, x) = \frac{x-\theta}{\theta(1-\theta)}. \tag{8} \]

The Fisher information matrix (here a scalar) for the Bernoulli distribution is as follows:

\[ F = \mathbb{E}_x[g(\theta, x)\, g(\theta, x)] = \mathbb{E}_x\left[ \frac{(x-\theta)^2}{(\theta(1-\theta))^2} \right] = \frac{1}{(\theta(1-\theta))^2}\, \mathbb{E}_x\left[ x^2 - 2x\theta + \theta^2 \right] \]
\[ = \frac{1}{(\theta(1-\theta))^2} \left( \mathbb{E}_x[x^2] - 2\theta\, \mathbb{E}_x[x] + \theta^2 \right) = \frac{\theta - 2\theta^2 + \theta^2}{(\theta(1-\theta))^2} = \frac{\theta(1-\theta)}{(\theta(1-\theta))^2} = \frac{1}{\theta(1-\theta)}. \tag{9} \]

2.3 Example 2: Categorical distribution

Let us calculate the Fisher matrix for the categorical distribution (2).
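Before working through the categorical case, the scalar result (9) can be sanity-checked by simulation: averaging the squared Fisher score over many samples should recover $1/(\theta(1-\theta))$. A minimal Monte Carlo sketch in Python (assuming NumPy; the parameter value, sample size, and seed are illustrative choices, not part of the original note):

```python
# Monte Carlo check of the Bernoulli Fisher information F = 1/(theta*(1-theta)).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)  # Bernoulli samples

# Fisher score g(theta, x) = (x - theta) / (theta * (1 - theta)), eq. (8)
g = (x - theta) / (theta * (1 - theta))

F_mc = np.mean(g ** 2)                   # Monte Carlo estimate of E[g^2]
F_exact = 1.0 / (theta * (1 - theta))    # analytic result, eq. (9)

print(F_mc, F_exact)  # the two printed values should be close
```

With $10^6$ samples the relative Monte Carlo error is on the order of a tenth of a percent, so the agreement is easy to see by eye.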
First, we take the logarithm:

\[ \ln \mathrm{Cat}(x|\theta) = \sum_{k=1}^{K} x_k \ln \theta_k. \tag{10} \]

Second, we calculate the partial derivatives:

\[ \frac{\partial}{\partial \theta_k} \ln \mathrm{Cat}(x|\theta) = \frac{x_k}{\theta_k}. \tag{11} \]

Hence, we get the following Fisher score for the categorical distribution:

\[ g(\theta, x) = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}. \tag{12} \]

Now, let us calculate the product of the Fisher score and its transpose:

\[ \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix} \begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix} = \begin{bmatrix} \frac{x_1^2}{\theta_1^2} & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\ \vdots & & \vdots \\ \frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \frac{x_K^2}{\theta_K^2} \end{bmatrix} = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1K} \\ \vdots & & \vdots \\ g_{K1} & g_{K2} & \cdots & g_{KK} \end{bmatrix}. \tag{13} \]

Therefore, for $g_{kk}$ we have (since $x_k \in \{0, 1\}$, $\mathbb{E}_x[x_k^2] = \mathbb{E}_x[x_k] = \theta_k$):

\[ \mathbb{E}_x[g_{kk}] = \mathbb{E}_x\left[ \frac{x_k^2}{\theta_k^2} \right] = \frac{1}{\theta_k^2}\, \mathbb{E}_x[x_k^2] = \frac{1}{\theta_k}, \tag{14} \]

and for $g_{ij}$, $i \ne j$ (in the 1-of-K encoding exactly one $x_k$ equals 1, so $x_i x_j = 0$ whenever $i \ne j$):

\[ \mathbb{E}_x[g_{ij}] = \mathbb{E}_x\left[ \frac{x_i x_j}{\theta_i \theta_j} \right] = \frac{1}{\theta_i \theta_j}\, \mathbb{E}_x[x_i x_j] = 0. \tag{15} \]

Finally, we get:

\[ F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}. \tag{16} \]

2.4 Example 3: Normal distribution

Let us calculate the Fisher matrix for the univariate normal distribution (1). First, we take the logarithm:

\[ \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \sigma^2 - \frac{1}{2\sigma^2}(x-\mu)^2. \tag{17} \]

Second, we calculate the partial derivatives:

\[ \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sigma^2}(x-\mu), \tag{18} \]

\[ \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2. \tag{19} \]

Hence, we get the following Fisher score for the normal distribution:

\[ g(\theta, x) = \begin{bmatrix} \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) \end{bmatrix} = \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}. \tag{20} \]

Now, let us calculate the product of the Fisher score and its transpose:

\[ \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) & \ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix} \]
\[ = \begin{bmatrix} \frac{1}{\sigma^4}(x-\mu)^2 & -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \\ -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}(x-\mu)^2 + \frac{1}{4\sigma^8}(x-\mu)^4 \end{bmatrix} = \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix}, \tag{21} \]

where $g_{12} = g_{21}$. In order to calculate the Fisher information matrix we need to determine the expected value of all $g_{ij}$.²
Hence, for $g_{11}$:

\[ \mathbb{E}_x[g_{11}] = \mathbb{E}_x\left[ \frac{1}{\sigma^4}(x-\mu)^2 \right] = \frac{1}{\sigma^4}\left( \mathbb{E}_x[x^2] - 2\mu\,\mathbb{E}_x[x] + \mu^2 \right) = \frac{1}{\sigma^4}\left( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \right) = \frac{1}{\sigma^2}, \tag{22} \]

and for $g_{12}$:

\[ \mathbb{E}_x[g_{12}] = \mathbb{E}_x\left[ -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \right] = -\frac{1}{2\sigma^4}\left( \mathbb{E}_x[x] - \mu \right) + \frac{1}{2\sigma^6}\,\mathbb{E}_x\left[ x^3 - 3x^2\mu + 3x\mu^2 - \mu^3 \right] \]
\[ = \frac{1}{2\sigma^6}\left( \mathbb{E}_x[x^3] - 3\mu\,\mathbb{E}_x[x^2] + 3\mu^2\,\mathbb{E}_x[x] - \mu^3 \right) = \frac{1}{2\sigma^6}\left( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \right) = 0, \tag{23} \]

and for $g_{22}$:

\[ \mathbb{E}_x[g_{22}] = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\,\mathbb{E}_x\left[ x^2 - 2x\mu + \mu^2 \right] + \frac{1}{4\sigma^8}\,\mathbb{E}_x\left[ x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4 \right] \]
\[ = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\,\sigma^2 + \frac{1}{4\sigma^8}\cdot 3\sigma^4 = \frac{1}{2\sigma^4}. \tag{24} \]

Finally, we get:

\[ F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}. \tag{25} \]

²See Section 3 for the raw moments of the univariate normal distribution.

2.5 Summary

The Fisher information matrix for each distribution:

• Bernoulli distribution: $F = \frac{1}{\theta(1-\theta)}$;

• Categorical distribution: $F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}$;

• Normal distribution: $F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}$.

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

Order   Expression            Raw moment
1       $\mathbb{E}_x[x]$     $\mu$
2       $\mathbb{E}_x[x^2]$   $\mu^2 + \sigma^2$
3       $\mathbb{E}_x[x^3]$   $\mu^3 + 3\mu\sigma^2$
4       $\mathbb{E}_x[x^4]$   $\mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
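As a closing aside, the remaining two results in the summary can also be verified numerically by averaging the score outer product over samples. A minimal Monte Carlo sketch in Python (assuming NumPy; the parameter values, sample size, and seed are illustrative choices, not part of the original note):

```python
# Monte Carlo check of the categorical and normal Fisher information matrices.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Categorical: Fisher scores x_k / theta_k stacked into a vector, eq. (12).
theta = np.array([0.2, 0.3, 0.5])
x = rng.multinomial(1, theta, size=n)   # 1-of-K encoded samples, shape (n, 3)
g_cat = x / theta                       # each row is one Fisher score
F_cat = g_cat.T @ g_cat / n             # Monte Carlo estimate of E[g g^T]
# expected: diag(1/theta), with off-diagonal entries ~0, eq. (16)

# Normal: Fisher score from eq. (20).
mu, var = 1.0, 2.0
y = rng.normal(mu, np.sqrt(var), size=n)
g_norm = np.stack([(y - mu) / var,
                   -1.0 / (2 * var) + (y - mu) ** 2 / (2 * var ** 2)])
F_norm = g_norm @ g_norm.T / n          # Monte Carlo estimate of E[g g^T]
# expected: diag(1/var, 1/(2*var^2)), eq. (25)

print(np.round(F_cat, 3))
print(np.round(F_norm, 3))
```

Note that the off-diagonal entries of the categorical estimate are exactly zero, since $x_i x_j = 0$ for $i \ne j$ in the 1-of-K encoding; only the diagonal entries carry Monte Carlo noise.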