Fisher Information Matrix for Gaussian and Categorical Distributions


Jakub M. Tomczak

November 28, 2012

1 Notations

Let x be a random variable, and consider a parametric distribution of x with parameters θ, p(x|θ).

A continuous random variable x ∈ ℝ can be modelled by the normal (Gaussian) distribution:

    p(x|θ) = (1/√(2πσ²)) exp{ −(x − µ)² / (2σ²) }
           = N(x|µ, σ²),                                          (1)

where θ = (µ, σ²)ᵀ.

A discrete (categorical) variable x ∈ X, where X is a finite set of K values, can be modelled by the categorical distribution:

    p(x|θ) = ∏_{k=1}^{K} θ_k^{x_k}
           = Cat(x|θ),                                            (2)

where 0 ≤ θ_k ≤ 1 and Σ_k θ_k = 1; we use the 1-of-K encoding [1]. For X = {0, 1} we obtain a special case of the categorical distribution, the Bernoulli distribution:

    p(x|θ) = θ^x (1 − θ)^{1−x}
           = Bern(x|θ).                                           (3)

2 Fisher information matrix

2.1 Definition

The Fisher score is defined as follows [1]:

    g(θ, x) = ∇_θ ln p(x|θ).                                      (4)

The Fisher information matrix is defined as follows [1]:

    F = E_x[ g(θ, x) g(θ, x)ᵀ ].                                  (5)

2.2 Example 1: Bernoulli distribution

Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we take the logarithm:

    ln Bern(x|θ) = x ln θ + (1 − x) ln(1 − θ).                    (6)

Second, we calculate the derivative:

    (d/dθ) ln Bern(x|θ) = x/θ − (1 − x)/(1 − θ)
                        = (x − θ) / (θ(1 − θ)).                   (7)

Hence, we get the following Fisher score for the Bernoulli distribution:

    g(θ, x) = (x − θ) / (θ(1 − θ)).                               (8)

The Fisher information matrix (here a scalar) for the Bernoulli distribution is:

    F = E_x[ g(θ, x) g(θ, x) ]
      = E_x[ (x − θ)² ] / (θ(1 − θ))²
      = ( E_x[x²] − 2θ E_x[x] + θ² ) / (θ(1 − θ))²
      = ( θ − 2θ² + θ² ) / (θ(1 − θ))²
      = θ(1 − θ) / (θ(1 − θ))²
      = 1 / (θ(1 − θ)).                                           (9)
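The scalar result (9) can be sanity-checked numerically by averaging the squared Fisher score over samples. A minimal Python sketch (the script and its variable names are illustrative additions, not part of the note):

```python
import numpy as np

# Monte Carlo check of the Bernoulli Fisher information F = 1/(theta*(1-theta)).
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)

# Fisher score g(theta, x) = (x - theta) / (theta*(1 - theta)), eq. (8).
score = (x - theta) / (theta * (1.0 - theta))

F_mc = np.mean(score**2)                 # Monte Carlo estimate of E_x[g^2], eq. (5)
F_exact = 1.0 / (theta * (1.0 - theta))  # closed form, eq. (9)
print(F_mc, F_exact)
```

With 10⁶ samples the Monte Carlo estimate typically agrees with the closed form to about two decimal places.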
2.3 Example 2: Categorical distribution

Let us calculate the Fisher information matrix for the categorical distribution (2). First, we take the logarithm:

    ln Cat(x|θ) = Σ_{k=1}^{K} x_k ln θ_k.                         (10)

Second, we calculate the partial derivatives:

    ∂/∂θ_k ln Cat(x|θ) = x_k / θ_k.                               (11)

Hence, we get the following Fisher score for the categorical distribution:

    g(θ, x) = [ x_1/θ_1, …, x_K/θ_K ]ᵀ.                           (12)

Now, let us calculate the product of the Fisher score and its transpose:

    g(θ, x) g(θ, x)ᵀ =
        ⎡ x_1²/θ_1²          x_1x_2/(θ_1θ_2)  ⋯  x_1x_K/(θ_1θ_K) ⎤
        ⎢   ⋮                      ⋱                  ⋮           ⎥
        ⎣ x_Kx_1/(θ_Kθ_1)    x_Kx_2/(θ_Kθ_2)  ⋯  x_K²/θ_K²       ⎦
      = [ g_kl ]_{k,l=1}^{K}.                                     (13)

For the diagonal entries g_kk we have (note that x_k ∈ {0, 1}, so E_x[x_k²] = E_x[x_k] = θ_k):

    E_x[g_kk] = E_x[ x_k² / θ_k² ]
              = (1/θ_k²) E_x[x_k²]
              = 1/θ_k,                                            (14)

and for the off-diagonal entries g_ij, i ≠ j (in the 1-of-K encoding exactly one component equals 1, so x_i x_j = 0):

    E_x[g_ij] = E_x[ x_i x_j / (θ_i θ_j) ]
              = (1/(θ_i θ_j)) E_x[x_i x_j]
              = 0.                                                (15)

Finally, we get:

    F = diag{ 1/θ_1, …, 1/θ_K }.                                  (16)

2.4 Example 3: Normal distribution

Let us calculate the Fisher information matrix for the univariate normal distribution (1). First, we take the logarithm:

    ln N(x|µ, σ²) = −(1/2) ln 2π − (1/2) ln σ² − (x − µ)²/(2σ²).  (17)

Second, we calculate the partial derivatives:

    ∂/∂µ  ln N(x|µ, σ²) = (x − µ)/σ²,                             (18)
    ∂/∂σ² ln N(x|µ, σ²) = −1/(2σ²) + (x − µ)²/(2σ⁴).              (19)

Hence, we get the following Fisher score for the normal distribution:

    g(θ, x) = ⎡ (x − µ)/σ²                  ⎤
              ⎣ −1/(2σ²) + (x − µ)²/(2σ⁴)   ⎦.                    (20)

Now, let us calculate the product of the Fisher score and its transpose:

    g(θ, x) g(θ, x)ᵀ =
        ⎡ (x − µ)²/σ⁴                       −(x − µ)/(2σ⁴) + (x − µ)³/(2σ⁶)           ⎤
        ⎣ −(x − µ)/(2σ⁴) + (x − µ)³/(2σ⁶)  1/(4σ⁴) − (x − µ)²/(2σ⁶) + (x − µ)⁴/(4σ⁸) ⎦
      = ⎡ g_11  g_12 ⎤
        ⎣ g_21  g_22 ⎦,                                           (21)

where g_12 = g_21. In order to calculate the Fisher information matrix we need to determine the expected value of each g_ij (see Section 3 for the raw moments of the univariate normal distribution).
Hence, for g_11:

    E_x[g_11] = (1/σ⁴) E_x[(x − µ)²]
              = (1/σ⁴) ( E_x[x²] − 2µ E_x[x] + µ² )
              = (1/σ⁴) ( µ² + σ² − 2µ² + µ² )
              = 1/σ²,                                             (22)

and for g_12:

    E_x[g_12] = E_x[ −(x − µ)/(2σ⁴) + (x − µ)³/(2σ⁶) ]
              = −( E_x[x] − µ )/(2σ⁴) + (1/(2σ⁶)) E_x[ x³ − 3x²µ + 3xµ² − µ³ ]
              = (1/(2σ⁶)) ( E_x[x³] − 3µ E_x[x²] + 3µ² E_x[x] − µ³ )
              = (1/(2σ⁶)) ( µ³ + 3µσ² − 3µ(µ² + σ²) + 3µ³ − µ³ )
              = 0,                                                (23)

and for g_22:

    E_x[g_22] = E_x[ 1/(4σ⁴) − (x − µ)²/(2σ⁶) + (x − µ)⁴/(4σ⁸) ]
              = 1/(4σ⁴) − (1/(2σ⁶)) E_x[x² − 2xµ + µ²] + (1/(4σ⁸)) E_x[x⁴ − 4x³µ + 6x²µ² − 4xµ³ + µ⁴]
              = 1/(4σ⁴) − (1/(2σ⁶)) σ² + (1/(4σ⁸)) 3σ⁴
              = 1/(4σ⁴) − 1/(2σ⁴) + 3/(4σ⁴)
              = 1/(2σ⁴).                                          (24)

Finally, we get:

    F = ⎡ 1/σ²   0       ⎤
        ⎣ 0      1/(2σ⁴) ⎦.                                       (25)

2.5 Summary

The Fisher information matrix for each distribution:

  • Bernoulli distribution: F = 1/(θ(1 − θ));
  • Categorical distribution: F = diag{ 1/θ_1, …, 1/θ_K };
  • Normal distribution: F = diag{ 1/σ², 1/(2σ⁴) }.

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

    Order | Expression | Raw moment
    ------+------------+------------------
      1   | E_x[x]     | µ
      2   | E_x[x²]    | µ² + σ²
      3   | E_x[x³]    | µ³ + 3µσ²
      4   | E_x[x⁴]    | µ⁴ + 6µ²σ² + 3σ⁴

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
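As a closing check, the categorical and normal results can be verified the same way as the Bernoulli case: estimate F = E_x[g gᵀ] from samples and compare with the closed forms. A minimal Python sketch (an illustrative addition, not part of the note; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Categorical (K = 3): 1-of-K samples, score g_k = x_k / theta_k, eq. (12).
theta = np.array([0.2, 0.3, 0.5])
x = rng.multinomial(1, theta, size=n)   # shape (n, 3), one-hot rows
g = x / theta                           # row-wise Fisher scores
F_cat = g.T @ g / n                     # estimate of E_x[g g^T]
# Closed form, eq. (16): diag(1/theta); off-diagonals vanish since rows are one-hot.

# Normal: scores from eqs. (18)-(19) with theta = (mu, sigma^2).
mu, sigma2 = 1.0, 2.0
xs = rng.normal(mu, np.sqrt(sigma2), size=n)
g1 = (xs - mu) / sigma2
g2 = -0.5 / sigma2 + (xs - mu) ** 2 / (2.0 * sigma2**2)
G = np.stack([g1, g2])                  # shape (2, n)
F_norm = G @ G.T / n                    # estimate of E_x[g g^T]
# Closed form, eq. (25): [[1/sigma2, 0], [0, 1/(2*sigma2^2)]].

print(np.round(F_cat, 2))
print(np.round(F_norm, 3))
```

Both estimates converge to the diagonal closed forms as n grows, which is a useful regression test whenever the score functions are re-derived or re-implemented.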
