Fisher information matrix for Gaussian and categorical distributions
Jakub M. Tomczak
November 28, 2012
1 Notations
Let $x$ be a random variable. Consider a parametric distribution of $x$ with parameters $\theta$, $p(x|\theta)$. A continuous random variable $x \in \mathbb{R}$ can be modelled by the normal (Gaussian) distribution:
\[
p(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} = \mathcal{N}(x|\mu, \sigma^2), \tag{1}
\]
where $\theta = [\mu \;\; \sigma^2]^{\mathrm{T}}$. A discrete (categorical) variable $x \in \mathcal{X}$, where $\mathcal{X}$ is a finite set of $K$ values, can be modelled by the categorical distribution:¹
\[
p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} = \mathrm{Cat}(x|\theta), \tag{2}
\]
where $0 \le \theta_k \le 1$ and $\sum_k \theta_k = 1$. For $\mathcal{X} = \{0, 1\}$ we get a special case of the categorical distribution, the Bernoulli distribution,
\[
p(x|\theta) = \theta^x (1 - \theta)^{1-x} = \mathrm{Bern}(x|\theta). \tag{3}
\]
2 Fisher information matrix
2.1 Definition The Fisher score is determined as follows [1]:
\[
g(\theta, x) = \nabla_\theta \ln p(x|\theta). \tag{4}
\]
The Fisher information matrix is defined as follows [1]:
\[
F = \mathbb{E}_x\!\left[ g(\theta, x)\, g(\theta, x)^{\mathrm{T}} \right]. \tag{5}
\]
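Definition (5) can be checked numerically by averaging the outer product of sampled scores. Below is a minimal sketch in Python with NumPy; the helper `fisher_mc` and its signature are our own illustration, not part of the text.

```python
import numpy as np

def fisher_mc(score, sample, n=100_000, seed=0):
    """Monte Carlo estimate of F = E_x[g(theta, x) g(theta, x)^T], cf. (5).

    score(x)        -- vectorized Fisher score: maps an (n,) sample array
                       to an (n, d) array of score vectors
    sample(rng, n)  -- draws n samples from p(x|theta)
    """
    rng = np.random.default_rng(seed)
    xs = sample(rng, n)
    g = score(xs)           # (n, d) array of scores
    return g.T @ g / n      # average outer product of scores

# Bernoulli example: theta = 0.3, score g = (x - theta) / (theta (1 - theta))
theta = 0.3
F = fisher_mc(lambda x: ((x - theta) / (theta * (1 - theta)))[:, None],
              lambda rng, n: rng.binomial(1, theta, size=n))
# F[0, 0] approximates the analytic value 1 / (theta (1 - theta))
```

The same helper works for any of the distributions below once the score and the sampler are supplied.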
¹We use the 1-of-K encoding [1].
2.2 Example 1: Bernoulli distribution
Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we take the logarithm:
\[
\ln \mathrm{Bern}(x|\theta) = x \ln \theta + (1 - x) \ln(1 - \theta). \tag{6}
\]
Second, we calculate the derivative:
\begin{align}
\frac{\mathrm{d}}{\mathrm{d}\theta} \ln \mathrm{Bern}(x|\theta) &= \frac{x}{\theta} - \frac{1 - x}{1 - \theta} \nonumber\\
&= \frac{x - \theta}{\theta(1 - \theta)}. \tag{7}
\end{align}
Hence, we get the following Fisher score for the Bernoulli distribution:
\[
g(\theta, x) = \frac{x - \theta}{\theta(1 - \theta)}. \tag{8}
\]
The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is as follows:
\begin{align}
F &= \mathbb{E}_x[g(\theta, x)\, g(\theta, x)] \nonumber\\
&= \mathbb{E}_x\!\left[ \frac{(x - \theta)^2}{(\theta(1 - \theta))^2} \right] \nonumber\\
&= \frac{1}{(\theta(1 - \theta))^2}\, \mathbb{E}_x\!\left[ x^2 - 2x\theta + \theta^2 \right] \nonumber\\
&= \frac{1}{(\theta(1 - \theta))^2} \left( \mathbb{E}_x[x^2] - 2\theta\, \mathbb{E}_x[x] + \theta^2 \right) \nonumber\\
&= \frac{1}{(\theta(1 - \theta))^2} \left( \theta - 2\theta^2 + \theta^2 \right) \nonumber\\
&= \frac{\theta(1 - \theta)}{(\theta(1 - \theta))^2} \nonumber\\
&= \frac{1}{\theta(1 - \theta)}. \tag{9}
\end{align}
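Because $x$ only takes the values 0 and 1, the expectation in (9) is a two-term sum, so the closed form can be verified exactly without sampling. A small Python sketch (the function name is ours):

```python
# Exact check of F = 1/(theta*(1-theta)) for the Bernoulli distribution:
# the expectation over x in {0, 1} is a two-term weighted sum.
def bernoulli_fisher(theta):
    g = lambda x: (x - theta) / (theta * (1 - theta))   # Fisher score (8)
    return sum(p * g(x) ** 2 for x, p in [(0, 1 - theta), (1, theta)])

for theta in (0.1, 0.3, 0.5, 0.9):
    assert abs(bernoulli_fisher(theta) - 1 / (theta * (1 - theta))) < 1e-12
```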
2.3 Example 2: Categorical distribution
Let us calculate the Fisher information matrix for the categorical distribution (2). First, we take the logarithm:
\[
\ln \mathrm{Cat}(x|\theta) = \sum_{k=1}^{K} x_k \ln \theta_k. \tag{10}
\]
Second, we calculate the partial derivatives:
\[
\frac{\partial}{\partial \theta_k} \ln \mathrm{Cat}(x|\theta) = \frac{x_k}{\theta_k}. \tag{11}
\]
Hence, we get the following Fisher score for the categorical distribution:
\[
g(\theta, x) = \begin{bmatrix} \dfrac{x_1}{\theta_1} \\ \vdots \\ \dfrac{x_K}{\theta_K} \end{bmatrix}. \tag{12}
\]
Now, let us calculate the product of the Fisher score and its transpose:
\begin{align}
\begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}
\begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix}
&=
\begin{bmatrix}
\frac{x_1^2}{\theta_1^2} & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \frac{x_K^2}{\theta_K^2}
\end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
g_{11} & g_{12} & \cdots & g_{1K} \\
\vdots & \vdots & \ddots & \vdots \\
g_{K1} & g_{K2} & \cdots & g_{KK}
\end{bmatrix}. \tag{13}
\end{align}
Therefore, for $g_{kk}$ we have:
\begin{align}
\mathbb{E}_x[g_{kk}] &= \mathbb{E}_x\!\left[ \frac{x_k^2}{\theta_k^2} \right] \nonumber\\
&= \frac{1}{\theta_k^2}\, \mathbb{E}_x[x_k^2] \nonumber\\
&= \frac{1}{\theta_k}, \tag{14}
\end{align}
since $x_k \in \{0, 1\}$ implies $\mathbb{E}_x[x_k^2] = \mathbb{E}_x[x_k] = \theta_k$. For $g_{ij}$, $i \neq j$:
\begin{align}
\mathbb{E}_x[g_{ij}] &= \mathbb{E}_x\!\left[ \frac{x_i x_j}{\theta_i \theta_j} \right] \nonumber\\
&= \frac{1}{\theta_i \theta_j}\, \mathbb{E}_x[x_i x_j] \nonumber\\
&= 0, \tag{15}
\end{align}
because in the 1-of-K encoding exactly one component of $x$ equals 1, so $x_i x_j = 0$ whenever $i \neq j$.
Finally, we get:
\[
F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}. \tag{16}
\]
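Since $x$ ranges over the $K$ one-hot vectors, the expectation of (13) is a $K$-term sum, and (16) can be checked exactly. A short sketch in Python with NumPy (the function name is ours):

```python
import numpy as np

# Exact check of F = diag(1/theta_1, ..., 1/theta_K) for the categorical
# distribution: sum the outer product of scores over all K one-hot vectors,
# each weighted by its probability theta_k.
def categorical_fisher(theta):
    K = len(theta)
    F = np.zeros((K, K))
    for k in range(K):
        x = np.eye(K)[k]           # k-th one-hot vector, probability theta_k
        g = x / theta              # Fisher score (12): g_k = x_k / theta_k
        F += theta[k] * np.outer(g, g)
    return F

theta = np.array([0.2, 0.3, 0.5])
assert np.allclose(categorical_fisher(theta), np.diag(1 / theta))
```

The off-diagonal entries vanish exactly, mirroring (15): each one-hot $x$ contributes a single diagonal entry $\theta_k \cdot 1/\theta_k^2 = 1/\theta_k$.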
2.4 Example 3: Normal distribution
Let us calculate the Fisher information matrix for the univariate normal distribution (1). First, we take the logarithm:
\[
\ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (x - \mu)^2. \tag{17}
\]
Second, we calculate the partial derivatives:
\begin{align}
\frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) &= \frac{1}{\sigma^2} (x - \mu), \tag{18}\\
\frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) &= -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2. \tag{19}
\end{align}
Hence, we get the following Fisher score for the normal distribution:
\[
g(\theta, x) = \begin{bmatrix} \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) \\[4pt] \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) \end{bmatrix}
= \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\[4pt] -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}. \tag{20}
\]
Now, let us calculate the product of the Fisher score and its transpose:
\begin{align}
&\begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\[4pt] -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) & -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix} \nonumber\\
&=
\begin{bmatrix}
\frac{1}{\sigma^4} (x - \mu)^2 & -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \\[4pt]
-\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4
\end{bmatrix} \tag{21}\\
&\equiv
\begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix}, \nonumber
\end{align}
where $g_{12} = g_{21}$. In order to calculate the Fisher information matrix we need to determine the expected value of all $g_{ij}$.² Hence, for $g_{11}$:
\begin{align}
\mathbb{E}_x[g_{11}] &= \mathbb{E}_x\!\left[ \frac{1}{\sigma^4} (x - \mu)^2 \right] \nonumber\\
&= \frac{1}{\sigma^4} \left( \mathbb{E}_x[x^2] - 2\mu\, \mathbb{E}_x[x] + \mu^2 \right) \nonumber\\
&= \frac{1}{\sigma^4} \left( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \right) \nonumber\\
&= \frac{1}{\sigma^2}, \tag{22}
\end{align}
and for $g_{12}$:
\begin{align}
\mathbb{E}_x[g_{12}] &= \mathbb{E}_x\!\left[ -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \right] \nonumber\\
&= -\frac{1}{2\sigma^4} \left( \mathbb{E}_x[x] - \mu \right) + \frac{1}{2\sigma^6}\, \mathbb{E}_x\!\left[ x^3 - 3x^2\mu + 3x\mu^2 - \mu^3 \right] \nonumber\\
&= \frac{1}{2\sigma^6} \left( \mathbb{E}_x[x^3] - 3\mu\, \mathbb{E}_x[x^2] + 3\mu^2\, \mathbb{E}_x[x] - \mu^3 \right) \nonumber\\
&= \frac{1}{2\sigma^6} \left( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \right) \nonumber\\
&= 0, \tag{23}
\end{align}
and for $g_{22}$:
\begin{align}
\mathbb{E}_x[g_{22}] &= \mathbb{E}_x\!\left[ \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4 \right] \nonumber\\
&= \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\, \mathbb{E}_x\!\left[ x^2 - 2x\mu + \mu^2 \right] + \frac{1}{4\sigma^8}\, \mathbb{E}_x\!\left[ x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4 \right] \nonumber\\
&= \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\, \sigma^2 + \frac{1}{4\sigma^8}\, 3\sigma^4 \nonumber\\
&= \frac{1}{2\sigma^4}. \tag{24}
\end{align}
Finally, we get:
\[
F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\[4pt] 0 & \frac{1}{2\sigma^4} \end{bmatrix}. \tag{25}
\]

²See Section 3 for the raw moments of the univariate normal distribution.
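Result (25) is easy to confirm by Monte Carlo: sample from $\mathcal{N}(x|\mu, \sigma^2)$, evaluate the score (20), and average its outer product. A short Python/NumPy sketch with arbitrary test values for $\mu$ and $\sigma^2$:

```python
import numpy as np

# Monte Carlo check of F = diag(1/sigma2, 1/(2*sigma2^2)); mu and sigma2
# are arbitrary test values, not from the text.
mu, sigma2 = 1.5, 0.8
rng = np.random.default_rng(0)
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Stack the two components of the Fisher score (20) into an (n, 2) array.
g = np.stack([(x - mu) / sigma2,
              -1 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)], axis=1)
F = g.T @ g / len(x)
# F should be close to [[1/sigma2, 0], [0, 1/(2*sigma2**2)]]
```

The off-diagonal entries hover near zero, matching (23).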
2.5 Summary
The Fisher information matrix for each distribution:

• Bernoulli distribution:
\[
F = \frac{1}{\theta(1 - \theta)},
\]
• Categorical distribution:
\[
F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\},
\]
• Normal distribution:
\[
F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\[4pt] 0 & \frac{1}{2\sigma^4} \end{bmatrix}.
\]

3 Appendix: Raw moments
Table 1: The raw moments of the univariate normal distribution.

Order | Expression | Raw moment
1 | $\mathbb{E}_x[x]$ | $\mu$
2 | $\mathbb{E}_x[x^2]$ | $\mu^2 + \sigma^2$
3 | $\mathbb{E}_x[x^3]$ | $\mu^3 + 3\mu\sigma^2$
4 | $\mathbb{E}_x[x^4]$ | $\mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$
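The formulas in Table 1 can be sanity-checked against Monte Carlo estimates; the sketch below uses arbitrary values of $\mu$ and $\sigma$ (Python/NumPy, our own test values):

```python
import numpy as np

# Monte Carlo check of the raw moments in Table 1.
mu, sigma = 0.7, 1.3
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=2_000_000)

expected = [mu,
            mu ** 2 + sigma ** 2,
            mu ** 3 + 3 * mu * sigma ** 2,
            mu ** 4 + 6 * mu ** 2 * sigma ** 2 + 3 * sigma ** 4]
estimates = [np.mean(x ** k) for k in range(1, 5)]
# each estimate should be close to the corresponding table entry
```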
References
[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.