Fisher for Gaussian and categorical distributions

Jakub M. Tomczak November 28, 2012

1 Notations

Let x be a random variable. Consider a parametric distribution of x with parameters θ, p(x|θ). A continuous random variable x ∈ R can be modelled by the normal (Gaussian) distribution:

    p(x|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} = \mathcal{N}(x|\mu, \sigma^2),    (1)

where θ = [µ σ²]ᵀ. A discrete (categorical) variable x ∈ X, where X is a finite set of K values, can be modelled by the categorical distribution:¹

    p(x|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} = \mathrm{Cat}(x|\theta),    (2)

where 0 ≤ θ_k ≤ 1 and \sum_k \theta_k = 1. For X = {0, 1} we get a special case of the categorical distribution, the Bernoulli distribution,

    p(x|\theta) = \theta^x (1 - \theta)^{1-x} = \mathrm{Bern}(x|\theta).    (3)

2 Fisher information matrix

2.1 Definition

The Fisher score is defined as follows [1]:

    g(\theta, x) = \nabla_\theta \ln p(x|\theta).    (4)

The Fisher information matrix is defined as follows [1]:

    F = \mathbb{E}_x\left[ g(\theta, x)\, g(\theta, x)^T \right].    (5)

¹ We use the 1-of-K encoding [1].

2.2 Example 1: Bernoulli distribution

Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we need to take the logarithm:

    \ln \mathrm{Bern}(x|\theta) = x \ln \theta + (1 - x) \ln(1 - \theta).    (6)

Second, we need to calculate the derivative:

    \frac{d}{d\theta} \ln \mathrm{Bern}(x|\theta) = \frac{x}{\theta} - \frac{1 - x}{1 - \theta} = \frac{x - \theta}{\theta(1 - \theta)}.    (7)

Hence, we get the following Fisher score for the Bernoulli distribution:

    g(\theta, x) = \frac{x - \theta}{\theta(1 - \theta)}.    (8)

The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is as follows:

    F = \mathbb{E}_x[ g(\theta, x)\, g(\theta, x) ]
      = \mathbb{E}_x\left[ \frac{(x - \theta)^2}{(\theta(1 - \theta))^2} \right]
      = \frac{1}{(\theta(1 - \theta))^2} \mathbb{E}_x\left[ x^2 - 2x\theta + \theta^2 \right]
      = \frac{1}{(\theta(1 - \theta))^2} \left( \mathbb{E}_x[x^2] - 2\theta\, \mathbb{E}_x[x] + \theta^2 \right)
      = \frac{1}{(\theta(1 - \theta))^2} \left( \theta - 2\theta^2 + \theta^2 \right)
      = \frac{\theta(1 - \theta)}{(\theta(1 - \theta))^2}
      = \frac{1}{\theta(1 - \theta)}.    (9)
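The closed form (9) can be checked numerically. The sketch below (a minimal illustration assuming NumPy; the value of θ, the sample size, and the seed are arbitrary choices) estimates F as the Monte Carlo average of the squared Fisher score (8):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)  # samples from Bern(theta)

# Fisher score g(theta, x) = (x - theta) / (theta (1 - theta)), Eq. (8)
g = (x - theta) / (theta * (1 - theta))

# Monte Carlo estimate of F = E[g^2] vs. the closed form 1 / (theta (1 - theta))
F_mc = np.mean(g**2)
F_closed = 1.0 / (theta * (1 - theta))
print(F_mc, F_closed)
```

The two printed values should agree up to Monte Carlo error.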

2.3 Example 2: Categorical distribution

Let us calculate the Fisher information matrix for the categorical distribution (2). First, we need to take the logarithm:

    \ln \mathrm{Cat}(x|\theta) = \sum_{k=1}^{K} x_k \ln \theta_k.    (10)

Second, we need to calculate the partial derivatives:

    \frac{\partial}{\partial \theta_k} \ln \mathrm{Cat}(x|\theta) = \frac{x_k}{\theta_k}.    (11)

Hence, we get the following Fisher score for the categorical distribution:

    g(\theta, x) = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}.    (12)

Now, let us calculate the product of the Fisher score and its transpose:

    g(\theta, x)\, g(\theta, x)^T
    = \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}
      \begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix}
    = \begin{bmatrix}
        \frac{x_1^2}{\theta_1^2} & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\
        \vdots & \vdots & \ddots & \vdots \\
        \frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \frac{x_K^2}{\theta_K^2}
      \end{bmatrix}
    = \begin{bmatrix}
        g_{11} & g_{12} & \cdots & g_{1K} \\
        \vdots & \vdots & \ddots & \vdots \\
        g_{K1} & g_{K2} & \cdots & g_{KK}
      \end{bmatrix}.    (13)

Therefore, for g_{kk} we have (note that in the 1-of-K encoding x_k ∈ {0, 1}, so x_k^2 = x_k and \mathbb{E}_x[x_k^2] = \theta_k):

    \mathbb{E}_x[g_{kk}] = \mathbb{E}_x\left[ \frac{x_k^2}{\theta_k^2} \right] = \frac{1}{\theta_k^2} \mathbb{E}_x[x_k^2] = \frac{1}{\theta_k},    (14)

and for g_{ij}, i ≠ j (exactly one component of x equals 1, hence x_i x_j = 0 whenever i ≠ j):

    \mathbb{E}_x[g_{ij}] = \mathbb{E}_x\left[ \frac{x_i x_j}{\theta_i \theta_j} \right] = \frac{1}{\theta_i \theta_j} \mathbb{E}_x[x_i x_j] = 0.    (15)

Finally, we get:

    F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\}.    (16)
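Result (16) can likewise be verified by simulation. The following sketch (assuming NumPy; the probability vector, sample size, and seed are arbitrary choices) averages the outer products of the score vectors (12) over 1-of-K samples:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])
n = 200_000

# Draw one-hot (1-of-K) samples x ~ Cat(theta); shape (n, K)
x = rng.multinomial(1, theta, size=n)

# Fisher score g(theta, x)_k = x_k / theta_k, Eq. (12); shape (n, K)
g = x / theta

# Monte Carlo estimate of F = E[g g^T] vs. diag(1/theta_1, ..., 1/theta_K)
F_mc = g.T @ g / n
F_closed = np.diag(1.0 / theta)
print(np.round(F_mc, 2))
print(F_closed)
```

The off-diagonal entries of the estimate are exactly zero, since x_i x_j = 0 for i ≠ j in every one-hot sample.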

2.4 Example 3: Normal distribution

Let us calculate the Fisher information matrix for the univariate normal distribution (1). First, we need to take the logarithm:

    \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (x - \mu)^2.    (17)

Second, we need to calculate the partial derivatives:

    \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sigma^2} (x - \mu),    (18)

    \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2.    (19)

Hence, we get the following Fisher score for the normal distribution:

    g(\theta, x)
    = \begin{bmatrix} \frac{\partial}{\partial \mu} \ln \mathcal{N}(x|\mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2} \ln \mathcal{N}(x|\mu, \sigma^2) \end{bmatrix}
    = \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}.    (20)

Now, let us calculate the product of the Fisher score and its transpose:

    g(\theta, x)\, g(\theta, x)^T
    = \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}
      \begin{bmatrix} \frac{1}{\sigma^2} (x - \mu) & -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4} (x - \mu)^2 \end{bmatrix}
    = \begin{bmatrix}
        \frac{1}{\sigma^4} (x - \mu)^2 & -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \\
        -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4
      \end{bmatrix}
    = \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix},    (21)

where g_{12} = g_{21}. In order to calculate the Fisher information matrix, we need to determine the expectation of all g_{ij}.² Hence, for g_{11}:

    \mathbb{E}_x[g_{11}] = \mathbb{E}_x\left[ \frac{1}{\sigma^4} (x - \mu)^2 \right]
    = \frac{1}{\sigma^4} \left( \mathbb{E}_x[x^2] - 2\mu\, \mathbb{E}_x[x] + \mu^2 \right)
    = \frac{1}{\sigma^4} \left( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \right)
    = \frac{1}{\sigma^2},    (22)

and for g_{12}:

    \mathbb{E}_x[g_{12}] = \mathbb{E}_x\left[ -\frac{1}{2\sigma^4} (x - \mu) + \frac{1}{2\sigma^6} (x - \mu)^3 \right]
    = -\frac{1}{2\sigma^4} \left( \mathbb{E}_x[x] - \mu \right) + \frac{1}{2\sigma^6} \mathbb{E}_x\left[ x^3 - 3x^2\mu + 3x\mu^2 - \mu^3 \right]
    = \frac{1}{2\sigma^6} \left( \mathbb{E}_x[x^3] - 3\mu\, \mathbb{E}_x[x^2] + 3\mu^2\, \mathbb{E}_x[x] - \mu^3 \right)
    = \frac{1}{2\sigma^6} \left( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \right)
    = 0,    (23)

and for g_{22}:

    \mathbb{E}_x[g_{22}] = \mathbb{E}_x\left[ \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} (x - \mu)^2 + \frac{1}{4\sigma^8} (x - \mu)^4 \right]
    = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \mathbb{E}_x\left[ x^2 - 2x\mu + \mu^2 \right] + \frac{1}{4\sigma^8} \mathbb{E}_x\left[ x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4 \right]
    = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} \sigma^2 + \frac{1}{4\sigma^8}\, 3\sigma^4
    = \frac{1}{2\sigma^4}.    (24)

Finally, we get:

    F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.    (25)

² See Section 3 for the raw moments of the univariate normal distribution.
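As a numerical sanity check of (25), the sketch below (assuming NumPy; µ, σ², sample size, and seed are arbitrary choices) averages the outer products of the score vectors (20) over Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.5, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=500_000)

# Fisher score, Eq. (20): gradients of ln N(x | mu, sigma^2) w.r.t. mu and sigma^2
g = np.stack([(x - mu) / sigma2,
              -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2**2)])

# Monte Carlo estimate of F = E[g g^T] vs. diag(1/sigma^2, 1/(2 sigma^4))
F_mc = g @ g.T / x.size
F_closed = np.diag([1.0 / sigma2, 1.0 / (2 * sigma2**2)])
print(np.round(F_mc, 3))
print(F_closed)
```

The off-diagonal estimates hover around zero, matching (23).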

2.5 Summary

The Fisher information matrix for each distribution considered above:

• Bernoulli distribution:

    F = \frac{1}{\theta(1 - \theta)},

• Categorical distribution:

    F = \mathrm{diag}\left\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \right\},

• Normal distribution:

    F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

    Order | Expression        | Raw moment
    ------|-------------------|---------------------------
    1     | \mathbb{E}_x[x]   | \mu
    2     | \mathbb{E}_x[x^2] | \mu^2 + \sigma^2
    3     | \mathbb{E}_x[x^3] | \mu^3 + 3\mu\sigma^2
    4     | \mathbb{E}_x[x^4] | \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
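The rows of Table 1 can be confirmed by simulation. The sketch below (assuming NumPy; µ, σ², sample size, and seed are arbitrary choices) compares sample averages of x^k with the closed-form raw moments:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.7, 1.3
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Closed-form raw moments from Table 1, orders k = 1, ..., 4
closed = [mu,
          mu**2 + sigma2,
          mu**3 + 3 * mu * sigma2,
          mu**4 + 6 * mu**2 * sigma2 + 3 * sigma2**2]

# Monte Carlo estimates of E_x[x^k]
mc = [np.mean(x**k) for k in range(1, 5)]
print(np.round(mc, 3))
print(np.round(closed, 3))
```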

References

[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
