
Information Geometry: A Geometry-free Introduction

Shane Gu, Nilesh Tripuraneni [email protected]

April 2, 2015

Outline

Maximum Entropy Principle

Exponential Families

Information Projection

Fisher Information

Reference

Motivating Example

Problem: model the distribution of English words (in a corpus, etc.)

Domain: S ={English words}

If no data is observed, a sensible default is the uniform distribution over S

If some data is observed and some simple statistics are collected, e.g. Pr(length > 5) = 0.3, Pr(ends in 'e') = 0.45, Pr(starts with 's') = 0.08, then what should the distribution be?

Intuition: choose a distribution that is as random as possible while still satisfying the above statistics

Entropy

Entropy: a measure of uncertainty, or "unpredictability of information content". Given X with distribution p(x), the entropy is H(X) = -\sum_x p(x) \log p(x). Nice properties:

- Expandability: if X has distribution {p_1, p_2, ..., p_n} and Y has distribution {p_1, p_2, ..., p_n, 0}, then H(X) = H(Y)
- Symmetry: e.g. {p, 1 - p} has the same entropy as {1 - p, p}
- Additivity: if X and Y are independent, H(X, Y) = H(X) + H(Y)
- Subadditivity: H(X, Y) ≤ H(X) + H(Y)
- Normalization: a fair coin has 1 bit of entropy
- "Small for small probability": the entropy of a coin with bias p goes to 0 as p goes to 0

H(X) is the only measure that satisfies the above six properties [Aczél et al., 1974]
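As a quick numerical sanity check of the additivity property (not part of the original slides), a minimal Python sketch with a hand-picked pair of independent distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

p_x = np.array([0.5, 0.25, 0.25])        # distribution of X
p_y = np.array([0.7, 0.3])               # distribution of Y
p_xy = np.outer(p_x, p_y).ravel()        # joint distribution of independent X, Y

print(entropy(p_x) + entropy(p_y))       # H(X) + H(Y)
print(entropy(p_xy))                     # H(X, Y): equal under independence
```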

Maximum Entropy Principle

Back to the problem: given features T_i(x), e.g. 1(length > 5), we collect statistics E[T_i(x)] = b_i.

\max_p H(p) = -\sum_x p(x) \log p(x)
subject to  \sum_x p(x) T_i(x) = b_i,  i = 1, ..., k
            p(x) ≥ 0,  x \in S
            \sum_x p(x) = 1

- A convex optimization problem with linear constraints!

- Finds a distribution that satisfies the statistics and is as random as possible, i.e. as close to uniform as possible (a numerical sketch follows below)

- What if our prior is not uniform?
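As an illustration (not from the original slides), a minimal sketch of solving a small MaxEnt problem numerically with scipy; the domain, feature, and target value are made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

# Toy domain: word lengths 1..8; one feature 1(length > 5) with target statistic 0.3.
lengths = np.arange(1, 9)
T = (lengths > 5).astype(float)      # feature T(x)
b = 0.3                              # constraint E[T(x)] = b

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return np.sum(p * np.log(p))     # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # normalization
    {"type": "eq", "fun": lambda p: p @ T - b},         # moment constraint
]
p0 = np.full(len(lengths), 1.0 / len(lengths))
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0.0, 1.0)] * len(lengths), constraints=constraints)
print(res.x)   # uniform within each feature level: 0.14 for lengths 1-5, 0.10 for 6-8
```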

KL

Kullback-Leibler divergence (relative entropy): a measure of dissimilarity between two distributions p and q.

KL(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = -H(p) - \sum_x p(x) \log q(x)

Properties:

- KL(p||q) ≥ 0, with equality iff p = q

- KL(p||q) can be infinite

- KL(p||q) is in general not equal to KL(q||p) (a quick numerical check follows below)
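A small numerical illustration (not from the slides) of these properties, on a hand-picked pair of distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for distributions on a finite set (convention: 0 log 0 = 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(kl(p, q), kl(q, p))   # both >= 0, but not equal: KL is asymmetric
print(kl(p, p))             # 0 when the arguments coincide
```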

MaxEnt (alternate formulation)

Given a prior π

\min_p KL(p||\pi) = \sum_x p(x) \log \frac{p(x)}{\pi(x)}
subject to  \sum_x p(x) T_i(x) = b_i,  i = 1, ..., k
            p(x) ≥ 0,  x \in S
            \sum_x p(x) = 1

- Finds a distribution that satisfies the statistics and is as close to the prior π as possible

MaxEnt: solution

Using Lagrange multipliers:

F(p, \lambda, \nu) = \sum_x p(x) \log \frac{p(x)}{\pi(x)} - \sum_i \lambda_i \Big( \sum_x p(x) T_i(x) - b_i \Big) - \nu \Big( \sum_x p(x) - 1 \Big)

\frac{\partial F}{\partial p(x)} = 1 + \log \frac{p(x)}{\pi(x)} - \sum_i \lambda_i T_i(x) - \nu = 0

Solution: p(x) = \frac{1}{Z} e^{\sum_i \lambda_i T_i(x)} \pi(x)   (the constraint p(x) ≥ 0, x ∈ S is automatically satisfied)

This is a member of the exponential family generated by π:
- input space: S ⊆ R^D
- base measure: π : R^D → R
- features: T_i(x), i = 1, ..., k

(A sketch of fitting the multipliers numerically follows below.)
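A minimal sketch (my illustration, reusing the toy word-length setup from above) of fitting the multipliers by minimizing the convex dual G(λ) - λ·b, where G is the log partition function:

```python
import numpy as np
from scipy.optimize import minimize

# Same toy problem: lengths 1..8, one feature 1(length > 5), target b = 0.3, uniform prior.
lengths = np.arange(1, 9)
T = (lengths > 5).astype(float)
b = 0.3
prior = np.full(len(lengths), 1.0 / len(lengths))

def dual(lam):
    G = np.log(np.sum(prior * np.exp(lam[0] * T)))   # log partition function
    return G - lam[0] * b                            # convex in lambda

lam = minimize(dual, x0=np.zeros(1)).x
p = prior * np.exp(lam[0] * T)
p /= p.sum()                                         # p(x) = e^{lambda T(x)} pi(x) / Z
print(p)        # matches the primal solution: 0.14 on lengths 1-5, 0.10 on 6-8
print(p @ T)    # the moment constraint holds: 0.3
```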

Exponential families

Given:
- input space: S ⊆ R^D
- base measure: π : R^D → R
- features: T(x) = (T_1(x), ..., T_k(x))

The exponential family generated by T and π consists of the log-linear models parametrized by natural parameters η ∈ R^k:

p_\eta(x) \propto e^{\eta \cdot T(x)} \pi(x)
p_\eta(x) = e^{\eta \cdot T(x) - G(\eta)} \pi(x)

Log partition function: G(\eta) = \log \sum_x e^{\eta \cdot T(x)} \pi(x)

Natural parameter space: N = \{ \eta \in R^k : -\infty < G(\eta) < \infty \}
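For concreteness (my example, not from the slides), the Bernoulli distribution written in this form, with T(x) = x and counting base measure π ≡ 1:

```python
import numpy as np

xs = np.array([0.0, 1.0])     # support S
T = xs                        # feature T(x) = x
pi = np.ones_like(xs)         # base measure

def G(eta):
    """Log partition function: here log(1 + e^eta)."""
    return np.log(np.sum(pi * np.exp(eta * T)))

eta = 0.8
p = np.exp(eta * T - G(eta)) * pi
print(p.sum())                          # 1.0: properly normalized
print(p[1], 1 / (1 + np.exp(-eta)))     # P(x = 1) is the logistic function of eta
```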

Properties of exponential families

Recall the log partition function: G(\eta) = \log \sum_x e^{\eta \cdot T(x)} \pi(x)

- G'(\eta) = \frac{\sum_x e^{\eta \cdot T(x)} \pi(x) T(x)}{\sum_x e^{\eta \cdot T(x)} \pi(x)} = E_\eta[T(x)]
- G''(\eta) = Var_\eta[T(x)] > 0, etc.
- G(\eta) is strictly convex → G'(\eta) is one-to-one
- η ↔ E_\eta[T(x)] is a one-to-one mapping (natural parameters ↔ expectation parameters)

Maximum likelihood estimation:

- observed data: x_i, i = 1, ..., m
- L = \frac{1}{m} \sum_i \log p_\eta(x_i) = \frac{1}{m} \sum_i (\eta \cdot T(x_i) - G(\eta) + \log \pi(x_i))
- \frac{\partial L}{\partial \eta} = \frac{1}{m} \sum_i T(x_i) - G'(\eta) = E_{x \sim p_{data}}[T(x)] - E_{x \sim p_\eta}[T(x)]
- set to zero: G'(\eta) = \frac{1}{m} \sum_i T(x_i)

- MLE finds η such that E_{x \sim p_\eta}[T(x)] = E_{x \sim p_{data}}[T(x)], i.e. moment matching (a numerical sketch follows below)
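A minimal sketch of moment matching for the Bernoulli family from the previous example (my illustration); the MLE of the natural parameter is the logit of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = (rng.random(1000) < 0.7).astype(float)   # synthetic coin flips with bias 0.7

# Moment matching: choose eta so that E_eta[T(x)] = sigmoid(eta) equals the sample mean.
mean = data.mean()
eta_mle = np.log(mean / (1 - mean))             # logit of the sample mean

print(mean)                                     # approx 0.7
print(1 / (1 + np.exp(-eta_mle)))               # E_eta[T(x)] matches the sample mean exactly
```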

Example: Univariate Gaussian

p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}

- p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}} e^{(x, x^2) \cdot (\mu/\sigma^2, -1/(2\sigma^2)) - (\mu^2/(2\sigma^2) + \log \sigma)}
- S = R, base measure \pi(x) = 1/\sqrt{2\pi}
- T(x) = (x, x^2),  \eta = (\mu/\sigma^2, -1/(2\sigma^2))
- G(\eta) = \mu^2/(2\sigma^2) + \log \sigma = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2} \log \Big| \frac{1}{2\eta_2} \Big|
- \eta \in R \times R_-
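A quick check of this reparametrization in code (my sketch; scipy is only used for the reference density):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8

# Natural parameters and log partition function of the univariate Gaussian.
eta1, eta2 = mu / sigma**2, -1.0 / (2 * sigma**2)
G = -eta1**2 / (4 * eta2) + 0.5 * np.log(np.abs(1.0 / (2 * eta2)))

x = 0.3
base = 1.0 / np.sqrt(2 * np.pi)                         # base measure pi(x)
p_exp_fam = np.exp(eta1 * x + eta2 * x**2 - G) * base   # exponential-family form
print(p_exp_fam, norm.pdf(x, loc=mu, scale=sigma))      # identical values
```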

Example: RBM

Restricted Boltzmann Machine, more generally, Exponential Family Harmoniums [Welling et al., 2004]

- observed variables X = {x_i; i = 1, ..., n_x}, latent variables V = {v_j; j = 1, ..., n_v}

- p(X, V) \propto e^{\sum_i b_{x,i} \cdot f_i(x_i) + \sum_j b_{v,j} \cdot g_j(v_j) + \sum_{i,j} f_i(x_i)^T W_{ij} g_j(v_j)}

- p(X|V) and p(V|X) factorize over p(x_i|V) and p(v_j|X) respectively

- p(X) \propto e^{\sum_i b_{x,i} \cdot f_i(x_i) + \sum_j G_{v,j}(b_{v,j} + \sum_i f_i(x_i)^T W_{ij})}, where G_{v,j} is the log partition function of latent unit j

- \eta = \{\{b_{x,i}\}, \{b_{v,j}\}, \{W_{ij}\}\}

Contrastive Divergence:

- \frac{\partial L}{\partial b_{x,i}} \propto \langle f_i(x_i) \rangle_{p_{data}} - \langle f_i(x_i) \rangle_{p}

- CD is gradient ascent in the natural parameter space w.r.t. the MLE objective (a binary-RBM sketch follows below)
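A minimal binary-RBM CD-1 sketch (my illustration with f_i(x_i) = x_i, g_j(v_j) = v_j and made-up sizes and data), showing the positive-phase minus negative-phase update:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_v = 6, 3                                 # visible and hidden sizes (arbitrary)
W = 0.01 * rng.standard_normal((n_x, n_v))
b_x, b_v = np.zeros(n_x), np.zeros(n_v)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_v(X):
    """p(v_j = 1 | X) factorizes over hidden units."""
    probs = sigmoid(b_v + X @ W)
    return probs, (rng.random(probs.shape) < probs).astype(float)

def sample_x(V):
    """p(x_i = 1 | V) factorizes over visible units."""
    probs = sigmoid(b_x + V @ W.T)
    return probs, (rng.random(probs.shape) < probs).astype(float)

def cd1_step(X_batch, lr=0.1):
    """One CD-1 update: <x v>_data - <x v>_reconstruction."""
    global W, b_x, b_v
    pv0, v0 = sample_v(X_batch)        # positive phase
    _, x1 = sample_x(v0)               # one Gibbs step down ...
    pv1, _ = sample_v(x1)              # ... and back up (negative phase)
    W += lr * (X_batch.T @ pv0 - x1.T @ pv1) / len(X_batch)
    b_x += lr * (X_batch - x1).mean(axis=0)
    b_v += lr * (pv0 - pv1).mean(axis=0)

data = (rng.random((100, n_x)) < 0.5).astype(float)   # toy binary data
for _ in range(50):
    cd1_step(data)
```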

MaxEnt restatement

Given:
- input space: S ⊆ R^D
- base measure: π : R^D → R
- features: T(x) = (T_1(x), ..., T_k(x))
- constraints: E[T(x)] = b

If there is a distribution of the form p*(x) = e^{\eta \cdot T(x) - G(\eta)} \pi(x) satisfying the constraints, it is the unique minimizer of KL(p, \pi) subject to these constraints.

Proof

Consider any other distribution p satisfying E_p[T(x)] = b:

KL(p, \pi) - KL(p*, \pi)
  = \sum_x p(x) \log \frac{p(x)}{\pi(x)} - \sum_x p*(x) \log \frac{p*(x)}{\pi(x)}
  = \sum_x p(x) \log \frac{p(x)}{\pi(x)} - \sum_x p*(x) (\eta \cdot T(x) - G(\eta))
  = \sum_x p(x) \log \frac{p(x)}{\pi(x)} - \sum_x p(x) (\eta \cdot T(x) - G(\eta))     [both p and p* have the same expectation b]
  = \sum_x p(x) \log \frac{p(x)}{\pi(x)} - \sum_x p(x) \log \frac{p*(x)}{\pi(x)}
  = \sum_x p(x) \log \frac{p(x)}{p*(x)} = KL(p, p*) > 0

Hence KL(p, \pi) = KL(p, p*) + KL(p*, \pi), a Pythagorean identity for KL (a numerical check follows below).
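A numerical check of the Pythagorean identity on the toy word-length MaxEnt problem from earlier (my illustration; p is any hand-picked distribution satisfying the same constraint):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

prior = np.full(8, 1.0 / 8)                      # uniform prior pi
T = (np.arange(1, 9) > 5).astype(float)          # feature 1(length > 5)
p_star = np.where(T == 1, 0.10, 0.14)            # the MaxEnt solution found earlier

p = np.array([0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10])   # another feasible p
print(p @ T)                                                      # 0.3: constraint holds
print(kl(p, prior), kl(p, p_star) + kl(p_star, prior))            # equal: Pythagorean identity
```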

Geometry

[Figure from http://videolectures.net/mlss05us_dasgupta_ig/]

More geometry

[Figure from http://videolectures.net/mlss05us_dasgupta_ig/]

Metric

Given p_\theta(x), the Fisher information matrix I_\theta is defined as:

I_\theta := E_\theta[\nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^T] = E_\theta[\dot\ell_\theta \dot\ell_\theta^T]

where the Fisher score function is \dot\ell_\theta = \nabla_\theta \log p_\theta(x), with E_\theta[\dot\ell_\theta] = 0.

Relation to the Hessian of \log p_\theta(x):

\nabla^2 \log p_\theta(x) = \frac{\nabla^2 p_\theta(x)}{p_\theta(x)} - \frac{\nabla p_\theta(x) \nabla p_\theta(x)^T}{p_\theta(x)^2} = \frac{\nabla^2 p_\theta(x)}{p_\theta(x)} - \dot\ell_\theta \dot\ell_\theta^T

I_\theta = -\int p_\theta(x) \nabla^2 \log p_\theta(x) dx + \int \nabla^2 p_\theta(x) dx = -E_\theta[\nabla^2 \log p_\theta(x)]

(the second integral vanishes because \int p_\theta(x) dx = 1).

Relation to the KL divergence:

I_\theta = \nabla^2_{\theta'} KL(\theta, \theta')|_{\theta'=\theta} = \nabla^2_{\theta'} KL(\theta', \theta)|_{\theta'=\theta}

Fisher information depends on the parametrization θ! (A numerical check of the equivalent forms follows below.)
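A quick numerical check (my example) that the score-outer-product and negative-Hessian forms agree, using a Bernoulli model with θ = p:

```python
import numpy as np

p = 0.3
xs = np.array([0.0, 1.0])
probs = np.array([1 - p, p])

# log p_theta(x) = x log(p) + (1 - x) log(1 - p)
score = xs / p - (1 - xs) / (1 - p)             # first derivative in p
hess = -xs / p**2 - (1 - xs) / (1 - p)**2       # second derivative in p

I_outer = np.sum(probs * score**2)              # E[score^2]
I_hess = -np.sum(probs * hess)                  # -E[Hessian]
print(I_outer, I_hess, 1.0 / (p * (1 - p)))     # all equal 1 / (p(1 - p))
```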

Fisher Information and KL

For any appropriately smooth family of distributions \{p_\theta\}_{\theta \in \Theta}:

KL(p_{\theta_1}, p_{\theta_2}) \approx \frac{1}{2} \langle \theta_1 - \theta_2, I_{\theta_1} (\theta_1 - \theta_2) \rangle   if \theta_1 \approx \theta_2

Proof: by Taylor expansion of \log p_{\theta_2}(x) around \theta_1:

\log p_{\theta_2}(x) = \log p_{\theta_1}(x) + \langle \nabla \log p_{\theta_1}(x), \theta_2 - \theta_1 \rangle + \frac{1}{2} \langle \theta_2 - \theta_1, \nabla^2 \log p_{\theta_1}(x) (\theta_2 - \theta_1) \rangle + R(\theta_1, \theta_2, x)

where R(\theta_1, \theta_2, x) = O(\|\theta_1 - \theta_2\|^3) is the remainder term. Taking expectations under p_{\theta_1} (the linear term vanishes since E_{\theta_1}[\nabla \log p_{\theta_1}(x)] = 0):

E_{\theta_1}[\log p_{\theta_2}(x)] \approx E_{\theta_1}[\log p_{\theta_1}(x)] - \frac{1}{2} \langle \theta_1 - \theta_2, I_{\theta_1} (\theta_1 - \theta_2) \rangle

KL(p_{\theta_1}, p_{\theta_2}) \approx \frac{1}{2} (\theta_1 - \theta_2)^T I_{\theta_1} (\theta_1 - \theta_2)

- in the local reparametrization \eta - \eta_0 = I_\theta^{1/2} (\theta - \theta_0):
- KL(p_\eta, p_{\eta_0}) \approx KL(p_{\eta_0}, p_\eta) \approx \frac{1}{2} \|\eta - \eta_0\|^2
- local properties no longer depend on the parametrization (a numerical check follows below)
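A one-parameter numerical check of the quadratic approximation (my example), again with a Bernoulli model:

```python
import numpy as np

def kl_bern(p, q):
    """KL between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta1, delta = 0.3, 1e-2
theta2 = theta1 + delta
I = 1.0 / (theta1 * (1 - theta1))     # Fisher information of Bernoulli(theta1)

print(kl_bern(theta1, theta2))        # exact KL
print(0.5 * I * delta**2)             # 0.5 * delta^T I delta: nearly identical
```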

Natural gradient

Gradient:

- the gradient = the direction with the highest increase in the objective per step length

- \arg\max_{\delta\theta : \|\delta\theta\|_2 = \epsilon} f(\theta + \delta\theta) = \epsilon \frac{\nabla f(\theta)}{\|\nabla f(\theta)\|_2}   (to first order)

- the state after the gradient step depends on the parametrization

Natural gradient:

- the gradient = the direction with the highest increase in the objective per change in KL

- \arg\max_{\delta\theta : KL(\theta, \theta + \delta\theta) = \epsilon} f(\theta + \delta\theta) = \arg\max_{\delta\theta : \frac{1}{2} \delta\theta^T I_\theta \delta\theta = \epsilon} f(\theta) + \nabla_\theta f(\theta)^T \delta\theta \propto I_\theta^{-1} \nabla_\theta f(\theta)

- does not depend on the parametrization (a one-dimensional sketch follows below)
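A one-dimensional sketch (my illustration) comparing a plain gradient step with a natural gradient step for a Bernoulli model with mean parameter θ, fit by maximum likelihood to data with empirical mean 0.9:

```python
import numpy as np

target = 0.9   # empirical mean of the (hypothetical) data

def grad_loglik(theta):
    # d/dtheta of the average log-likelihood: target/theta - (1 - target)/(1 - theta)
    return target / theta - (1 - target) / (1 - theta)

theta_g, theta_ng, lr = 0.1, 0.1, 0.01
for _ in range(100):
    theta_g = np.clip(theta_g + lr * grad_loglik(theta_g), 1e-6, 1 - 1e-6)       # plain gradient

    I = 1.0 / (theta_ng * (1 - theta_ng))                                        # Fisher information
    theta_ng = np.clip(theta_ng + lr * grad_loglik(theta_ng) / I, 1e-6, 1 - 1e-6)

# Both approach 0.9; the natural-gradient update simplifies to lr * (target - theta),
# reflecting the parametrization-invariance of the natural gradient.
print(theta_g, theta_ng)
```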

I, KL, and exponential families

Given p_\eta(x) = e^{\eta \cdot T(x) - G(\eta)} \pi(x):

\dot\ell_\eta = \nabla_\eta (\eta \cdot T(x) - G(\eta) + \log \pi(x)) = T(x) - G'(\eta) = T(x) - E_\eta[T(x)]

I_\eta = Var_\eta[T(x)] = G''(\eta)

Let \theta = E_\eta[T(x)] (the expectation parameters); then I_\eta = \frac{\partial \theta}{\partial \eta}:

\frac{\partial \theta_i}{\partial \eta_j} = \sum_x \frac{\partial p_\eta(x)}{\partial \eta_j} T_i(x)
                                          = \sum_x p_\eta(x) (T_j(x) - E_\eta[T_j(x)]) T_i(x)
                                          = E_\eta[T_i(x) T_j(x)] - E_\eta[T_i(x)] E_\eta[T_j(x)]
                                          = I_{\eta, ij}

The natural gradients for \eta and \theta are respectively:

g(\eta) = I_\eta^{-1} \frac{\partial f}{\partial \eta} = I_\eta^{-1} \frac{\partial \theta}{\partial \eta} \frac{\partial f}{\partial \theta} = \frac{\partial f}{\partial \theta},   and similarly   g(\theta) = \frac{\partial f}{\partial \eta}

→ the ordinary gradient in one parameter space gives the natural gradient in the other [Hensman et al., 2012] (a numerical check follows below)
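A numerical check of I_\eta = \partial\theta/\partial\eta for the Bernoulli family in natural parametrization (my example):

```python
import numpy as np

# Bernoulli: T(x) = x, G(eta) = log(1 + e^eta), expectation parameter theta = sigmoid(eta).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, eps = 0.4, 1e-6
theta = sigmoid(eta)

I_eta = theta * (1 - theta)                                          # Var[T(x)] = G''(eta)
dtheta_deta = (sigmoid(eta + eps) - sigmoid(eta - eps)) / (2 * eps)  # finite difference
print(I_eta, dtheta_deta)                                            # equal: d(theta)/d(eta) = I_eta
```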

Reference (1)

http://videolectures.net/mlss05us_dasgupta_ig/
https://hips.seas.harvard.edu/blog/2013/04/08/fisher-information/
https://web.stanford.edu/class/stats311/Lectures/lec-09.pdf
http://www.cs.berkeley.edu/~pabbeel/cs287-fa09/lecture-notes/lecture20-6pp.pdf

Aczél, J., Forte, B., and Ng, C. (1974). Why the Shannon and Hartley entropies are 'natural'. Advances in Applied Probability, pages 131–146.

Hensman, J., Rattray, M., and Lawrence, N. D. (2012). Fast variational inference in the conjugate exponential family. In Advances in Neural Information Processing Systems, pages 2888–2896.

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2004). Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, pages 1481–1488.
