The Exponential Family of Distributions

$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$\theta$: vector of parameters
$T(x)$: vector of "sufficient statistics"
$A(\theta)$: cumulant generating function
$h(x)$: base measure (a function of $x$ only)

Key point: $x$ and $\theta$ only "mix" in $e^{\theta^\top T(x)}$.

To get a normalized distribution, for any $\theta$,

$$\int p(x)\, dx = e^{-A(\theta)} \int h(x)\, e^{\theta^\top T(x)}\, dx = 1,$$

so

$$e^{A(\theta)} = \int h(x)\, e^{\theta^\top T(x)}\, dx,$$

i.e., when $T(x) = x$, $A(\theta)$ is the log of the Laplace transform of $h(x)$.
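The normalization identity can be checked numerically. The sketch below assumes the Poisson distribution written as $h(x) = 1/x!$, $T(x) = x$, $\theta = \log\lambda$, $A(\theta) = e^\theta$. That parameterization is standard but is not derived in these notes, and because $x$ is discrete the integral becomes a sum.

```python
import math

def log_partition_by_sum(theta, terms=200):
    """Approximate A(theta) = log sum_x h(x) exp(theta * T(x)).

    Assumed Poisson form: h(x) = 1/x!, T(x) = x, x = 0, 1, 2, ...
    The infinite sum is truncated after `terms` terms.
    """
    total = 0.0
    term = 1.0  # x = 0 term: exp(0) / 0! = 1
    for x in range(terms):
        total += term
        # Next term: exp(theta * (x + 1)) / (x + 1)!
        term *= math.exp(theta) / (x + 1)
    return math.log(total)

lam = 3.5
theta = math.log(lam)                      # natural parameter
a_numeric = log_partition_by_sum(theta)    # log of the normalizing sum
a_closed = math.exp(theta)                 # assumed closed form A(theta) = e^theta
print(a_numeric, a_closed)
```

The two printed values should agree to within floating-point error, which is the statement $e^{A(\theta)} = \sum_x h(x)\, e^{\theta x}$ for this case.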

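Examples

Gaussian: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\|x-\mu\|^2/(2\sigma^2)}$, $x \in \mathbb{R}$
Bernoulli: $p(x) = \alpha^x (1-\alpha)^{1-x}$, $x \in \{0, 1\}$
Binomial: $p(x) = \binom{n}{x}\, \alpha^x (1-\alpha)^{n-x}$, $x \in \{0, 1, 2, \ldots, n\}$
Multinomial: $p(x) = \frac{n!}{x_1!\, x_2! \cdots x_n!} \prod_{i=1}^{n} \alpha_i^{x_i}$, $x_i \in \{0, 1, 2, \ldots, n\}$, $\sum_i x_i = n$
Exponential: $p(x) = \lambda\, e^{-\lambda x}$, $x \in \mathbb{R}^+$
Poisson: $p(x) = \frac{e^{-\lambda}}{x!}\, \lambda^x$, $x \in \{0, 1, 2, \ldots\}$
Dirichlet: $p(x) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$, $x_i \in [0, 1]$, $\sum_i x_i = 1$

(don't need to memorize these except for Gaussian)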

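Natural Parameter Form for Bernoulli

$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$$\begin{aligned}
p(x) &= \alpha^x (1-\alpha)^{1-x} \\
&= \exp\left[\log\left(\alpha^x (1-\alpha)^{1-x}\right)\right] \\
&= \exp\left[x \log\alpha + (1-x)\log(1-\alpha)\right] \\
&= \exp\left[x \log\frac{\alpha}{1-\alpha} + \log(1-\alpha)\right] \\
&= \exp\left[x\theta - \log\left(1 + e^\theta\right)\right]
\end{aligned}$$

so

$$T(x) = x, \qquad \theta = \log\frac{\alpha}{1-\alpha}, \qquad A(\theta) = \log\left(1 + e^\theta\right).$$

Natural Parameter Form for Gaussian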

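$$\begin{aligned}
p(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)} \\
&= \frac{1}{\sqrt{2\pi}} \exp\left(-\log\sigma - \frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right) \\
&= \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)} \exp\Big(\theta^\top T(x) - \underbrace{\left(\log\sigma + \mu^2/(2\sigma^2)\right)}_{A(\theta)}\Big)
\end{aligned}$$

where

$$T(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix},$$

$$A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma = -\frac{[\theta]_1^2}{4[\theta]_2} - \frac{1}{2}\log\left(-2[\theta]_2\right).$$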
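As a sanity check on the Gaussian natural-parameter form, the sketch below evaluates $h(x)\, e^{\theta^\top T(x) - A(\theta)}$ and compares it with the usual density formula. The function names are illustrative only, and the test values of $\mu$ and $\sigma$ are arbitrary.

```python
import math

def gaussian_natural_params(mu, sigma):
    """Map (mu, sigma) to theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return (mu / sigma**2, -1.0 / (2.0 * sigma**2))

def log_partition(theta):
    """A(theta) = -theta1^2 / (4 theta2) - 0.5 * log(-2 theta2)."""
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) - 0.5 * math.log(-2.0 * t2)

def density_exp_family(x, theta):
    """h(x) exp(theta^T T(x) - A(theta)), h(x) = 1/sqrt(2 pi), T(x) = (x, x^2)."""
    t1, t2 = theta
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(t1 * x + t2 * x**2 - log_partition(theta))

def density_standard(x, mu, sigma):
    """Usual Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu)**2 / (2.0 * sigma**2)) / math.sqrt(2.0 * math.pi * sigma**2)

mu, sigma = 1.3, 0.7
theta = gaussian_natural_params(mu, sigma)
points = (-1.0, 0.0, 2.5)
for x in points:
    print(x, density_exp_family(x, theta), density_standard(x, mu, sigma))
```

Both columns should match at every point, and `log_partition(theta)` should equal $\mu^2/(2\sigma^2) + \log\sigma$.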

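Natural Parameter Form for Multivariate Gaussian

$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$$p(x) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}}\, e^{-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2}$$

$$h(x) = (2\pi)^{-D/2}, \qquad T(x) = \begin{bmatrix} x \\ x x^\top \end{bmatrix}, \qquad \theta = \begin{bmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\Sigma^{-1} \end{bmatrix}$$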
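A small 2-D check of this form is sketched below in plain Python. It assumes the log-partition function $A(\theta) = \frac{1}{2}\mu^\top\Sigma^{-1}\mu + \frac{1}{2}\log|\Sigma|$, which is the standard result but is not stated on the slide, and it reads the inner product with the matrix block of $\theta$ as $\sum_{ij} [\theta_2]_{ij}\, x_i x_j$. All helper names are hypothetical.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def log_density_standard(x, mu, cov):
    """log of the usual multivariate Gaussian density for D = 2."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    quad = dot(d, matvec(inv2(cov), d))
    return -math.log(2 * math.pi) - 0.5 * math.log(det2(cov)) - 0.5 * quad

def log_density_natural(x, mu, cov):
    """log h(x) + theta^T T(x) - A(theta) for D = 2."""
    prec = inv2(cov)
    theta1 = matvec(prec, mu)                           # Sigma^{-1} mu
    theta2 = [[-0.5 * p for p in row] for row in prec]  # -Sigma^{-1} / 2
    inner = dot(theta1, x) + sum(
        theta2[i][j] * x[i] * x[j] for i in range(2) for j in range(2)
    )
    # Assumed standard form of A(theta), written in terms of mu and Sigma.
    a = 0.5 * dot(mu, theta1) + 0.5 * math.log(det2(cov))
    log_h = -math.log(2 * math.pi)                      # (2 pi)^{-D/2} with D = 2
    return log_h + inner - a

mu = [1.0, -2.0]
cov = [[2.0, 0.3], [0.3, 0.5]]
x = [0.4, -1.1]
lp_std = log_density_standard(x, mu, cov)
lp_nat = log_density_natural(x, mu, cov)
print(lp_std, lp_nat)
```

The two log-densities should agree, since expanding the quadratic form $(x-\mu)^\top\Sigma^{-1}(x-\mu)$ gives exactly the terms collected in $\theta^\top T(x)$ and $A(\theta)$.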

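The first derivative of A(θ)

$$A(\theta) = \log \underbrace{\left[\int h(x)\, e^{\theta^\top T(x)}\, dx\right]}_{Q(\theta)}$$

$$\begin{aligned}
\frac{dA(\theta)}{d\theta} &= \frac{1}{Q(\theta)} \frac{dQ(\theta)}{d\theta} = \frac{Q'(\theta)}{Q(\theta)} \\
&= \frac{\int h(x)\, e^{\theta^\top T(x)}\, T(x)\, dx}{\int h(x)\, e^{\theta^\top T(x)}\, dx} \\
&= \frac{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, T(x)\, dx}{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, dx} \\
&= E_{p_\theta}[T(x)].
\end{aligned}$$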
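The identity $A'(\theta) = E_{p_\theta}[T(x)]$ can be illustrated with the Bernoulli case, where $A(\theta) = \log(1 + e^\theta)$ and $E[T(x)] = E[x] = \alpha = e^\theta/(1 + e^\theta)$. The sketch below compares a central finite difference of $A$ with that expectation; the step size `eps` is an arbitrary choice.

```python
import math

def A(theta):
    """Bernoulli log-partition function A(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def expected_T(theta):
    """E[T(x)] = E[x] = alpha, with alpha = e^theta / (1 + e^theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.8
eps = 1e-6
# Central finite-difference approximation of dA/dtheta.
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
print(dA, expected_T(theta))
```

The two numbers should agree up to the finite-difference error.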

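The second derivative of A(θ)

$$A(\theta) = \log \underbrace{\left[\int h(x)\, e^{\theta^\top T(x)}\, dx\right]}_{Q(\theta)}$$

$$\begin{aligned}
\frac{d^2 A(\theta)}{d\theta^2} &= \frac{d}{d\theta}\left[\frac{Q'(\theta)}{Q(\theta)}\right] = \frac{d}{d\theta}\left[\frac{1}{Q(\theta)}\, Q'(\theta)\right] = \frac{Q''(\theta)}{Q(\theta)} - \frac{(Q'(\theta))^2}{(Q(\theta))^2} \\
&= \frac{\int h(x)\, e^{\theta^\top T(x)}\, T^2(x)\, dx}{\int h(x)\, e^{\theta^\top T(x)}\, dx} - \left(E_{p_\theta}[T(x)]\right)^2 \\
&= \frac{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, T^2(x)\, dx}{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, dx} - \left(E_{p_\theta}[T(x)]\right)^2 \\
&= E_{p_\theta}\left[T^2(x)\right] - \left(E_{p_\theta}[T(x)]\right)^2 = \mathrm{Cov}_{p_\theta}[T(x)] \succeq 0
\end{aligned}$$

(For vector-valued $T$, read $T^2(x)$ as $T(x)\, T(x)^\top$.)

$\Rightarrow$ $A(\theta)$ is convex. ($\succeq 0$ means positive semidefinite.)

Maximum Likelihood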

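$$\ell(\theta) = \log p(x \mid \theta) = \sum_{i=1}^{N} \log h(x_i) + \sum_{i=1}^{N} \left[\theta^\top T(x_i) - A(\theta)\right]$$

To find the maximum likelihood solution, set the derivative to zero:

$$\ell'(\theta) = \left[\sum_{i=1}^{N} T(x_i)\right] - N A'(\theta) = 0$$

So the ML solution satisfies

$$A'(\hat\theta_{ML}) = \frac{1}{N} \sum_{i=1}^{N} T(x_i)$$

(Is $\hat\theta_{ML}$ a consistent estimator then?)

The sufficient statistics $\frac{1}{N}\sum_{i=1}^{N} T(x_i)$ summarize the data. When this equation cannot be solved analytically: convexity of $A(\theta)$ makes $\ell(\theta)$ concave, so any local maximum is the global ML solution for $\theta$ (unique when $A$ is strictly convex).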
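When the ML condition has no closed form, it can be solved iteratively. The sketch below uses Newton's method on the Bernoulli case as an illustration, with $A'(\theta) = \sigma(\theta)$ and $A''(\theta) = \sigma(\theta)(1 - \sigma(\theta))$, where $\sigma$ is the logistic function. Newton's method is one possible solver chosen here for illustration, not something the notes prescribe, and the data are made up.

```python
import math

def fit_bernoulli_natural(xs, iters=50, tol=1e-12):
    """Solve A'(theta) = mean(T(x)) by Newton's method (Bernoulli case).

    Assumes the sample mean lies strictly between 0 and 1; otherwise
    the ML solution for theta is at +/- infinity.
    """
    t_bar = sum(xs) / len(xs)               # average sufficient statistic
    theta = 0.0
    for _ in range(iters):
        s = 1.0 / (1.0 + math.exp(-theta))  # A'(theta) = E[T(x)]
        curvature = s * (1.0 - s)           # A''(theta) = Var[T(x)] > 0
        step = (t_bar - s) / curvature
        theta += step
        if abs(step) < tol:
            break
    return theta

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]       # hypothetical sample, 7 ones out of 10
theta_hat = fit_bernoulli_natural(data)
alpha_hat = 1.0 / (1.0 + math.exp(-theta_hat))
print(theta_hat, alpha_hat)
```

Because $A''(\theta)$ is the variance of $T(x)$ and therefore positive here, each Newton step is well defined. For sample means very close to 0 or 1 a damped step may be needed.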

Products

Products of E-family distributions are E-family distributions

$$\left(h(x)\, e^{\theta_1^\top T(x) - A(\theta_1)}\right) \times \left(h(x)\, e^{\theta_2^\top T(x) - A(\theta_2)}\right) = \tilde h(x)\, e^{(\theta_1 + \theta_2)^\top T(x) - \tilde A(\theta_1, \theta_2)}$$

but might not have a nice parametric form any more.

But the product of two Gaussians is always a Gaussian (up to normalization).
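For one-dimensional Gaussians this can be seen directly in natural parameters: with $\theta = (\mu/\sigma^2,\, -1/(2\sigma^2))$, multiplying two densities adds their $\theta$ vectors. A minimal sketch follows; the function names and the example means and variances are illustrative.

```python
def to_natural(mu, var):
    """(mu, sigma^2) -> theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return (mu / var, -0.5 / var)

def from_natural(theta):
    """theta -> (mu, sigma^2), inverting to_natural."""
    t1, t2 = theta
    var = -0.5 / t2
    return (t1 * var, var)

def product(theta_a, theta_b):
    """Natural parameters of the (renormalized) product of two Gaussians."""
    return (theta_a[0] + theta_b[0], theta_a[1] + theta_b[1])

# N(0, 4) times N(3, 1): precisions add, means are precision-weighted.
mu, var = from_natural(product(to_natural(0.0, 4.0), to_natural(3.0, 1.0)))
print(mu, var)
```

Here the precisions $1/4$ and $1$ add to $1.25$, giving variance $0.8$, and the mean is the precision-weighted average $(0 \cdot 0.25 + 3 \cdot 1)/1.25 = 2.4$.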

Conjugate Priors in Bayesian Statistics

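$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}$$

Note: the denominator is not a function of $\theta$ $\Rightarrow$ it is just a normalizing term.

$$\underbrace{p(\theta)}_{\text{parametric}} \longrightarrow \underbrace{p(x \mid \theta)}_{\text{parametric}}\, p(\theta) \longrightarrow \underbrace{p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)}_{\text{mess?}}$$

Conjugacy: require $p(\theta)$ and $p(\theta \mid x)$ to be of the same form. E.g.

$$\underbrace{p(\theta)}_{\text{Dirichlet}} \longrightarrow \underbrace{p(x \mid \theta)}_{\text{Multinomial}}\, p(\theta) \longrightarrow \underbrace{p(\theta \mid x)}_{\text{Dirichlet}}$$

$p(\theta)$ and $p(x \mid \theta)$ are then called conjugate distributions.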
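To see conjugacy concretely, the sketch below takes a Beta prior on the Bernoulli parameter $\alpha$, multiplies it by the likelihood of one observation, normalizes on a grid, and compares the result with a Beta density with updated parameters. The Beta-Bernoulli pair and the update $(r, s) \to (r + x,\, s + 1 - x)$ are standard facts assumed here; the grid size and prior values are arbitrary.

```python
import math

def beta_pdf(a, r, s):
    """Beta(r, s) density at a in (0, 1)."""
    norm = math.gamma(r + s) / (math.gamma(r) * math.gamma(s))
    return norm * a**(r - 1) * (1 - a)**(s - 1)

def bernoulli_pmf(x, a):
    """Bernoulli likelihood a^x (1 - a)^(1 - x) for x in {0, 1}."""
    return a**x * (1 - a)**(1 - x)

r, s, x = 2.0, 3.0, 1                        # prior Beta(2, 3), one observation x = 1
n = 2000
grid = [(k + 0.5) / n for k in range(n)]     # midpoints covering (0, 1)

# Numerator of Bayes' rule on the grid, then a midpoint-rule denominator.
unnorm = [bernoulli_pmf(x, a) * beta_pdf(a, r, s) for a in grid]
z = sum(unnorm) / n
posterior = [u / z for u in unnorm]

# Conjugate closed form: Beta(r + x, s + 1 - x).
closed_form = [beta_pdf(a, r + x, s + 1 - x) for a in grid]
max_err = max(abs(p - q) for p, q in zip(posterior, closed_form))
print(z, max_err)
```

The numerically normalized posterior should match the closed-form Beta density up to the quadrature error, so no integral has to be computed in practice: only the parameters are updated.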

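Example: Dirichlet and Multinomial

$$p(\theta) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1} \qquad \text{Dirichlet in } \theta \qquad \left[\Gamma(x) = (x-1)!\right]$$

$$p(x \mid \theta) = \frac{\left(\sum_i x_i\right)!}{x_1!\, x_2! \cdots x_n!} \prod_{i=1}^{n} \theta_i^{x_i} \qquad \text{Multinomial in } x$$

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) = \text{junk} \times \prod_i \theta_i^{x_i + \alpha_i - 1}$$

which is again Dirichlet, so we must have

$$p(\theta \mid x) = \frac{\Gamma\left(\sum_i (\alpha_i + x_i)\right)}{\prod_i \Gamma(\alpha_i + x_i)} \prod_i \theta_i^{x_i + \alpha_i - 1}.$$

Remember pseudocount of 1? That was just a Dirichlet prior.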
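In code, the Dirichlet-multinomial update is just adding the observed counts to the prior parameters. The sketch below also reports the posterior mean $\alpha_i / \sum_j \alpha_j$, a standard Dirichlet fact not stated above. Reading the "pseudocount of 1" remark as the posterior mean under $\alpha_i = 1$ is an assumption made here for illustration.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters: alpha_i + x_i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + x for a, x in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]      # one pseudocount per outcome (assumed prior)
counts = [5, 0, 2]           # hypothetical multinomial counts x_i
post = dirichlet_posterior(prior, counts)
mean = dirichlet_mean(post)
print(post, mean)
```

With these numbers the posterior parameters are 6, 1 and 3, and the posterior mean gives the outcome that was never observed a probability of 0.1 rather than 0, which is what add-one smoothing does.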

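Conjugate Pairs

Prior | Conditional
Gaussian: $e^{-\|\mu - \mu_0\|^2/(2\sigma^2)}$ | Gaussian: $e^{-\|x - \mu\|^2/(2\sigma^2)}$
Beta: $\frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)}\, \alpha^{r-1} (1-\alpha)^{s-1}$ | Bernoulli: $\alpha^x (1-\alpha)^{1-x}$
Dirichlet: $\frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}$ | Multinomial: $\frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i \theta_i^{x_i}$
Inv. Wishart | Gaussian (cov)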

Note: Conjugacy is mutual, e.g.

Dirichlet → Multinomial → Dirichlet
Multinomial → Dirichlet → Multinomial
