The Exponential Family of Distributions
$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$\theta$      vector of parameters
$T(x)$   vector of "sufficient statistics"
$A(\theta)$   cumulant generating function
$h(x)$   a function of $x$ alone

Key point: $x$ and $\theta$ only "mix" in $e^{\theta^\top T(x)}$
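To get a normalized distribution, for any $\theta$

$$\int p(x)\,dx = e^{-A(\theta)} \int h(x)\, e^{\theta^\top T(x)}\,dx = 1$$

so

$$e^{A(\theta)} = \int h(x)\, e^{\theta^\top T(x)}\,dx,$$

i.e., when $T(x) = x$, $A(\theta)$ is the log of the (two-sided) Laplace transform of $h(x)$, evaluated at $-\theta$.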
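The normalization condition can be checked numerically. The sketch below is illustrative only: it uses the Poisson distribution written in exponential-family form ($h(x) = 1/x!$, $T(x) = x$, $\theta = \log\lambda$, a standard identification not spelled out on the slide), replaces the integral by a sum because $x$ is discrete, and truncates the support at 100. The helper name `log_partition_numeric` is hypothetical.

```python
import math

def log_partition_numeric(theta, h, T, support):
    """A(theta) = log sum_x h(x) exp(theta * T(x)) over a truncated discrete support."""
    return math.log(sum(h(x) * math.exp(theta * T(x)) for x in support))

# Poisson in exponential-family form (assumed for illustration):
# h(x) = 1/x!, T(x) = x, theta = log(lambda).
lam = 3.0
theta = math.log(lam)
support = range(100)  # truncation; the tail beyond 100 is negligible for lambda = 3

A_numeric = log_partition_numeric(
    theta,
    h=lambda x: 1.0 / math.factorial(x),
    T=lambda x: x,
    support=support,
)

# Known closed form for the Poisson log-normalizer: A(theta) = exp(theta) = lambda.
A_closed = math.exp(theta)

# With this A(theta), the probabilities h(x) exp(theta T(x) - A(theta)) sum to one.
total = sum(math.exp(theta * x - A_numeric) / math.factorial(x) for x in support)
```

Here `A_numeric` and `A_closed` should agree to within floating-point error, and `total` should be 1 up to the same tolerance.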
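Examples

Gaussian      $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\|x-\mu\|^2/(2\sigma^2)}$      $x \in \mathbb{R}$
Bernoulli     $p(x) = \alpha^{x} (1-\alpha)^{1-x}$      $x \in \{0, 1\}$
Binomial      $p(x) = \binom{n}{x} \alpha^{x} (1-\alpha)^{n-x}$      $x \in \{0, 1, 2, \ldots, n\}$
Multinomial   $p(x) = \frac{n!}{x_1!\,x_2!\cdots x_n!} \prod_{i=1}^{n} \alpha_i^{x_i}$      $x_i \in \{0, 1, 2, \ldots, n\}$, $\sum_i x_i = n$
Exponential   $p(x) = \lambda\, e^{-\lambda x}$      $x \in \mathbb{R}^{+}$
Poisson       $p(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$      $x \in \{0, 1, 2, \ldots\}$
Dirichlet     $p(x) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$      $x_i \in [0, 1]$, $\sum_i x_i = 1$

(don't need to memorize these except for Gaussian)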
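Natural Parameter Form for Bernoulli

$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$$\begin{aligned}
p(x) &= \alpha^{x} (1-\alpha)^{1-x} \\
     &= \exp\left[ \log\left( \alpha^{x} (1-\alpha)^{1-x} \right) \right] \\
     &= \exp\left[ x \log \alpha + (1-x) \log (1-\alpha) \right] \\
     &= \exp\left[ x \log \frac{\alpha}{1-\alpha} + \log (1-\alpha) \right] \\
     &= \exp\left[ x\theta - \log\left( 1 + e^{\theta} \right) \right]
\end{aligned}$$

so

$$T(x) = x \qquad \theta = \log \frac{\alpha}{1-\alpha} \qquad A(\theta) = \log\left( 1 + e^{\theta} \right)$$

Natural Parameter Form for Gaussian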
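$$\begin{aligned}
p(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)} \\
     &= \frac{1}{\sqrt{2\pi}} \exp\left( -\log\sigma - \frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} \right) \\
     &= \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)} \exp\Big( \theta^\top T(x) - \underbrace{\left( \log\sigma + \mu^2/(2\sigma^2) \right)}_{A(\theta)} \Big)
\end{aligned}$$

where

$$T(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \theta = \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}$$

$$A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma = -\frac{[\theta]_1^2}{4[\theta]_2} - \frac{1}{2} \log\left( -2[\theta]_2 \right)$$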
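The mapping between $(\mu, \sigma^2)$ and the natural parameters can be sanity-checked in a few lines. This is a minimal sketch, with hypothetical helper names, that evaluates the density both in the usual form and as $h(x)\,e^{\theta^\top T(x) - A(\theta)}$ with $T(x) = (x, x^2)$.

```python
import math

def gaussian_natural_params(mu, sigma2):
    """theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def log_partition(theta):
    """A(theta) = -theta_1^2 / (4 theta_2) - 0.5 * log(-2 theta_2)."""
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) - 0.5 * math.log(-2.0 * t2)

def density_natural(x, theta):
    """h(x) exp(theta^T T(x) - A(theta)) with h(x) = 1/sqrt(2 pi), T(x) = (x, x^2)."""
    t1, t2 = theta
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(t1 * x + t2 * x * x - log_partition(theta))

def density_standard(x, mu, sigma2):
    """The usual N(mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Arbitrary example values (assumptions for illustration).
mu, sigma2 = 1.5, 0.7
theta = gaussian_natural_params(mu, sigma2)

# A(theta) written directly in terms of mu and sigma: mu^2/(2 sigma^2) + log(sigma).
A_direct = mu ** 2 / (2.0 * sigma2) + 0.5 * math.log(sigma2)
```

Both expressions for $A(\theta)$ should match, and `density_natural` should agree with `density_standard` at any `x`.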
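Natural Parameter Form for Multivariate Gaussian

$$p(x) = h(x)\, e^{\theta^\top T(x) - A(\theta)}$$

$$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2}$$

$$h(x) = (2\pi)^{-D/2} \qquad T(x) = \begin{bmatrix} x \\ x x^\top \end{bmatrix} \qquad \theta = \begin{bmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\Sigma^{-1} \end{bmatrix}$$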
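To see why these choices of $T(x)$ and $\theta$ work, it may help to expand the quadratic form. The following is a sketch under two assumptions not stated on the slide: the inner product between the matrix-valued blocks of $\theta$ and $T(x)$ is read as a trace, and $A(\theta)$ (which the slide does not list) collects the remaining terms that do not depend on $x$.

```latex
\begin{aligned}
-\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu)
  &= \mu^\top \Sigma^{-1} x \;-\; \tfrac{1}{2}\, x^\top \Sigma^{-1} x \;-\; \tfrac{1}{2}\, \mu^\top \Sigma^{-1} \mu \\
  &= (\Sigma^{-1}\mu)^\top x \;+\; \operatorname{tr}\!\left( \left(-\tfrac{1}{2}\Sigma^{-1}\right) x x^\top \right) \;-\; \tfrac{1}{2}\, \mu^\top \Sigma^{-1} \mu , \\[4pt]
A(\theta) &= \tfrac{1}{2}\, \mu^\top \Sigma^{-1} \mu \;+\; \tfrac{1}{2} \log |\Sigma| .
\end{aligned}
```

The first two terms of the second line are exactly $\theta^\top T(x)$; the last term and the $|\Sigma|^{-1/2}$ factor of the density make up $e^{-A(\theta)}$.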
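The first derivative of A(θ)

$$A(\theta) = \log \Big[ \underbrace{\int h(x)\, e^{\theta^\top T(x)}\,dx}_{Q(\theta)} \Big]$$

$$\begin{aligned}
\frac{dA(\theta)}{d\theta} &= \frac{1}{Q(\theta)} \frac{dQ(\theta)}{d\theta} = \frac{Q'(\theta)}{Q(\theta)} \\
  &= \frac{\int h(x)\, e^{\theta^\top T(x)}\, T(x)\,dx}{\int h(x)\, e^{\theta^\top T(x)}\,dx} \\
  &= \frac{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, T(x)\,dx}{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\,dx} \\
  &= E_{p_\theta}[T(x)] .
\end{aligned}$$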
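The identity $A'(\theta) = E_{p_\theta}[T(x)]$ is easy to check on the Bernoulli case derived earlier, where $A(\theta) = \log(1 + e^{\theta})$ and $T(x) = x$. The sketch below compares a central finite difference of $A$ with the mean computed by direct summation; the helper names and the value of `alpha` are illustrative assumptions.

```python
import math

def A(theta):
    """Bernoulli log-normalizer: A(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def mean_T(theta):
    """E[T(x)] = sum over x in {0, 1} of x * exp(x * theta - A(theta))."""
    return sum(x * math.exp(x * theta - A(theta)) for x in (0, 1))

def numeric_derivative(f, theta, eps=1e-6):
    """Central finite-difference approximation of f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

alpha = 0.3                                # illustrative success probability
theta = math.log(alpha / (1.0 - alpha))    # natural parameter (log-odds)
dA = numeric_derivative(A, theta)          # approximates A'(theta)
```

Both `dA` and `mean_T(theta)` should come out equal to `alpha`, up to finite-difference and rounding error.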
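The second derivative of A(θ)

$$A(\theta) = \log \Big[ \underbrace{\int h(x)\, e^{\theta^\top T(x)}\,dx}_{Q(\theta)} \Big]$$

$$\begin{aligned}
\frac{d^2 A(\theta)}{d\theta^2} &= \frac{d}{d\theta} \left[ \frac{Q'(\theta)}{Q(\theta)} \right] = \frac{d}{d\theta} \left[ \frac{1}{Q(\theta)}\, Q'(\theta) \right] = \frac{Q''(\theta)}{Q(\theta)} - \frac{\left( Q'(\theta) \right)^2}{\left( Q(\theta) \right)^2} \\
  &= \frac{\int h(x)\, e^{\theta^\top T(x)}\, T^2(x)\,dx}{\int h(x)\, e^{\theta^\top T(x)}\,dx} - \left( E_{p_\theta}[T(x)] \right)^2 \\
  &= \frac{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\, T^2(x)\,dx}{\int h(x)\, e^{\theta^\top T(x) - A(\theta)}\,dx} - \left( E_{p_\theta}[T(x)] \right)^2 \\
  &= E_{p_\theta}\left[ T^2(x) \right] - \left( E_{p_\theta}[T(x)] \right)^2 = \mathrm{Cov}_{p_\theta}[T(x)] \succeq 0 .
\end{aligned}$$

$\Rightarrow$ $A(\theta)$ is convex. ($\succeq$ means positive semidefinite)

Maximum Likelihood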
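$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \left[ \log h(x_i) + \theta^\top T(x_i) - A(\theta) \right]$$

To find the maximum likelihood solution, set the derivative to zero:

$$\ell'(\theta) = \left[ \sum_{i=1}^{N} T(x_i) \right] - N A'(\theta) = 0$$

So the ML solution satisfies

$$A'(\hat\theta_{ML}) = \frac{1}{N} \sum_{i=1}^{N} T(x_i)$$

(is $\hat\theta_{ML}$ a consistent estimator then?)

The sufficient statistics $\frac{1}{N} \sum_{i=1}^{N} T(x_i)$ summarize the data.

When this can't be solved analytically: convexity of $A(\theta)$ makes $\ell(\theta)$ concave $\Rightarrow$ unique global ML solution for $\theta$ (provided a solution exists and $\mathrm{Cov}_{p_\theta}[T(x)]$ is strictly positive definite).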
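The moment-matching condition $A'(\hat\theta_{ML}) = \frac{1}{N}\sum_i T(x_i)$ can be solved numerically with Newton's method, since $A''(\theta) \ge 0$. The sketch below is one possible illustration, not a general solver: it assumes a scalar $\theta$, uses the Poisson family ($T(x) = x$, $A(\theta) = e^{\theta}$, a standard form not given on the slide) so that the answer can be compared with the closed form $\hat\theta = \log \bar{x}$, and the data are made up.

```python
import math

def fit_natural_parameter(mean_T, A_prime, A_double_prime, theta0=0.0,
                          tol=1e-12, max_iter=100):
    """Solve A'(theta) = mean_T for scalar theta by Newton's method."""
    theta = theta0
    for _ in range(max_iter):
        # Newton step on g(theta) = A'(theta) - mean_T, whose derivative is A''(theta).
        step = (A_prime(theta) - mean_T) / A_double_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Made-up count data; only the averaged sufficient statistic is needed.
data = [2, 4, 3, 1, 5, 3, 0, 6]
mean_T = sum(data) / len(data)

# Poisson family (assumed for illustration): A(theta) = e^theta, so A' = A'' = e^theta.
theta_hat = fit_natural_parameter(mean_T, A_prime=math.exp, A_double_prime=math.exp)
lambda_hat = math.exp(theta_hat)  # back to the usual rate parameter
```

For this family `lambda_hat` should equal the sample mean. Note that the sketch does not handle the degenerate case where the sample mean is zero, for which no finite $\hat\theta$ exists.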
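Products

Products of E-family distributions are E-family distributions

$$\left( h(x)\, e^{\theta_1^\top T(x) - A(\theta_1)} \right) \times \left( h(x)\, e^{\theta_2^\top T(x) - A(\theta_2)} \right) = \tilde{h}(x)\, e^{(\theta_1 + \theta_2)^\top T(x) - \tilde{A}(\theta_1, \theta_2)}$$

but might not have a nice parametric form any more.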
But the product of two Gaussians is always a Gaussian.
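For the Gaussian case, the "natural parameters add" rule can be checked directly. This is a minimal sketch with hypothetical helper names and arbitrary example values: it adds the natural parameters of two univariate Gaussians, converts the sum back to $(\mu, \sigma^2)$, and confirms that the pointwise product of the two densities is a constant multiple of the resulting Gaussian density. The constant is not 1 because the product of two densities is unnormalized.

```python
import math

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> theta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def from_natural(theta):
    """theta -> (mu, sigma^2), inverting to_natural."""
    t1, t2 = theta
    sigma2 = -1.0 / (2.0 * t2)
    return (t1 * sigma2, sigma2)

def normal_pdf(x, mu, sigma2):
    """The usual N(mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Two example Gaussians (values are illustrative assumptions).
g1 = (0.0, 1.0)   # (mu, sigma^2)
g2 = (2.0, 0.5)

# Natural parameters of the product are the sum of the natural parameters.
theta_sum = tuple(a + b for a, b in zip(to_natural(*g1), to_natural(*g2)))
mu_prod, sigma2_prod = from_natural(theta_sum)

# Ratio of the raw product to the candidate Gaussian; it should not depend on x.
ratios = [
    normal_pdf(x, *g1) * normal_pdf(x, *g2) / normal_pdf(x, mu_prod, sigma2_prod)
    for x in (-1.0, 0.0, 1.0, 2.0)
]
```

Adding the second components of $\theta$ is the familiar statement that precisions add: here $1/\sigma^2 = 1 + 2 = 3$ and $\mu = 4/3$.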
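Conjugate Priors in Bayesian Statistics

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\,d\theta}$$

Note: the denominator is not a function of $\theta$ $\Rightarrow$ just a normalizing term.

$$\underbrace{p(\theta)}_{\text{parametric}} \;\longrightarrow\; \underbrace{p(x \mid \theta)}_{\text{parametric}} \;\longrightarrow\; \underbrace{p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)}_{\text{mess?}}$$

Conjugacy: require $p(\theta)$ and $p(\theta \mid x)$ to be of the same form. E.g.

$$\underbrace{p(\theta)}_{\text{Dirichlet}} \;\longrightarrow\; \underbrace{p(x \mid \theta)}_{\text{Multinomial}} \;\longrightarrow\; \underbrace{p(\theta \mid x)}_{\text{Dirichlet}}$$

$p(\theta)$ and $p(x \mid \theta)$ are then called conjugate distributions.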
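Example: Dirichlet and Multinomial

$$p(\theta) = \frac{\Gamma\left( \sum_i \alpha_i \right)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1} \qquad \text{Dirichlet in } \theta$$

where $\Gamma(x) = (x-1)!$ for positive integer $x$.

$$p(x \mid \theta) = \frac{\left( \sum_i x_i \right)!}{x_1!\, x_2! \cdots x_n!} \prod_{i=1}^{n} \theta_i^{x_i} \qquad \text{Multinomial in } x$$

$$p(\theta \mid x) \propto p(x \mid \theta) \times p(\theta) = \text{junk} \times \prod_i \theta_i^{x_i + \alpha_i - 1}$$

which is again Dirichlet, so we must have

$$p(\theta \mid x) = \frac{\Gamma\left( \sum_i (\alpha_i + x_i) \right)}{\prod_i \Gamma(\alpha_i + x_i)} \prod_i \theta_i^{x_i + \alpha_i - 1} .$$

Remember the pseudocount of 1? That was just a Dirichlet prior.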
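In code, the conjugate update is just adding the observed counts to the prior parameters. The sketch below uses hypothetical helper names and made-up counts; the posterior-mean formula $\alpha_i / \sum_j \alpha_j$ is the standard Dirichlet mean, which the slides do not state explicitly.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after multinomial counts: alpha_i + x_i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + x for a, x in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]   # one pseudocount per category (illustrative choice)
counts = [3, 0, 1]        # made-up observed counts x_i

posterior = dirichlet_posterior(prior, counts)
smoothed = dirichlet_mean(posterior)   # (x_i + 1) / (N + number of categories)
```

With all $\alpha_i = 1$, the posterior mean is the familiar add-one estimate, so the category with zero observed count still receives nonzero probability.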
Conjugate Pairs

Prior                                                                                          Conditional
Gaussian      $e^{-\|\mu - \mu_0\|^2/(2\sigma^2)}$                                              Gaussian        $e^{-\|x - \mu\|^2/(2\sigma^2)}$
Beta          $\frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, \alpha^{r-1} (1-\alpha)^{s-1}$           Bernoulli       $\alpha^{x} (1-\alpha)^{1-x}$
Dirichlet     $\frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1}$      Multinomial     $\frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i \theta_i^{x_i}$
Inv. Wishart                                                                                   Gaussian (cov)

Note: Conjugacy is mutual, e.g.

Dirichlet → Multinomial → Dirichlet
Multinomial → Dirichlet → Multinomial
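The Beta/Bernoulli pair from the table can be verified numerically. The sketch below is an illustration under assumed values (prior $r = 2$, $s = 3$ and a made-up data set), with hypothetical helper names: it applies the update $r \leftarrow r + \sum_i x_i$, $s \leftarrow s + N - \sum_i x_i$ and checks that prior times likelihood is a constant multiple of the updated Beta density.

```python
import math

def beta_pdf(a, r, s):
    """Beta density: Gamma(r+s) / (Gamma(r) Gamma(s)) * a^(r-1) * (1-a)^(s-1)."""
    log_norm = math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
    return math.exp(log_norm + (r - 1.0) * math.log(a) + (s - 1.0) * math.log(1.0 - a))

def bernoulli_likelihood(a, xs):
    """Product over the data of a^x * (1-a)^(1-x)."""
    k = sum(xs)
    return a ** k * (1.0 - a) ** (len(xs) - k)

def beta_update(r, s, xs):
    """Conjugate update: add the successes to r and the failures to s."""
    k = sum(xs)
    return r + k, s + len(xs) - k

r, s = 2.0, 3.0          # illustrative prior parameters
xs = [1, 0, 1, 1, 0]     # made-up Bernoulli observations

r_post, s_post = beta_update(r, s, xs)

# prior * likelihood / posterior should be the same constant for every alpha.
ratios = [
    beta_pdf(a, r, s) * bernoulli_likelihood(a, xs) / beta_pdf(a, r_post, s_post)
    for a in (0.1, 0.4, 0.7, 0.9)
]
```

The constant value of `ratios` is the normalizing term from Bayes' rule, i.e. the denominator that does not depend on the parameter.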