
Lecture 7: Conjugate Priors

Marina Meilă [email protected]

Department of Statistics, University of Washington

February 1, 2018

Exponential families have conjugate priors

Examples: Bernoulli model with Beta prior

Examples: Multivariate normal with Normal-Inverse Wishart prior

Example: Poisson model with Gamma prior

Reading: B&S 5.2; Hoff 3.3, 7.1–3

Exponential families have conjugate priors

▶ We derive the posterior p_{θ|x1:n} in an exponential family model in canonical form

      p_X(x; θ) = c(x) e^{θ^T x − ψ(θ)}        (1)

▶ The likelihood of the sample x1:n, with mean x̄ = (Σ_i x_i)/n, is

      p_{x1:n|θ} = (Π_i c(x_i)) e^{n(x̄^T θ − ψ(θ))}        (2)

▶ Prior for the parameter θ

      p_θ(θ; parameters)        (the parameters will be specified shortly)        (3)

▶ By Bayes’ rule

      p_{θ|x1:n} ∝ C(x1:n) e^{n(x̄^T θ − ψ(θ)) + ln p_θ(θ; parameters)}        (4)

  with C(x1:n) = Π_i c(x_i).

▶ Let’s look at the exponent

      n(x̄^T θ − ψ(θ)) + ln p_θ(θ; parameters)        (5)

  where x̄^T θ is bilinear in x̄ and θ, and ψ(θ) plays the role of ln Z(θ).

▶ The first two terms look like an exponential family in θ. What would it take to make the posterior an exponential family as well?
  Answer: ln p_θ(θ; parameters) = ν0(µ0^T θ − ψ(θ)) + constant(ν0, µ0).

▶ A prior

      p_θ(θ; ν0, µ0) = (1/Z(ν0, µ0)) e^{ν0(µ0^T θ − ψ(θ))}        (6)

  is called the conjugate prior for the exponential family defined by (1).

▶ The normalization constant is

      Z(ν0, µ0) = ∫_Θ e^{ν0(µ0^T θ − ψ(θ))} dθ = e^{φ(ν0, µ0)}        (7)

▶ The posterior is now

      p_{θ|x1:n} ∝ e^{n(x̄^T θ − ψ(θ)) + ν0(µ0^T θ − ψ(θ))} = e^{(n+ν0)(((n x̄ + ν0 µ0)/(n + ν0))^T θ − ψ(θ))}        (8)

  with hyper-parameters ν = n + ν0 and µ = (n x̄ + ν0 µ0)/(n + ν0).
  Exercise Why did the factor C(x1:n) disappear?

▶ Hence, the ν parameter behaves like an equivalent sample size, the µ parameter like a mean-value parameter, and νµ like an equivalent sufficient statistic

▶ When n ≫ ν0 the influence of the prior becomes negligible, while for n ≪ ν0 the prior sets the model mean near µ0. A numerical sketch of this update follows.
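Below is a minimal Python sketch of the hyper-parameter update in (8); the helper name conjugate_update and the Bernoulli example data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def conjugate_update(nu0, mu0, x):
    """Hyper-parameter update of eq. (8): nu = n + nu0 and
    mu = (n*xbar + nu0*mu0) / (n + nu0), a convex combination."""
    n = len(x)
    xbar = np.mean(x, axis=0)
    nu = n + nu0                                # equivalent sample size grows by n
    mu = (n * xbar + nu0 * mu0) / (n + nu0)     # weighted average of data mean and prior mean
    return nu, mu

# Bernoulli data: mu is then the posterior mean of the success probability
x = np.array([1, 0, 1, 1, 0, 1])
print(conjugate_update(nu0=2.0, mu0=0.5, x=x))  # (8.0, 0.625): prior acts as 2 pseudo-observations
```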

Bernoulli model with Beta prior

▶ See Lecture 6. A small numerical sketch is below.
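For completeness, a sketch of the Beta-Bernoulli update from Lecture 6, assuming the usual pseudo-count parametrization; the prior values a0, b0 are illustrative.

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1])   # Bernoulli observations
a0, b0 = 1.0, 1.0                  # Beta(1, 1), i.e., a uniform prior on p
a = a0 + x.sum()                   # posterior pseudo-count of ones
b = b0 + len(x) - x.sum()          # posterior pseudo-count of zeros
posterior = stats.beta(a, b)       # posterior is Beta(a, b)
print(posterior.mean())            # E[p | x] = a / (a + b) = 5/8
```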

Multivariate Normal

▶ The multivariate normal in p dimensions is

      Normal(x; µ, Σ) = (1 / ((2π)^{p/2} √(det Σ))) e^{−(1/2)(x−µ)^T Σ^{-1} (x−µ)}        (9)

  with x, µ ∈ R^p and Σ ∈ R^{p×p} positive definite.
▶ Remark When p_X ∝ e^{−(1/2) X^T A X + b^T X}, then X ∼ Normal(A^{-1} b, A^{-1}); a quick 1-D numerical check follows.
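A 1-D numerical check of the Remark, with A and b taken as arbitrary scalars (my choice, for illustration): the mean of p_X ∝ e^{−Ax²/2 + bx} should be b/A and the variance 1/A.

```python
import numpy as np

A, b = 2.0, 1.0                               # arbitrary scalar "matrix" and vector
xs = np.linspace(-10.0, 10.0, 20001)          # fine grid covering essentially all mass
w = np.exp(-0.5 * A * xs**2 + b * xs)
w /= w.sum()                                  # normalize the density on the grid
print((w * xs).sum(), b / A)                  # numerical mean vs. A^{-1} b: both ~0.5
print((w * (xs - b / A)**2).sum(), 1.0 / A)   # numerical variance vs. A^{-1}: both ~0.5
```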

▶ It is useful to separate the prior: p_{µ,Σ} = p_{µ|Σ} p_Σ

The conjugate prior on µ

▶ Prior

      p_µ(µ; µ0, Λ0) = Normal(µ; µ0, Λ0)        (10)

▶ Data sufficient statistics: n, x̄, and S = (1/n) Σ_{i=1}^n (x_i − µ)(x_i − µ)^T, the sample covariance
▶ The posterior covariance Λ satisfies Λ^{-1} = Λ0^{-1} + nΣ^{-1} (Cov^{-1} is called a precision matrix)
▶ The posterior mean is µ = Λ(Λ0^{-1} µ0 + nΣ^{-1} x̄)

▶ Data affects the posterior of µ only via x̄ and n; see the sketch below.
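A sketch of the µ-posterior with known Σ, implementing Λ^{-1} = Λ0^{-1} + nΣ^{-1} and the precision-weighted mean above; the function name is illustrative.

```python
import numpy as np

def mu_posterior(mu0, Lambda0, Sigma, x):
    """Posterior of mu given Sigma: precisions add, and the posterior
    mean weights mu0 and xbar by their respective precisions."""
    n, xbar = len(x), x.mean(axis=0)
    P0, P = np.linalg.inv(Lambda0), np.linalg.inv(Sigma)   # prior and per-observation precision
    Lambda = np.linalg.inv(P0 + n * P)                     # posterior covariance
    mu = Lambda @ (P0 @ mu0 + n * (P @ xbar))              # posterior mean
    return mu, Lambda
```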

Defining prior over Σ

▶ Idea: “the prior parameter is a sufficient statistic”. Hence, the conjugate distribution should be the distribution of statistics from Normal(0, S0).
▶ Assume z_{1:ν0} ∼ i.i.d. Normal(0, S0), with z_i ∈ R^p.
▶ Then S̃ = Σ_{i=1}^{ν0} z_i z_i^T (= ν0 × the sample covariance of z_{1:ν0}), and S̃ is non-singular w.p. 1 for ν0 ≥ p. The distribution of S̃ is the Wishart distribution.
▶ We set the conjugate prior for Σ^{-1} to be this distribution . . .

▶ . . . and we say Σ is distributed as the Inverse Wishart; a sampling sketch is below.
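A simulation sketch of the construction above: summing ν0 outer products of Normal(0, S0) draws yields a Wishart-distributed matrix that is almost surely non-singular when ν0 ≥ p. The specific values of p, ν0, S0 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, nu0 = 3, 10
S0 = np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), S0, size=nu0)  # z_1, ..., z_{nu0} in R^p
S_tilde = Z.T @ Z                                       # sum_i z_i z_i^T, Wishart(nu0, S0)
print(np.linalg.matrix_rank(S_tilde))                   # p: non-singular since nu0 >= p
print(np.round(S_tilde / nu0, 2))                       # approaches S0 as nu0 grows
```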

Wishart and Inverse Wishart

▶ The Wishart distribution with ν0 degrees of freedom, over S⁺_p, the set of positive definite p × p matrices, is

      p_K(K; ν0, S0) = (1 / (2^{ν0 p/2} (det S0)^{ν0/2} Γ_p(ν0/2))) (det K)^{(ν0−p−1)/2} e^{−(1/2) tr(S0^{-1} K)}        (11)

  with Γ_p(ν0/2) = π^{p(p−1)/4} Π_{j=1}^p Γ(ν0/2 − (j−1)/2)
▶ E[Σ^{-1}] = ν0 S0
▶ E[Σ] = (1/(ν0 − p − 1)) S0^{-1}
▶ Posterior parameters: ν0 + n and S0^{-1} + nS(µ)
▶ Again, posterior parameters and sufficient statistics combine linearly
▶ Posterior expectation of Σ: E[Σ | x1:n] = (1/(ν0 + n − p − 1)) [S0^{-1} + nS(µ)]; a sketch computing it is below.
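A sketch of the posterior expectation of Σ from the last bullet, assuming µ is given; S_mu below is the sample covariance around µ, and the function name is illustrative.

```python
import numpy as np

def sigma_posterior_mean(S0, nu0, x, mu):
    """E[Sigma | x_{1:n}] = (S0^{-1} + n S(mu)) / (nu0 + n - p - 1)."""
    n, p = x.shape
    S_mu = (x - mu).T @ (x - mu) / n                       # sample covariance around mu
    return (np.linalg.inv(S0) + n * S_mu) / (nu0 + n - p - 1)
```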

Univariate Normal and its conjugate prior

      p_X(x; µ, σ^2) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}        (12)

▶ Prior: p_{µ|σ^2,µ0,λ0} = Normal(µ; µ0, λ0)
▶ Posterior: p_{µ|x1:n,σ^2,µ0,λ0} = Normal(µ; E[µ], λ) with posterior mean

      E[µ] = λ ((1/λ0) µ0 + (n/σ^2) x̄)

▶ 1/λ0 plays the role of an equivalent sample size
▶ Posterior precision: 1/λ = 1/λ0 + n/σ^2
▶ Precision increases with observing data; a sketch of both updates is below.
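A sketch of the univariate updates above (known σ^2); the function name is illustrative.

```python
import numpy as np

def normal_mu_posterior(mu0, lambda0, sigma2, x):
    """Known-variance normal model: 1/lam = 1/lambda0 + n/sigma2,
    and E[mu] weights mu0 and xbar by prior and data precision."""
    n, xbar = len(x), np.mean(x)
    lam = 1.0 / (1.0 / lambda0 + n / sigma2)        # posterior variance (precision grows with n)
    mu = lam * (mu0 / lambda0 + n * xbar / sigma2)  # posterior mean
    return mu, lam
```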

The Poisson distribution and the Gamma prior

▶ Poisson distribution: P_X(x) = (1/x!) e^{−λ} λ^x = (1/Γ(x+1)) e^{θx − e^θ} with θ = ln λ

▶ The conjugate prior is then

      p_θ(θ; ν, µ) ∝ e^{ν(θµ − e^θ)} = e^{θ(νµ)} e^{−ν e^θ}        (13)

▶ Changing the variable back to λ, we have dθ = dλ/λ and

      p_{λ|ν,µ} ∝ (1/λ) λ^{νµ} e^{−νλ} ∝ gamma(λ; νµ, ν)        (14)

▶ Recall that the mean of gamma(α, β) is α/β; hence E[λ] = µ. A numerical sketch follows.
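Finally, a sketch of the Gamma-Poisson update in the (νµ, ν) parametrization of (14): observing x1:n sends the shape νµ to νµ + Σ_i x_i and the rate ν to ν + n. The data values are made up.

```python
import numpy as np

x = np.array([3, 5, 4, 6, 2])       # Poisson counts (made-up data)
nu0, mu0 = 2.0, 3.0                 # prior: E[lambda] = mu0, worth nu0 pseudo-observations
shape = nu0 * mu0 + x.sum()         # gamma shape: nu*mu picks up the count total
rate = nu0 + len(x)                 # gamma rate: nu picks up the sample size
print(shape / rate)                 # posterior mean of lambda: (6 + 20) / 7 ~ 3.71
```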