Conjugate Priors
Lecture 7: Conjugate Priors
Marina Meilă ([email protected])
Department of Statistics, University of Washington
February 1, 2018

Outline:
- Exponential families have conjugate priors
- Example: Bernoulli model with Beta prior
- Example: Multivariate normal with Normal-Inverse Wishart prior
- Example: Poisson distribution

Reading: B&S 5.2; Hoff 3.3, 7.1–7.3

The posterior p_{θ|x_{1:n}} in an exponential family

- Exponential family model in canonical form, with log-normalizer ψ(θ) = ln Z(θ):

      p_X(x; θ) = c(x) e^{θ^T x − ψ(θ)}                                         (1)

- Likelihood of a sample x_{1:n} with mean x̄ = (Σ_i x_i)/n:

      p_{x_{1:n}|θ} = Π_i c(x_i) e^{x_i^T θ − ψ(θ)} = C(x_{1:n}) e^{n(x̄^T θ − ψ(θ))}   (2)

  with C(x_{1:n}) = Π_i c(x_i).
- Prior for the parameter θ: p_θ(θ; parameters); the parameters will be specified shortly.   (3)
- By Bayes' rule,

      p_{θ|x_{1:n}} ∝ C(x_{1:n}) e^{n(x̄^T θ − ψ(θ)) + ln p_θ(θ; parameters)}        (4)

- Let's look at the exponent

      n(x̄^T θ − ψ(θ)) + ln p_θ(θ; parameters)                                   (5)

  where the first term is bilinear in x̄ and θ, and ψ(θ) = ln Z(θ).
- The first two terms look like an exponential family in θ. What would it take to make the posterior an exponential family?
  Answer: ln p_θ(θ; parameters) = ν_0(μ_0^T θ − ψ(θ)) + constant(ν_0, μ_0).

The conjugate prior

- A prior of the form

      p_θ(θ; ν_0, μ_0) = (1/Z(ν_0, μ_0)) e^{ν_0(μ_0^T θ − ψ(θ))}                   (6)

  is called the conjugate prior for the exponential family defined by (1).
- The normalization constant is

      Z(ν_0, μ_0) = e^{φ(ν_0, μ_0)} = ∫_Θ e^{ν_0(μ_0^T θ − ψ(θ))} dθ               (7)

- The posterior is now

      p_{θ|x_{1:n}} ∝ e^{n(x̄^T θ − ψ(θ)) + ν_0(μ_0^T θ − ψ(θ))}
                    = e^{(n+ν_0)[((n x̄ + ν_0 μ_0)/(n+ν_0))^T θ − ψ(θ)] − φ(ν_0, μ_0)}   (8)

  i.e. the conjugate family again, with hyper-parameters ν = n + ν_0 and μ = (n x̄ + ν_0 μ_0)/(n + ν_0).
  Exercise: Why did the factor C(x_{1:n}) disappear?
- Hence the ν parameter behaves like an equivalent sample size, the μ parameter like a mean value parameter, and νμ like an equivalent sufficient statistic.
- When n ≫ ν_0 the influence of the prior becomes negligible, while for n ≪ ν_0 the prior sets the model mean near μ_0.

Bernoulli model with Beta prior

- See Lecture 6.

Multivariate Normal

- The multivariate normal distribution in p dimensions is

      Normal(x; μ, Σ) = (2π)^{−p/2} (det Σ)^{−1/2} e^{−(1/2)(x−μ)^T Σ^{−1} (x−μ)}   (9)

  with x, μ ∈ R^p and Σ ∈ R^{p×p} positive definite.
  Remark: when p_X ∝ e^{−(1/2) X^T A X + b^T X}, then X ∼ Normal(A^{−1} b, A^{−1}).
- It is useful to separate the prior into p_{μ,Σ} = p_{μ|Σ} p_Σ.

The conjugate prior on μ

      p_μ(μ; μ_0, Λ_0) = Normal(μ; μ_0, Λ_0)                                     (10)

- Data sufficient statistics: n, x̄, S, where S = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T is the sample covariance matrix.
- The posterior covariance Λ satisfies Λ^{−1} = Λ_0^{−1} + n Σ^{−1}. (The inverse of a covariance matrix, Cov^{−1}, is called a precision matrix.)
- The posterior mean is Λ(Λ_0^{−1} μ_0 + n Σ^{−1} x̄).
- The data affect the posterior of μ only through x̄ and n.

Defining the prior over Σ

- Idea: "the prior parameter is a sufficient statistic". Hence the conjugate distribution should be the distribution of statistics computed from Normal(0, S_0) samples.
- Assume z_{1:ν_0} ∼ i.i.d. Normal(0, S_0), with z_i ∈ R^p.
- Then S_{ν_0} = Σ_{i=1}^{ν_0} z_i z_i^T is a covariance matrix (= ν_0 × the sample covariance of z_{1:ν_0}), and it is non-singular with probability 1 for ν_0 ≥ p. The distribution of S_{ν_0} is the Wishart distribution (a small simulation sketch follows below).
- We set the conjugate prior for Σ^{−1} to be this distribution, and we say Σ is distributed as the Inverse Wishart.
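As a quick illustration of the construction above (my addition, not part of the original slides), the following Python sketch builds S_{ν_0} = Σ_i z_i z_i^T from i.i.d. Normal(0, S_0) draws and checks by Monte Carlo that E[S_{ν_0}] = ν_0 S_0. The names draw_wishart, p, nu0, S0 are illustrative choices, not notation fixed by the lecture.

```python
# Sketch: the sum-of-outer-products construction of a Wishart-distributed matrix.
# Only numpy is assumed; draw_wishart and the test values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
p, nu0 = 3, 50                          # dimension and degrees of freedom (nu0 >= p)
A = rng.standard_normal((p, p))
S0 = A @ A.T + p * np.eye(p)            # an arbitrary positive definite scale matrix

def draw_wishart(nu0, S0, rng):
    """Return S = sum_i z_i z_i^T with z_1..z_nu0 i.i.d. Normal(0, S0)."""
    z = rng.multivariate_normal(np.zeros(len(S0)), S0, size=nu0)  # shape (nu0, p)
    return z.T @ z                      # sum of outer products: a p x p PSD matrix

# Monte Carlo check that E[S_{nu0}] = nu0 * S0
S_bar = np.mean([draw_wishart(nu0, S0, rng) for _ in range(2000)], axis=0)
print("max entrywise error of S_bar/nu0 vs S0:", np.abs(S_bar / nu0 - S0).max())
```

Since nu0 >= p here, each draw is non-singular with probability 1, matching the condition stated in the slide.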
Wishart and Inverse Wishart

- The Wishart distribution with ν_0 degrees of freedom, defined over S_p^+, the set of positive definite p × p matrices, has density

      p_K(K; ν_0, S_0) = [1 / (2^{ν_0 p/2} (det S_0)^{ν_0/2} Γ_p(ν_0/2))] (det K)^{(ν_0−p−1)/2} e^{−(1/2) trace(S_0^{−1} K)}   (11)

  with Γ_p(ν_0/2) = π^{p(p−1)/4} Π_{j=1}^p Γ(ν_0/2 − (j−1)/2).
- E[Σ^{−1}] = ν_0 S_0
- E[Σ] = S_0^{−1} / (ν_0 − p − 1)
- Posterior parameters: ν_0 + n and S_0^{−1} + n S(μ); again, posterior parameters and sufficient statistics combine linearly.
- Posterior expectation of Σ: E[Σ | x_{1:n}] = [S_0^{−1} + n S(μ)] / (ν_0 + n − p − 1)

Univariate Normal and its conjugate prior

- The univariate normal model, with σ² known, is

      p_X(x; μ, σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}                              (12)

- Prior: p_{μ|σ², μ_0, λ_0} = Normal(μ; μ_0, λ_0)
- Posterior: p_{μ|x_{1:n}, σ², μ_0, λ_0} = Normal(μ; μ_n, λ_n)
- Posterior precision: 1/λ_n = 1/λ_0 + n/σ²
- Posterior mean: E[μ | x_{1:n}] = μ_n = λ_n (μ_0/λ_0 + n x̄/σ²)
- The prior precision 1/λ_0 plays the role of an equivalent sample size (in units of the data precision 1/σ²).
- The precision increases as data are observed.

The Poisson distribution and the Gamma prior

- The Poisson distribution is

      P_X(x) = (1/x!) e^{−λ} λ^x = (1/Γ(x+1)) e^{θx − e^θ},   with θ = ln λ,

  an exponential family with log-normalizer ψ(θ) = e^θ.
- The conjugate prior (6) is then

      p_{θ|ν,μ} ∝ e^{ν(μθ − e^θ)} = e^{θ(νμ)} e^{−ν e^θ}                          (13)

- Changing the variable back to λ, with dθ = dλ/λ, we have

      p_{λ|ν,μ} = (1/λ) λ^{νμ} e^{−νλ} ∝ gamma(λ; νμ, ν)                          (14)

- Recall that the mean of gamma(α, β) is α/β; hence E[λ] = μ (a numerical sketch of this update follows below).
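To make the equivalent-sample-size reading of (ν, μ) concrete, here is a minimal Python sketch (my addition, not from the slides) of the Poisson–Gamma update: with prior λ ∼ gamma(νμ, ν) as in (14) and observations x_1, …, x_n ∼ Poisson(λ), the posterior is gamma(νμ + Σ_i x_i, ν + n), so the posterior mean is (νμ + n x̄)/(ν + n), exactly the general rule from (8). Variable names and test values are illustrative.

```python
# Sketch of the Poisson-Gamma conjugate update; names and values are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
nu, mu = 5.0, 2.0                     # prior: equivalent sample size nu, prior mean E[lambda] = mu
x = rng.poisson(3.0, size=200)        # data simulated with true lambda = 3

alpha_post = nu * mu + x.sum()        # gamma shape: nu*mu -> nu*mu + sum_i x_i
beta_post = nu + len(x)               # gamma rate:  nu    -> nu + n
print("posterior mean:", alpha_post / beta_post)   # = (nu*mu + n*xbar)/(nu + n), close to 3
```

For n ≫ ν the posterior mean approaches x̄ and the prior washes out; for n ≪ ν it stays near μ, as stated after (8).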