
Conjugate Priors, Uninformative Priors

Nasim Zolaktaf

UBC Machine Learning Reading Group

January 2016

Outline

Exponential Families
Conjugacy
Conjugate priors
Mixtures of conjugate priors
Uninformative priors
Jeffreys prior

The Exponential Family

A probability mass function (pmf) or probability density function (pdf) p(X|θ), for X = (X₁, ..., X_m) ∈ 𝒳^m and θ ∈ Θ ⊆ R^d, is said to be in the exponential family if it is of the form

p(X|θ) = (1/Z(θ)) h(X) exp[θ^T φ(X)]   (1)
       = h(X) exp[θ^T φ(X) − A(θ)]     (2)

where

Z(θ) = ∫_{𝒳^m} h(X) exp[θ^T φ(X)] dX   (3)
A(θ) = log Z(θ)                          (4)

Z(θ) is called the partition function, A(θ) is called the log partition function or cumulant function, and h(X) is a scaling constant. Equation 2 can be generalized by writing

p(X|θ) = h(X) exp[η(θ)^T φ(X) − A(η(θ))]   (5)
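The definitions above are easy to sanity-check numerically. Below is a minimal sketch (our own illustration, not from the slides) that writes the Bernoulli distribution in the form of Equation 2, with h(X) = 1, φ(X) = X, natural parameter η = log(θ/(1 − θ)), and A(η) = log(1 + e^η), and compares it with the direct pmf:

```python
import math

def bernoulli_pmf(x, theta):
    # Direct pmf: p(x | theta) = theta^x (1 - theta)^(1 - x)
    return theta**x * (1 - theta)**(1 - x)

def bernoulli_expfam(x, theta):
    # Exponential-family form: h(x) exp[eta * phi(x) - A(eta)]
    # with h(x) = 1, phi(x) = x, eta = log(theta / (1 - theta))
    eta = math.log(theta / (1 - theta))
    A = math.log(1 + math.exp(eta))   # log partition function A(eta)
    return math.exp(eta * x - A)

for x in (0, 1):
    assert abs(bernoulli_pmf(x, 0.3) - bernoulli_expfam(x, 0.3)) < 1e-12
```

The same pattern (identify h, φ, η, A) works for every member of the family.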


Binomial Distribution

As an example of a discrete exponential family, consider the Binomial distribution with a known number of trials n. The pmf for this distribution is

p(x|θ) = Binomial(n, θ) = C(n, x) θ^x (1 − θ)^{n−x},  x ∈ {0, 1, ..., n}   (6)

This can equivalently be written as

p(x|θ) = C(n, x) exp( x log(θ/(1 − θ)) + n log(1 − θ) )   (7)

which shows that the Binomial distribution is an exponential family whose natural parameter is

η = log(θ/(1 − θ))   (8)
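Equations 6 and 7 can be checked against each other numerically; a small sketch (function names are ours, not from the slides):

```python
import math

def binomial_pmf(x, n, theta):
    # Equation (6): C(n, x) theta^x (1 - theta)^(n - x)
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def binomial_expfam(x, n, theta):
    # Equation (7): C(n, x) exp(x log(theta / (1 - theta)) + n log(1 - theta))
    return math.comb(n, x) * math.exp(
        x * math.log(theta / (1 - theta)) + n * math.log(1 - theta))

n, theta = 10, 0.4
assert all(abs(binomial_pmf(x, n, theta) - binomial_expfam(x, n, theta)) < 1e-12
           for x in range(n + 1))
```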

Conjugacy

Consider the posterior distribution p(θ|X) with prior p(θ) and likelihood function p(X|θ), where p(θ|X) ∝ p(X|θ)p(θ). If the posterior distribution p(θ|X) is in the same family as the prior distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood p(X|θ). All members of the exponential family have conjugate priors.

Brief List of Conjugate Models

Common conjugate pairs include Beta-Binomial, Dirichlet-Multinomial, Gamma-Poisson, and Gaussian-Gaussian (for a known variance).


The Conjugate Beta Prior

The Beta distribution is conjugate to the Binomial distribution:

p(θ|x) ∝ p(x|θ)p(θ) = Binomial(n, θ) × Beta(a, b)
       = C(n, x) θ^x (1 − θ)^{n−x} · (Γ(a + b)/(Γ(a)Γ(b))) θ^{a−1} (1 − θ)^{b−1}   (9)
       ∝ θ^x (1 − θ)^{n−x} θ^{a−1} (1 − θ)^{b−1}

p(θ|x) ∝ θ^{x+a−1} (1 − θ)^{n−x+b−1}   (10)

The posterior distribution is simply a Beta(x + a, n − x + b) distribution. Effectively, our prior just adds a − 1 successes and b − 1 failures to the dataset.
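As a sanity check on Equations 9 and 10, the sketch below (our own illustration; the data and hyperparameters are example values) normalizes the unnormalized posterior on a grid and compares it with the closed-form Beta(x + a, n − x + b) density:

```python
import math

def beta_pdf(theta, a, b):
    # Beta(a, b) density, via log-gamma for numerical stability
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta) - log_B)

x, n, a, b = 7, 10, 2.0, 2.0   # data and prior hyperparameters (our example)
N = 10_000
grid = [(i + 0.5) / N for i in range(N)]   # midpoint grid on (0, 1)
unnorm = [t**(x + a - 1) * (1 - t)**(n - x + b - 1) for t in grid]  # eq. (10)
Z = sum(unnorm) / N                        # Riemann-sum normalizer
num = [u / Z for u in unnorm]              # numerically normalized posterior
exact = [beta_pdf(t, x + a, n - x + b) for t in grid]
assert max(abs(p - q) for p, q in zip(num, exact)) < 1e-3
```

The agreement shows that conjugacy lets us skip the normalizing integral entirely: the posterior is available in closed form.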

Coin Flipping Example: Model

Use a Bernoulli likelihood for a coin X landing ‘heads’:

p(X = ‘H’|θ) = θ,  p(X = ‘T’|θ) = 1 − θ,
p(X|θ) = θ^{I(X=‘H’)} (1 − θ)^{I(X=‘T’)}.

Use a Beta prior for the probability θ of ‘heads’, θ ∼ Beta(a, b):

p(θ|a, b) = θ^{a−1}(1 − θ)^{b−1} / B(a, b) ∝ θ^{a−1}(1 − θ)^{b−1}

with B(a, b) = Γ(a)Γ(b)/Γ(a + b). Remember that probabilities integrate to one, so we have

1 = ∫₀¹ p(θ|a, b) dθ = ∫₀¹ θ^{a−1}(1 − θ)^{b−1}/B(a, b) dθ = (1/B(a, b)) ∫₀¹ θ^{a−1}(1 − θ)^{b−1} dθ,

which helps us compute integrals, since it gives

∫₀¹ θ^{a−1}(1 − θ)^{b−1} dθ = B(a, b).


Coin Flipping Example: Posterior

Our model is:

X ∼ Ber(θ),  θ ∼ Beta(a, b).

If we observe ‘HHH’ then our posterior distribution is

p(θ|HHH) = p(HHH|θ)p(θ) / p(HHH)            (Bayes’ rule)
         ∝ p(HHH|θ)p(θ)                      (p(HHH) is constant)
         = θ³(1 − θ)⁰ p(θ)                   (likelihood definition)
         = θ³(1 − θ)⁰ θ^{a−1}(1 − θ)^{b−1}   (prior definition)
         = θ^{(3+a)−1}(1 − θ)^{b−1},

which we’ve written in the form of a Beta distribution,

θ | HHH ∼ Beta(3 + a, b),

which lets us skip computing the integral p(HHH).

Coin Flipping Example: Estimates

If we observe ‘HHH’ with a Beta(1, 1) prior, then θ | HHH ∼ Beta(3 + a, b) and our different estimates are:

Maximum likelihood:
θ̂ = n_H / n = 3/3 = 1.

MAP with uniform prior:
θ̂ = ((3 + a) − 1) / ((3 + a) + b − 2) = 3/3 = 1.

Posterior predictive:
p(H|HHH) = ∫₀¹ p(H|θ) p(θ|HHH) dθ
         = ∫₀¹ Ber(H|θ) Beta(θ|3 + a, b) dθ
         = ∫₀¹ θ Beta(θ|3 + a, b) dθ = E[θ]
         = (3 + a) / ((3 + a) + b) = 4/5 = 0.8.
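The three estimates can be reproduced in a few lines; a sketch under the slide’s setup (a = b = 1, data HHH):

```python
# Data: HHH; prior: Beta(a, b) with a = b = 1 (uniform)
a, b, n_heads, n = 1, 1, 3, 3

mle = n_heads / n                              # maximum likelihood estimate
map_est = (n_heads + a - 1) / (n + a + b - 2)  # posterior mode of Beta(3+a, b)
post_pred = (n_heads + a) / (n + a + b)        # posterior mean = p(H | HHH)

print(mle, map_est, post_pred)  # 1.0 1.0 0.8
```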

Coin Flipping Example: Effect of Prior

We assumed all θ were equally likely and saw HHH:
ML/MAP predict the coin will always land heads.
Bayes predicts the probability of landing heads is only 80%; it takes into account the other ways that HHH could happen.

Beta(1, 1) prior is like seeing HT or TH in our mind before we flip: the posterior predictive would be (3 + 1)/(3 + 1 + 1) = 0.80.
Beta(3, 3) prior is like seeing 3 heads and 3 tails (a stronger uniform prior): the posterior predictive would be (3 + 3)/(3 + 3 + 3) = 0.667.
Beta(100, 1) prior is like seeing 100 heads and 1 tail (biased): the posterior predictive would be (3 + 100)/(3 + 100 + 1) = 0.990.
Beta(0.01, 0.01) biases towards an unfair coin (all heads or all tails): the posterior predictive would be (3 + 0.01)/(3 + 0.01 + 0.01) = 0.997.

Dependence on (a, b) is where people get uncomfortable, but it is basically the same as choosing the regularization parameter λ. If your prior knowledge isn’t misleading, you will not overfit.
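The posterior predictive values quoted above all follow from the single formula (n_H + a)/(n_H + n_T + a + b); a short sketch reproducing them:

```python
n_heads, n_tails = 3, 0   # observed HHH

preds = {}
for a, b in [(1, 1), (3, 3), (100, 1), (0.01, 0.01)]:
    # Posterior predictive P(next = H | data) = (n_H + a) / (n_H + n_T + a + b)
    preds[(a, b)] = (n_heads + a) / (n_heads + n_tails + a + b)
    print(f"Beta({a}, {b}) prior -> P(H) = {preds[(a, b)]:.3f}")
```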

Mixtures of Conjugate Priors

A mixture of conjugate priors is also conjugate, so we can use a mixture of conjugate priors as a prior. For example, suppose we are modelling coin tosses, and we think the coin is either fair or biased towards heads. This cannot be represented by a single Beta distribution; however, we can model it using a mixture of two Beta distributions. For example, we might use

p(θ) = 0.5 Beta(θ|20, 20) + 0.5 Beta(θ|30, 10)   (11)

If θ comes from the first distribution, the coin is fair, but if it comes from the second it is biased towards heads.

Mixtures of Conjugate Priors (Cont.)

The prior has the form

p(θ) = Σ_k p(z = k) p(θ|z = k)   (12)

where z = k means that θ comes from mixture component k, p(z = k) are called the prior mixing weights, and each p(θ|z = k) is conjugate.

The posterior can also be written as a mixture of conjugate distributions:

p(θ|X) = Σ_k p(z = k|X) p(θ|X, z = k)   (13)

where p(z = k|X) are the posterior mixing weights, given by

p(z = k|X) = p(z = k) p(X|z = k) / Σ_{k′} p(z = k′) p(X|z = k′)   (14)
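Equation 14 can be evaluated in closed form here, since for a Beta component the marginal p(X|z = k) is a Beta-Binomial. A sketch (our own illustration; it uses the prior from Equation 11 and assumes we observe 20 heads in 20 tosses):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_marginal(x, n, a, b):
    # p(X = x | z = k) = C(n, x) * B(x + a, n - x + b) / B(a, b)
    return math.comb(n, x) * math.exp(log_beta(x + a, n - x + b) - log_beta(a, b))

x, n = 20, 20                                   # observe 20 heads in 20 tosses
components = [(0.5, 20, 20), (0.5, 30, 10)]     # prior (11): weight, a, b
joint = [w * beta_binomial_marginal(x, n, a, b) for w, a, b in components]
post_weights = [j / sum(joint) for j in joint]  # equation (14)
print(post_weights)  # almost all weight moves to the heads-biased component
```

The posterior itself is then the mixture of the updated components Beta(x + a_k, n − x + b_k) with these weights, as in Equation 13.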

Uninformative Priors

If we don’t have strong beliefs about what θ should be, it is common to use an uninformative or non-informative prior, and to let the data speak for itself. Designing uninformative priors is tricky.

Uninformative Prior for the Bernoulli

Consider the Bernoulli parameter θ, with likelihood

p(x|θ) = θ^x (1 − θ)^{n−x}   (15)

What is the most uninformative prior for this distribution?

The uniform distribution Uniform(0, 1)? This is equivalent to a Beta(1, 1) prior on θ. But then the MLE is N₁/(N₁ + N₀), whereas the posterior mean is E[θ|X] = (N₁ + 1)/(N₁ + N₀ + 2). The prior isn’t completely uninformative!

By decreasing the magnitude of the pseudo-counts, we can lessen the impact of the prior. By this argument, the most uninformative prior is Beta(0, 0), which is called the Haldane prior. The Haldane prior is an improper prior; it does not integrate to 1. It results in the posterior Beta(x, n − x), which is proper as long as x ≠ 0 and n − x ≠ 0.

We will see that the "right" uninformative prior is Beta(1/2, 1/2).
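A quick numerical illustration of why Beta(1, 1) is not completely uninformative (the counts N₁ = 2, N₀ = 8 are our own example):

```python
# Beta(1, 1) (uniform) prior: posterior mean vs. MLE after N1 heads, N0 tails
N1, N0 = 2, 8
mle = N1 / (N1 + N0)                    # 0.2
post_mean = (N1 + 1) / (N1 + N0 + 2)    # 0.25, pulled towards 1/2 by the prior
print(mle, post_mean)
```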

Jeffreys Prior

Jeffreys argued that an uninformative prior should be invariant to the parametrization used. The key observation is that if p(θ) is uninformative, then any reparametrization of the prior, such as θ = h(φ) for some function h, should also be uninformative. The Jeffreys prior is the prior that satisfies p(θ) ∝ √(det I(θ)), where I(θ) is the Fisher information for θ, and it is invariant under reparametrization of the parameter vector θ.

The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ on which the probability of X depends:

I(θ) = E[ (∂/∂θ log f(X; θ))² | θ ]   (16)

If log f(X; θ) is twice differentiable with respect to θ, then

I(θ) = −E[ ∂²/∂θ² log f(X; θ) | θ ]   (17)

Reparametrization for Jeffreys Prior: One parameter case

For an alternative parametrization φ we can derive p(φ) = √I(φ) from p(θ) = √I(θ), using the change of variables theorem and the definition of Fisher information:

p(φ) = p(θ) |dθ/dφ| ∝ √( I(θ) (dθ/dφ)² ) = √( E[(d ln L/dθ)²] (dθ/dφ)² )
     = √( E[(d ln L/dθ · dθ/dφ)²] ) = √( E[(d ln L/dφ)²] ) = √I(φ)   (18)
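For the Bernoulli, the invariance in Equation 18 can be verified exactly, because the expectation over X ∈ {0, 1} has only two terms. A sketch (our own check; the log-odds reparametrization is chosen for illustration):

```python
import math

theta = 0.3

def score(x, theta):
    # d/d(theta) log p(x | theta) for a Bernoulli observation
    return x / theta - (1 - x) / (1 - theta)

# Fisher information: exact expectation of the squared score over x in {0, 1}
I_theta = theta * score(1, theta)**2 + (1 - theta) * score(0, theta)**2
assert abs(I_theta - 1 / (theta * (1 - theta))) < 1e-12

# Reparametrize by the log-odds eta = log(theta / (1 - theta)),
# so d(theta)/d(eta) = theta * (1 - theta); equation (18) then gives
# I(eta) = I(theta) * (d(theta)/d(eta))^2 = theta * (1 - theta).
dtheta_deta = theta * (1 - theta)
I_eta = I_theta * dtheta_deta**2
assert abs(I_eta - theta * (1 - theta)) < 1e-12
```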

Jeffreys Prior for the Bernoulli

Suppose X ∼ Ber(θ). The log-likelihood for a single sample is

log p(X|θ) = X log θ + (1 − X) log(1 − θ)   (19)

The score function is the gradient of the log-likelihood:

s(θ) = d/dθ log p(X|θ) = X/θ − (1 − X)/(1 − θ)   (20)

The observed information is

J(θ) = −d²/dθ² log p(X|θ) = −s′(θ|X) = X/θ² + (1 − X)/(1 − θ)²   (21)

The Fisher information is the expected information:

I(θ) = E[J(θ|X) | X ∼ θ] = θ/θ² + (1 − θ)/(1 − θ)² = 1/(θ(1 − θ))   (22)

Hence the Jeffreys prior is

p(θ) ∝ θ^{−1/2} (1 − θ)^{−1/2} ∝ Beta(1/2, 1/2).   (23)

Selected Related Work

[1] Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[2] Jarad Niemi. Conjugacy of prior distributions: https://www.youtube.com/watch?v=yhewYFqGjFA

[3] Jarad Niemi. Noninformative prior distributions: https://www.youtube.com/watch?v=25-PpMSrAGM

[4] Fisher information: https://en.wikipedia.org/wiki/Fisher_information