
CS598JHM: Advanced NLP (Spring ʼ10)

Lecture 2: Conjugate priors

Julia Hockenmaier
[email protected]
3324 Siebel Center
http://www.cs.uiuc.edu/class/sp10/cs598jhm


The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

Expectation: E(Bin(n,p)) = np
Variance: var(Bin(n,p)) = np(1-p)
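As a quick numerical check of these formulas, here is a short sketch I have added (not part of the original slides; n=10, p=0.3 and the helper name binomial_pmf are arbitrary choices):

from math import comb

def binomial_pmf(k, n, p):
    # P(k heads) = C(n,k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3                      # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                               # ~1.0
mean  = sum(k * pk for k, pk in enumerate(pmf))                # ~np = 3.0
var   = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # ~np(1-p) = 2.1
print(total, mean, var)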

Binomial likelihood

What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?

[Plot: the binomial likelihood L(θ; D=(#Heads, #Tails)) as a function of θ, for D=(5,5), D=(3,7), and D=(2,8)]


Parameter estimation

Given a set of data D=HTTHTT, what is the probability θ of heads?

- Maximum likelihood estimation (MLE):
  Use the θ which has the highest likelihood P(D|θ):

  P(x=H \mid D) = P(x=H \mid \theta^*) \quad \text{with} \quad \theta^* = \arg\max_\theta P(D \mid \theta)

- Bayesian estimation:
  Compute the expectation of θ given D:

  P(x=H \mid D) = \int_0^1 P(x=H \mid \theta)\, P(\theta \mid D)\, d\theta = E[\theta \mid D]

Maximum likelihood estimation

- Maximum likelihood estimation (MLE):
  find the θ which maximizes the likelihood P(D|θ):

  \theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \theta^H (1-\theta)^T = \frac{H}{H+T}


Bayesian updates

- Data D provides evidence for or against our beliefs.
  We update our belief θ based on the evidence we see:

  P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{\int_\theta P(\theta)\, P(D \mid \theta)\, d\theta}

  (left-hand side: posterior; numerator: prior × likelihood; denominator: = P(D))
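To see the two estimates side by side, here is a small added sketch (not from the slides) that evaluates the posterior on a grid for D=HTTHTT, assuming a uniform prior:

import numpy as np

H, T = 2, 4                         # D = HTTHTT: 2 heads, 4 tails
thetas = np.linspace(0, 1, 10001)   # grid over possible values of theta

likelihood = thetas**H * (1 - thetas)**T
prior = np.ones_like(thetas)        # uniform prior over [0, 1]

# Bayes' rule on the grid: posterior weights proportional to prior * likelihood.
weights = prior * likelihood
weights /= weights.sum()

mle = H / (H + T)                            # arg max of the likelihood = 1/3
posterior_mean = (thetas * weights).sum()    # E[theta|D] ~ (H+1)/(H+T+2) = 0.375
print(mle, posterior_mean)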


Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ), what is the posterior P(θ|D)?
How do we choose the prior P(θ)?

- The posterior is proportional to prior × likelihood:

  P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)

- The likelihood of a binomial is:

  P(D \mid \theta) = \theta^H (1-\theta)^T

- If the prior P(θ) is proportional to powers of θ and (1-θ), the posterior will also be proportional to powers of θ and (1-θ):

  P(\theta) \propto \theta^a (1-\theta)^b


In search of a prior...

We would like something of the form:

P(\theta) \propto \theta^a (1-\theta)^b

But -- this looks just like the binomial:

P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

... except that k is an integer and θ is a real with 0 < θ < 1.
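The closure argument (prior and posterior share the same functional form) can be checked symbolically. A minimal added sketch using sympy; the symbol names are illustrative:

import sympy as sp

theta, a, b, H, T = sp.symbols("theta a b H T", positive=True)

prior = theta**a * (1 - theta)**b           # any prior of this form...
likelihood = theta**H * (1 - theta)**T      # ...times the binomial likelihood...
posterior = sp.powsimp(prior * likelihood)  # ...is again a product of such powers
print(posterior)                            # -> theta**(H+a) * (1-theta)**(T+b)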

The Gamma function

The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:

\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx \quad \text{for } \alpha > 0

For x > 1, Γ(x) = (x-1)Γ(x-1).
For positive integers, Γ(x) = (x-1)!

[Plot: the function Γ(x) for 0 < x < 5]
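Both properties are easy to verify with Python's math.gamma; a short added check (the test point x=3.7 is arbitrary):

from math import gamma, factorial

# Gamma generalizes the factorial: Gamma(n) = (n-1)! for positive integers.
for n in range(1, 6):
    assert abs(gamma(n) - factorial(n - 1)) < 1e-9

# Recurrence Gamma(x) = (x-1) * Gamma(x-1), checked at a non-integer point.
x = 3.7
print(gamma(x), (x - 1) * gamma(x - 1))   # both ~4.17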


The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function

P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}

The first term is a normalization factor (to obtain a distribution):

\int_0^1 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}\, dx = 1

Expectation: \frac{\alpha}{\alpha+\beta}


Beta(α,β) with α > 1, β > 1

Unimodal.
[Plot: the densities of Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), and Beta(3,20) over θ in (0,1)]
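A small added sanity check (not from the slides; α=3 and β=1.5 are arbitrary, and beta_pdf is a hypothetical helper) that the normalized density integrates to 1 and has mean α/(α+β):

from math import gamma

def beta_pdf(x, a, b):
    # Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

a, b = 3.0, 1.5                       # illustrative hyperparameters
xs = [i / 10000 for i in range(1, 10000)]
dx = 1 / 10000

total = sum(beta_pdf(x, a, b) for x in xs) * dx      # ~1.0 (normalized)
mean  = sum(x * beta_pdf(x, a, b) for x in xs) * dx  # ~a/(a+b) = 0.667
print(total, mean)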

Beta(α,β) with α < 1, β < 1

U-shaped.
[Plot: the densities of Beta(0.1,0.1), Beta(0.1,0.5), and Beta(0.5,0.5) over θ in (0,1)]


Beta(α,β) with α = β

Symmetric. α = β = 1: uniform.
[Plot: the densities of Beta(0.1,0.1), Beta(1,1), and Beta(2,2) over θ in (0,1)]


Beta(α,β) with α < 1, β > 1

Strictly decreasing.
[Plot: the densities of Beta(0.1,1.5), Beta(0.5,1.5), and Beta(0.5,2) over θ in (0,1)]


Beta(α,β) with α = 1, β > 1

α = 1, 1 < β < 2: strictly concave.
α = 1, β = 2: straight line.
α = 1, β > 2: strictly convex.
[Plot: the densities of Beta(1,1.5), Beta(1,2), and Beta(1,3) over θ in (0,1)]

Beta as prior for binomial

Given a prior P(θ|α,β) = Beta(α,β) and data D=(H,T), what is our posterior?

P(\theta \mid \alpha, \beta, H, T) \propto P(H, T \mid \theta)\, P(\theta \mid \alpha, \beta)
                                   \propto \theta^H (1-\theta)^T\, \theta^{\alpha-1} (1-\theta)^{\beta-1}
                                   = \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}

With normalization:

P(\theta \mid \alpha, \beta, H, T) = \frac{\Gamma(H+\alpha+T+\beta)}{\Gamma(H+\alpha)\,\Gamma(T+\beta)}\, \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}
                                   = \mathrm{Beta}(\alpha+H, \beta+T)


So, what do we predict?

Our Bayesian estimate for the next coin flip P(x=H|D):

P(x=H \mid D) = \int_0^1 P(x=H \mid \theta)\, P(\theta \mid D)\, d\theta
              = \int_0^1 \theta\, P(\theta \mid D)\, d\theta
              = E[\theta \mid D]
              = E[\mathrm{Beta}(H+\alpha, T+\beta)]
              = \frac{H+\alpha}{H+\alpha+T+\beta}
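A minimal added sketch of this update (the Beta(2,2) prior is an arbitrary choice; the data is the D=HTTHTT example from earlier):

def beta_mean(a, b):
    # E[Beta(a, b)] = a / (a + b)
    return a / (a + b)

alpha, beta = 2.0, 2.0        # prior Beta(2,2): weak belief that the coin is fair
H, T = 2, 4                   # observed data D = HTTHTT

# Conjugacy: the posterior is again a Beta distribution.
post_alpha, post_beta = alpha + H, beta + T          # Beta(4, 6)

# Predictive probability of heads = posterior mean.
p_heads = beta_mean(post_alpha, post_beta)           # (H+alpha)/(H+alpha+T+beta) = 0.4
print(p_heads)

Because the posterior is again a Beta distribution, prediction reduces to reading off the posterior mean in closed form; no integration is needed.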

Conjugate priors

The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution.

All members of the exponential family of distributions have conjugate priors.

Examples:
- Multinomial: conjugate prior = Dirichlet
- Gaussian: conjugate prior = Gaussian


Multinomials: Dirichlet prior

Probability of observing each possible outcome c_i exactly X_i times in a sequence of n independent trials:

P(X_1 = x_1, ..., X_K = x_K) = \frac{n!}{x_1! \cdots x_K!}\, \theta_1^{x_1} \cdots \theta_K^{x_K} \quad \text{if } \sum_{i=1}^K x_i = n

Dirichlet prior:

\mathrm{Dir}(\theta \mid \alpha_1, ..., \alpha_K) = \frac{\Gamma(\alpha_1 + ... + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}
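The same conjugacy argument applies here: a Dirichlet prior plus multinomial counts yields a Dirichlet posterior. A small added sketch (the vocabulary, counts, and hyperparameters are made up):

vocab = ["a", "b", "c"]
alpha = {w: 1.0 for w in vocab}       # symmetric Dirichlet prior: pseudocount 1 per outcome
counts = {"a": 5, "b": 2, "c": 0}     # observed multinomial counts

# Conjugacy: posterior is Dirichlet with parameters alpha_k + x_k.
post = {w: alpha[w] + counts[w] for w in vocab}

# Predictive probability of each outcome = posterior mean of theta_k.
total = sum(post.values())
pred = {w: post[w] / total for w in vocab}
print(pred)    # {'a': 0.6, 'b': 0.3, 'c': 0.1}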

More about conjugate priors

- We can interpret the hyperparameters as “pseudocounts” (see the sketch after this list)

- Sequential estimation (updating counts after each observation) gives the same results as batch estimation

- Add-one smoothing (Laplace smoothing) = uniform prior

- On average, more data leads to a sharper posterior (sharper = lower variance)
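A short added sketch tying the first three points together (the flip sequence reuses the D=HTTHTT example; the update helper is hypothetical):

def update(a, b, flip):
    # One sequential Bayesian update: add 1 to the matching pseudocount.
    return (a + 1, b) if flip == "H" else (a, b + 1)

data = list("HTTHTT")

# Sequential estimation: update after every observation.
a, b = 1.0, 1.0                   # uniform prior Beta(1,1)
for flip in data:
    a, b = update(a, b, flip)

# Batch estimation: add all counts at once.
H, T = data.count("H"), data.count("T")
a_batch, b_batch = 1.0 + H, 1.0 + T

assert (a, b) == (a_batch, b_batch)          # same posterior either way

# Posterior mean under Beta(1,1) = add-one (Laplace) smoothing.
print(a / (a + b), (H + 1) / (H + T + 2))    # both 0.375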

Todayʼs reading

- Bishop, Pattern Recognition and Machine Learning, Ch. 2