Lecture 2: Conjugate Priors

CS598JHM: Advanced NLP (Spring '10)
Julia Hockenmaier, [email protected], 3324 Siebel Center
http://www.cs.uiuc.edu/class/sp10/cs598jhm

The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

    P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

Expectation: E[Bin(n,p)] = np
Variance: var(Bin(n,p)) = np(1-p)

Binomial likelihood

What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?

[Figure: the likelihood L(\theta; D=(\#H,\#T)) of the binomial, plotted as a function of \theta for D=(5,5), (3,7), and (2,8).]

Parameter estimation

Given a set of data D = HTTHTT, what is the probability \theta of heads?

- Maximum likelihood estimation (MLE): use the \theta which has the highest likelihood P(D \mid \theta):

    P(x = H \mid D) = P(x = H \mid \hat{\theta}) \quad \text{with } \hat{\theta} = \arg\max_\theta P(D \mid \theta)

- Bayesian estimation: compute the expectation of \theta given D:

    P(x = H \mid D) = \int_0^1 P(x = H \mid \theta)\, P(\theta \mid D)\, d\theta = E[\theta \mid D]

Maximum likelihood estimation

Find the \theta which maximizes the likelihood P(D \mid \theta):

    \theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \theta^H (1-\theta)^T = \frac{H}{H+T}

Bayesian statistics

Data D provides evidence for or against our beliefs. We update our belief about \theta based on the evidence we see:

    P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{\int P(\theta)\, P(D \mid \theta)\, d\theta}

The left-hand side is the posterior; the numerator is the prior P(\theta) times the likelihood P(D \mid \theta); the denominator is the marginal likelihood (= P(D)).

Bayesian estimation

Given a prior P(\theta) and a likelihood P(D \mid \theta), what is the posterior P(\theta \mid D)? And how do we choose the prior P(\theta)?

- The posterior is proportional to prior times likelihood: P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)
- The likelihood of a binomial is: P(D \mid \theta) = \theta^H (1-\theta)^T
- If the prior P(\theta) is proportional to powers of \theta and (1-\theta), the posterior will also be proportional to powers of \theta and (1-\theta):

    P(\theta) \propto \theta^a (1-\theta)^b \;\Rightarrow\; P(\theta \mid D) \propto \theta^a (1-\theta)^b\, \theta^H (1-\theta)^T = \theta^{a+H} (1-\theta)^{b+T}

In search of a prior...

We would like something of the form

    P(\theta) \propto \theta^a (1-\theta)^b

But this looks just like the binomial ... except that in the binomial, k is an integer, whereas \theta is a real number with 0 < \theta < 1.

The Gamma function

The Gamma function \Gamma(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:

    \Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx \quad \text{for } \alpha > 0

For x > 1, \Gamma(x) = (x-1)\Gamma(x-1). For positive integers, \Gamma(x) = (x-1)!.

[Figure: plot of \Gamma(x) for 0 < x < 5.]

The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters \alpha (\alpha > 0) and \beta (\beta > 0) if X has a continuous distribution with probability density function

    P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}

The first term is a normalization factor (needed to obtain a distribution):

    \int_0^1 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}\, dx = 1

Expectation: E[X] = \frac{\alpha}{\alpha+\beta}

Beta(\alpha, \beta) with \alpha > 1, \beta > 1: unimodal

[Figure: densities of Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), Beta(3,20).]
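To make the Beta density concrete, here is a minimal Python sketch (my own illustration, not from the slides; the helper name beta_pdf is hypothetical) that builds the density from the Gamma function and checks numerically that it integrates to 1 and that its mean is \alpha/(\alpha+\beta). Only the standard library's math.gamma is assumed.

    import math

    def beta_pdf(x, a, b):
        # Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)
        norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
        return norm * x ** (a - 1) * (1 - x) ** (b - 1)

    # Midpoint-rule checks on a fine grid, e.g. for Beta(3, 1.5):
    a, b = 3.0, 1.5
    n = 100000
    xs = [(i + 0.5) / n for i in range(n)]
    print(sum(beta_pdf(x, a, b) for x in xs) / n)       # ~1.0 (normalization)
    print(sum(x * beta_pdf(x, a, b) for x in xs) / n)   # ~0.667 (mean)
    print(a / (a + b))                                  # = 0.667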
Beta(\alpha, \beta) with \alpha < 1, \beta < 1: U-shaped

[Figure: densities of Beta(0.1,0.1), Beta(0.1,0.5), Beta(0.5,0.5).]

Beta(\alpha, \beta) with \alpha = \beta: symmetric

\alpha = \beta = 1 gives the uniform distribution.

[Figure: densities of Beta(0.1,0.1), Beta(1,1), Beta(2,2).]

Beta(\alpha, \beta) with \alpha < 1, \beta > 1: strictly decreasing

[Figure: densities of Beta(0.1,1.5), Beta(0.5,1.5), Beta(0.5,2).]

Beta(\alpha, \beta) with \alpha = 1, \beta > 1:

- \alpha = 1, 1 < \beta < 2: strictly concave
- \alpha = 1, \beta = 2: straight line
- \alpha = 1, \beta > 2: strictly convex

[Figure: densities of Beta(1,1.5), Beta(1,2), Beta(1,3).]

Beta as prior for binomial

Given a prior P(\theta \mid \alpha, \beta) = Beta(\alpha, \beta) and data D = (H, T), what is our posterior?

    P(\theta \mid \alpha, \beta, H, T) \propto P(H, T \mid \theta)\, P(\theta \mid \alpha, \beta)
                                       \propto \theta^H (1-\theta)^T\, \theta^{\alpha-1} (1-\theta)^{\beta-1}
                                       = \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}

With normalization:

    P(\theta \mid \alpha, \beta, H, T) = \frac{\Gamma(H+\alpha+T+\beta)}{\Gamma(H+\alpha)\Gamma(T+\beta)}\, \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1} = \text{Beta}(\alpha+H, \beta+T)

So, what do we predict?

Our Bayesian estimate for the next coin flip P(x = H \mid D):

    P(x = H \mid D) = \int_0^1 P(x = H \mid \theta)\, P(\theta \mid D)\, d\theta
                    = \int_0^1 \theta\, P(\theta \mid D)\, d\theta
                    = E[\theta \mid D]
                    = E[\text{Beta}(H+\alpha, T+\beta)]
                    = \frac{H+\alpha}{H+\alpha+T+\beta}

Conjugate priors

The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution. All members of the exponential family of distributions have conjugate priors. Examples:

- Multinomial: conjugate prior = Dirichlet
- Gaussian: conjugate prior = Gaussian

Multinomials: Dirichlet prior

Multinomial distribution: the probability of observing each possible outcome c_i exactly x_i times in a sequence of n independent trials:

    P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!}\, \theta_1^{x_1} \cdots \theta_K^{x_K} \quad \text{if } \sum_{i=1}^K x_i = n

Dirichlet prior:

    \text{Dir}(\theta \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}

More about conjugate priors

- We can interpret the hyperparameters as "pseudocounts" (see the sketch at the end of these notes).
- Sequential estimation (updating counts after each observation) gives the same results as batch estimation.
- Add-one smoothing (Laplace smoothing) = uniform prior.
- On average, more data leads to a sharper posterior (sharper = lower variance).

Today's reading

Bishop, Pattern Recognition and Machine Learning, Ch. 2
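As a closing illustration (my own sketch, not from the slides), the following Python snippet runs the Beta-binomial update on the earlier example D = HTTHTT: it compares the MLE H/(H+T) with the Bayesian predictive estimate (H+\alpha)/(H+\alpha+T+\beta), shows that the uniform prior Beta(1,1) reproduces add-one (Laplace) smoothing, and confirms that sequential updating matches the batch result.

    # Beta-binomial conjugate update for D = HTTHTT (H = 2 heads, T = 4 tails).
    D = "HTTHTT"
    H, T = D.count("H"), D.count("T")

    # MLE: theta* = H / (H + T)
    mle = H / (H + T)

    # Bayesian predictive estimate under a Beta(a, b) prior:
    # P(x=H | D) = E[Beta(H+a, T+b)] = (H+a) / (H+a+T+b).
    # With the uniform prior a = b = 1 this is add-one (Laplace) smoothing.
    a, b = 1.0, 1.0
    batch = (H + a) / (H + a + T + b)

    # Sequential updating: treat the hyperparameters as pseudocounts and
    # fold in one observation at a time; the posterior after all of D is
    # the same Beta(a+H, b+T) as in the batch computation.
    sa, sb = a, b
    for x in D:
        if x == "H":
            sa += 1
        else:
            sb += 1
    seq = sa / (sa + sb)  # mean of Beta(a+H, b+T)

    print(mle)    # 0.333... (= 2/6)
    print(batch)  # 0.375    (= 3/8)
    print(seq)    # 0.375, identical to the batch estimate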
