
CS598JHM: Advanced NLP (Spring ʼ10)

Lecture 2: Conjugate priors

Julia Hockenmaier
[email protected]
3324 Siebel Center
http://www.cs.uiuc.edu/class/sp10/cs598jhm


The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

Expectation: E(Bin(n,p)) = np
Variance: var(Bin(n,p)) = np(1-p)
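As a quick numerical check of these formulas, here is a short sketch I have added (not part of the original slides; n=10, p=0.3 and the helper name binomial_pmf are arbitrary choices):

from math import comb

def binomial_pmf(k, n, p):
    # P(k heads) = C(n,k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3                      # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                               # ~1.0
mean  = sum(k * pk for k, pk in enumerate(pmf))                # ~np = 3.0
var   = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # ~np(1-p) = 2.1
print(total, mean, var)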

Binomial likelihood

What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?

[Plot: the binomial likelihood L(θ; D=(#Heads, #Tails)) as a function of θ, for D=(5,5), D=(3,7), and D=(2,8)]


Parameter estimation

Given a set of data D=HTTHTT, what is the probability θ of heads?

- Maximum likelihood estimation (MLE):
  Use the θ which has the highest likelihood P(D|θ):

  P(x=H \mid D) = P(x=H \mid \theta^*) \quad \text{with} \quad \theta^* = \arg\max_\theta P(D \mid \theta)

- Bayesian estimation:
  Compute the expectation of θ given D:

  P(x=H \mid D) = \int_0^1 P(x=H \mid \theta)\, P(\theta \mid D)\, d\theta = E[\theta \mid D]

Maximum likelihood estimation

- Maximum likelihood estimation (MLE):
  find the θ which maximizes the likelihood P(D|θ):

  \theta^* = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \theta^H (1-\theta)^T = \frac{H}{H+T}


Bayesian updates

- Data D provides evidence for or against our beliefs.
  We update our belief θ based on the evidence we see:

  P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{\int_\theta P(\theta)\, P(D \mid \theta)\, d\theta}

  (left-hand side: posterior; numerator: prior × likelihood; denominator: = P(D))
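To see the two estimates side by side, here is a small added sketch (not from the slides) that evaluates the posterior on a grid for D=HTTHTT, assuming a uniform prior:

import numpy as np

H, T = 2, 4                         # D = HTTHTT: 2 heads, 4 tails
thetas = np.linspace(0, 1, 10001)   # grid over possible values of theta

likelihood = thetas**H * (1 - thetas)**T
prior = np.ones_like(thetas)        # uniform prior over [0, 1]

# Bayes' rule on the grid: posterior weights proportional to prior * likelihood.
weights = prior * likelihood
weights /= weights.sum()

mle = H / (H + T)                            # arg max of the likelihood = 1/3
posterior_mean = (thetas * weights).sum()    # E[theta|D] ~ (H+1)/(H+T+2) = 0.375
print(mle, posterior_mean)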


Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ), what is the posterior P(θ|D)?
How do we choose the prior P(θ)?

- The posterior is proportional to prior × likelihood:

  P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)

- The likelihood of a binomial is:

  P(D \mid \theta) = \theta^H (1-\theta)^T

- If the prior P(θ) is proportional to powers of θ and (1-θ), the posterior will also be proportional to powers of θ and (1-θ):

  P(\theta) \propto \theta^a (1-\theta)^b


In search of a prior...

We would like something of the form:

P(\theta) \propto \theta^a (1-\theta)^b

But -- this looks just like the binomial:

P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

... except that k is an integer and θ is a real with 0 < θ < 1.
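The closure argument (prior and posterior share the same functional form) can be checked symbolically. A minimal added sketch using sympy; the symbol names are illustrative:

import sympy as sp

theta, a, b, H, T = sp.symbols("theta a b H T", positive=True)

prior = theta**a * (1 - theta)**b           # any prior of this form...
likelihood = theta**H * (1 - theta)**T      # ...times the binomial likelihood...
posterior = sp.powsimp(prior * likelihood)  # ...is again a product of such powers
print(posterior)                            # -> theta**(H+a) * (1-theta)**(T+b)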

The Gamma function

The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:

\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx \quad \text{for } \alpha > 0

For x > 1, Γ(x) = (x-1)Γ(x-1).
For positive integers, Γ(x) = (x-1)!

[Plot: the function Γ(x) for 0 < x < 5]
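Both properties are easy to verify with Python's math.gamma; a short added check (the test point x=3.7 is arbitrary):

from math import gamma, factorial

# Gamma generalizes the factorial: Gamma(n) = (n-1)! for positive integers.
for n in range(1, 6):
    assert abs(gamma(n) - factorial(n - 1)) < 1e-9

# Recurrence Gamma(x) = (x-1) * Gamma(x-1), checked at a non-integer point.
x = 3.7
print(gamma(x), (x - 1) * gamma(x - 1))   # both ~4.17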


The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function

P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}

The first term is a normalization factor (to obtain a distribution):

\int_0^1 \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}\, dx = 1

Expectation: \frac{\alpha}{\alpha+\beta}


Beta(α,β) with α > 1, β > 1

Unimodal.
[Plot: the densities of Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), and Beta(3,20) over θ in (0,1)]
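A small added sanity check (not from the slides; α=3 and β=1.5 are arbitrary, and beta_pdf is a hypothetical helper) that the normalized density integrates to 1 and has mean α/(α+β):

from math import gamma

def beta_pdf(x, a, b):
    # Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

a, b = 3.0, 1.5                       # illustrative hyperparameters
xs = [i / 10000 for i in range(1, 10000)]
dx = 1 / 10000

total = sum(beta_pdf(x, a, b) for x in xs) * dx      # ~1.0 (normalized)
mean  = sum(x * beta_pdf(x, a, b) for x in xs) * dx  # ~a/(a+b) = 0.667
print(total, mean)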

Beta(α,β) with α < 1, β < 1

U-shaped.
[Plot: the densities of Beta(0.1,0.1), Beta(0.1,0.5), and Beta(0.5,0.5) over θ in (0,1)]


Beta(α,β) with α = β

Symmetric. α = β = 1: uniform.
[Plot: the densities of Beta(0.1,0.1), Beta(1,1), and Beta(2,2) over θ in (0,1)]


Beta(α,β) with α < 1, β > 1

Strictly decreasing.
[Plot: the densities of Beta(0.1,1.5), Beta(0.5,1.5), and Beta(0.5,2) over θ in (0,1)]


Beta(α,β) with α = 1, β > 1

α = 1, 1 < β < 2: strictly concave.
α = 1, β = 2: straight line.
α = 1, β > 2: strictly convex.
[Plot: the densities of Beta(1,1.5), Beta(1,2), and Beta(1,3) over θ in (0,1)]

Beta as prior for binomial

Given a prior P(θ|α,β) = Beta(α,β) and data D=(H,T), what is our posterior?

P(\theta \mid \alpha, \beta, H, T) \propto P(H, T \mid \theta)\, P(\theta \mid \alpha, \beta)
                                   \propto \theta^H (1-\theta)^T\, \theta^{\alpha-1} (1-\theta)^{\beta-1}
                                   = \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}

With normalization:

P(\theta \mid \alpha, \beta, H, T) = \frac{\Gamma(H+\alpha+T+\beta)}{\Gamma(H+\alpha)\,\Gamma(T+\beta)}\, \theta^{H+\alpha-1} (1-\theta)^{T+\beta-1}
                                   = \mathrm{Beta}(\alpha+H, \beta+T)


So, what do we predict?

Our Bayesian estimate for the next coin flip P(x=H|D):

P(x=H \mid D) = \int_0^1 P(x=H \mid \theta)\, P(\theta \mid D)\, d\theta
              = \int_0^1 \theta\, P(\theta \mid D)\, d\theta
              = E[\theta \mid D]
              = E[\mathrm{Beta}(H+\alpha, T+\beta)]
              = \frac{H+\alpha}{H+\alpha+T+\beta}
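A minimal added sketch of this update (the Beta(2,2) prior is an arbitrary choice; the data is the D=HTTHTT example from earlier):

def beta_mean(a, b):
    # E[Beta(a, b)] = a / (a + b)
    return a / (a + b)

alpha, beta = 2.0, 2.0        # prior Beta(2,2): weak belief that the coin is fair
H, T = 2, 4                   # observed data D = HTTHTT

# Conjugacy: the posterior is again a Beta distribution.
post_alpha, post_beta = alpha + H, beta + T          # Beta(4, 6)

# Predictive probability of heads = posterior mean.
p_heads = beta_mean(post_alpha, post_beta)           # (H+alpha)/(H+alpha+T+beta) = 0.4
print(p_heads)

Because the posterior is again a Beta distribution, prediction reduces to reading off the posterior mean in closed form; no integration is needed.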

Conjugate priors

The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution.

All members of the exponential family of distributions have conjugate priors.

Examples:
- Multinomial: conjugate prior = Dirichlet
- Gaussian: conjugate prior = Gaussian


Multinomials: Dirichlet prior

Probability of observing each possible outcome c_i exactly X_i times in a sequence of n independent trials:

P(X_1 = x_1, ..., X_K = x_K) = \frac{n!}{x_1! \cdots x_K!}\, \theta_1^{x_1} \cdots \theta_K^{x_K} \quad \text{if } \sum_{i=1}^K x_i = n

Dirichlet prior:

\mathrm{Dir}(\theta \mid \alpha_1, ..., \alpha_K) = \frac{\Gamma(\alpha_1 + ... + \alpha_K)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^K \theta_k^{\alpha_k - 1}
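The same conjugacy argument applies here: a Dirichlet prior plus multinomial counts yields a Dirichlet posterior. A small added sketch (the vocabulary, counts, and hyperparameters are made up):

vocab = ["a", "b", "c"]
alpha = {w: 1.0 for w in vocab}       # symmetric Dirichlet prior: pseudocount 1 per outcome
counts = {"a": 5, "b": 2, "c": 0}     # observed multinomial counts

# Conjugacy: posterior is Dirichlet with parameters alpha_k + x_k.
post = {w: alpha[w] + counts[w] for w in vocab}

# Predictive probability of each outcome = posterior mean of theta_k.
total = sum(post.values())
pred = {w: post[w] / total for w in vocab}
print(pred)    # {'a': 0.6, 'b': 0.3, 'c': 0.1}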

More about conjugate priors

- We can interpret the hyperparameters as “pseudocounts” (see the sketch after this list)

- Sequential estimation (updating counts after each observation) gives the same results as batch estimation

- Add-one smoothing (Laplace smoothing) = uniform prior

- On average, more data leads to a sharper posterior (sharper = lower variance)
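A short added sketch tying the first three points together (the flip sequence reuses the D=HTTHTT example; the update helper is hypothetical):

def update(a, b, flip):
    # One sequential Bayesian update: add 1 to the matching pseudocount.
    return (a + 1, b) if flip == "H" else (a, b + 1)

data = list("HTTHTT")

# Sequential estimation: update after every observation.
a, b = 1.0, 1.0                   # uniform prior Beta(1,1)
for flip in data:
    a, b = update(a, b, flip)

# Batch estimation: add all counts at once.
H, T = data.count("H"), data.count("T")
a_batch, b_batch = 1.0 + H, 1.0 + T

assert (a, b) == (a_batch, b_batch)          # same posterior either way

# Posterior mean under Beta(1,1) = add-one (Laplace) smoothing.
print(a / (a + b), (H + 1) / (H + T + 2))    # both 0.375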

Todayʼs reading

- Bishop, Pattern Recognition and Machine Learning, Ch. 2