Introduction to Bayesian methods (Lecture 10)

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Bayesian learning

• Bayesian learning uses probability to model data and quantify uncertainty of predictions
  – Facilitates incorporation of prior knowledge
  – Gives optimal predictions
• Allows for decision-theoretic reasoning

Your first consulting job

• A billionaire from the suburbs of Manhattan asks you a question:
  – He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  – You say: Please flip it a few times:

– You say: The probability is:
  • P(heads) = 3/5
– He says: Why???
– You say: Because…

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

Random Variables

• A random variable is some aspect of the world about which we (may) have uncertainty – R = Is it raining? – D = How long will it take to drive to work? – L = Where am I?

• We denote random variables with capital letters

• Random variables have domains – R in {true, false} (sometimes write as {+r, ¬r}) – D in [0, ∞) – L in possible locations, maybe {(0,0), (0,1), …} Probability Distributions

• Discrete random variables have distributions

P(T):               P(W):
T      P            W       P
warm   0.5          sun     0.6
cold   0.5          rain    0.1
                    fog     0.3
                    meteor  0.0

• A discrete distribution is a TABLE of probabilities of values • The probability of a state (lower case) is a single number

• Must have: P(x) ≥ 0 for every value x, and Σ_x P(x) = 1

Joint Distributions

• A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment: P(X1 = x1, …, Xn = xn)

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

– How many assignments if n variables with domain sizes d? (d^n)
– Must obey: P(x1, …, xn) ≥ 0 and Σ_{x1,…,xn} P(x1, …, xn) = 1

• For all but the smallest distributions, impractical to write out or estimate – Instead, we make additional assumptions about the distribution Marginal Distributions

• Marginal distributions are sub-tables which eliminate variables • Marginalization (summing out): Combine collapsed rows by adding

Joint distribution P(T, W):
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P(t) = Σ_w P(t, w):      P(w) = Σ_t P(t, w):
T     P                  W     P
hot   0.5                sun   0.6
cold  0.5                rain  0.4

Conditional Probabilities
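To make marginalization concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides) that sums out one variable of the joint table above; the dictionary representation is just an assumption for the example.

from collections import defaultdict

# Joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginalize(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    marginal = defaultdict(float)
    for assignment, p in joint.items():
        marginal[assignment[axis]] += p
    return dict(marginal)

print(marginalize(joint, axis=0))  # P(T): {'hot': 0.5, 'cold': 0.5}
print(marginalize(joint, axis=1))  # P(W): {'sun': 0.6, 'rain': 0.4}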

• A simple relation between joint and conditional probabilities
  – In fact, this is taken as the definition of a conditional probability:

    P(a | b) = P(a, b) / P(b)

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Conditional Distributions

• Conditional distributions are probability distributions over some variables given fixed values of others

Conditional Distributions:

P(W | T = hot):       P(W | T = cold):
W     P               W     P
sun   0.8             sun   0.4
rain  0.2             rain  0.6

Joint Distribution:
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

The Product Rule

• Sometimes have conditional distributions but want the joint: P(x, y) = P(x | y) P(y)

• Example:

P(D | W):             P(W):            P(D, W) = P(D | W) P(W):
D    W     P          W     P          D    W     P
wet  sun   0.1        sun   0.8        wet  sun   0.08
dry  sun   0.9        rain  0.2        dry  sun   0.72
wet  rain  0.7                         wet  rain  0.14
dry  rain  0.3                         dry  rain  0.06

Bayes’ Rule

• Two ways to factor a joint distribution over two variables:

  P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

  P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
  – Lets us build one conditional from its reverse
  – Often one conditional is tricky but the other one is simple
  – Foundation of many practical systems (e.g. speech recognition, machine translation)
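As a quick worked illustration (added here, using the product-rule tables above), Bayes' rule recovers P(W | D = wet) from P(D | W) and P(W):

# P(D | W) and P(W) from the product-rule example above.
p_d_given_w = {("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
               ("wet", "rain"): 0.7, ("dry", "rain"): 0.3}
p_w = {"sun": 0.8, "rain": 0.2}

def bayes(d):
    """P(W = w | D = d) = P(D = d | W = w) P(W = w) / P(D = d)."""
    unnormalized = {w: p_d_given_w[(d, w)] * p_w[w] for w in p_w}
    z = sum(unnormalized.values())         # P(D = d), the normalization constant
    return {w: p / z for w, p in unnormalized.items()}

print(bayes("wet"))   # {'sun': 0.3636..., 'rain': 0.6363...}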

• In the running for most important ML equation! Returning to thumbtack example…

• P(Heads) = θ, P(Tails) = 1-θ …

• Flips are i.i.d.: D = {xi | i = 1…n}, P(D | θ) = Πi P(xi | θ)
  – Independent events
  – Identically distributed according to a Bernoulli distribution

• Sequence D of αH heads and αT tails: P(D | θ) = θ^αH (1 - θ)^αT

Called the “likelihood” of the data under the model

Maximum Likelihood Estimation

• Data: Observed set D of αH heads and αT tails
• Hypothesis: Bernoulli distribution
• Learning: finding θ is an optimization problem
  – What's the objective function?

• MLE: Choose θ to maximize probability of D

Your first parameter learning algorithm

  θ_MLE = argmax_θ ln P(D | θ) = argmax_θ ln [ θ^αH (1 - θ)^αT ]

• Set derivative to zero, and solve!

  d/dθ ln P(D | θ) = d/dθ [ αH ln θ + αT ln(1 - θ) ] = αH/θ - αT/(1 - θ) = 0

  ⟹ θ_MLE = αH / (αH + αT)
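A minimal Python sketch of this estimator (added as an illustration, not from the lecture), using the 3-heads-out-of-5 thumbtack data from the consulting example:

def bernoulli_mle(flips):
    """MLE for P(heads): theta_hat = alpha_H / (alpha_H + alpha_T)."""
    alpha_h = sum(1 for x in flips if x == "H")
    alpha_t = len(flips) - alpha_h
    return alpha_h / (alpha_h + alpha_t)

print(bernoulli_mle(["H", "H", "T", "H", "T"]))   # 0.6 = 3/5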


[Plot: the log-likelihood L(θ; D) = ln P(D | θ) of the data as a function of θ]

What if I have prior beliefs?

• Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: in the beginning, the prior Pr(θ); after observing flips (e.g. {tails, tails}), the posterior Pr(θ | D)]

Bayesian Learning

• Use Bayes' rule!

  P(θ | D) = P(D | θ) P(θ) / P(D)

  (Posterior = Data Likelihood × Prior / Normalization)

• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation!


Bayesian Learning for Thumbtacks

Likelihood: P(D | θ) = θ^αH (1 - θ)^αT

• What should the prior be? – Represent expert knowledge – Simple posterior form

• For binary variables, the commonly used prior is the Beta distribution:

  P(θ) = Beta(βH, βT) ∝ θ^(βH - 1) (1 - θ)^(βT - 1)

  [Figure: Beta prior distribution P(θ)]

• Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior distribution has a particularly simple form:

  P(θ | D) ∝ θ^αH (1 - θ)^αT · θ^(βH - 1) (1 - θ)^(βT - 1)
           = θ^(αH + βH - 1) (1 - θ)^(αT + βT - 1)
           = Beta(αH + βH, αT + βT)
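A small Python sketch of this conjugate update (an illustration added here; the prior pseudo-counts βH = βT = 5 are an assumed choice encoding the billionaire's "close to 50-50" belief):

def beta_bernoulli_update(beta_h, beta_t, flips):
    """Posterior is Beta(alpha_H + beta_H, alpha_T + beta_T) after observing flips."""
    alpha_h = sum(1 for x in flips if x == "H")
    alpha_t = len(flips) - alpha_h
    return beta_h + alpha_h, beta_t + alpha_t

post_h, post_t = beta_bernoulli_update(5, 5, ["H", "H", "T", "H", "T"])
print(post_h, post_t)                 # Beta(8, 7)
print(post_h / (post_h + post_t))     # posterior mean of theta = 8/15 ≈ 0.533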

Using for prediction

• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:

  E[f(θ)] = ∫ f(θ) P(θ | D) dθ

• Integral is often hard to compute
• As more data is observed, the posterior becomes more concentrated
• MAP (maximum a posteriori) approximation: use the most likely parameter, θ_MAP = argmax_θ P(θ | D), to approximate the expectation: E[f(θ)] ≈ f(θ_MAP)

What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…

Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b  ⟹  Y ~ N(aµ + b, a²σ²)

• Sum of independent Gaussians is Gaussian
  – X ~ N(µX, σ²X)
  – Y ~ N(µY, σ²Y)
  – Z = X + Y  ⟹  Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!

Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – µ ("mean")
  – σ ("standard deviation")

i      Exam score xi
0      85
1      95
2      100
3      12
…      …
99     89

MLE for Gaussian:

• Prob. of i.i.d. samples D = {x1, …, xN}:

  P(D | µ, σ) = (1 / (σ√(2π)))^N Π_i exp( -(xi - µ)² / (2σ²) )

• MLE:

  µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

• Log-likelihood of data:

  ln P(D | µ, σ) = -N ln(σ√(2π)) - Σ_{i=1}^N (xi - µ)² / (2σ²)

Your second learning algorithm: MLE for mean of a Gaussian

• What's the MLE for the mean?

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dµ ln P(D | µ, σ) = Σ_{i=1}^N (xi - µ) / σ² = 0

⟹ Σ_{i=1}^N xi - N µ = 0   ⟹   µ_MLE = (1/N) Σ_{i=1}^N xi



MLE for variance

• Again, set derivative to zero:

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dσ ln P(D | µ, σ) = -N/σ + Σ_{i=1}^N (xi - µ)² / σ³ = 0

⟹ σ²_MLE = (1/N) Σ_{i=1}^N (xi - µ_MLE)²


Learning Gaussian parameters

• MLE:

  µ_MLE = (1/N) Σ_{i=1}^N xi
  σ²_MLE = (1/N) Σ_{i=1}^N (xi - µ_MLE)²
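A brief Python sketch of these estimators (an added illustration, not the lecture's code), including the unbiased variance estimator discussed just below:

def gaussian_mle(xs):
    """Sample mean, biased (1/N) MLE variance, and unbiased (1/(N-1)) variance."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

scores = [85, 95, 100, 12, 89]   # a few of the exam scores from the table above
print(gaussian_mle(scores))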

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator: σ²_unbiased = (1/(N - 1)) Σ_{i=1}^N (xi - µ_MLE)²

Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for mean:

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

Bayesian Classification

• Problem statement:

– Given features X1,X2,…,Xn – Predict a label Y

[Next several slides adapted from: Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Example Application

• Digit Recognition

[Figure: an image of a handwritten digit is fed into the classifier, which outputs “5”]

• X1, …, Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {0,1,2,3,4,5,6,7,8,9}

The Bayes Classifier

• If we had the joint distribution on X1, …, Xn and Y, we could predict using:

  argmax_y P(Y = y | X1 = x1, …, Xn = xn)

– (for example: what is the probability that the image represents a 5 given its pixels?)

• So … How do we compute that? The Bayes Classifier

• Use Bayes' Rule!

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

  (likelihood × prior in the numerator; the denominator is the normalization constant)

• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label The Bayes Classifier

• Let's expand this for our digit recognition task:

  P(Y = y | X1, …, Xn) ∝ P(X1, …, Xn | Y = y) P(Y = y),  for y = 0, 1, …, 9

• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest

Model Parameters

• How many parameters are required to specify the likelihood,

P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 pixels, that is 2^900 - 1 parameters for each of the 10 classes)

• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)

Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given class:

    P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

– More generally: P(X1, …, Xn | Y) = Π_{i=1}^n P(Xi | Y)

• How many parameters now?
• Suppose X is composed of n binary features (then only O(n) parameters per class, instead of O(2^n))

The Naïve Bayes Classifier

• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have likelihood P(Xi | Y)

  [Graphical model: class Y with children X1, X2, …, Xn]

• Decision rule:

  y* = argmax_y P(y) Π_{i=1}^n P(xi | y)

If certain assumptions hold, NB is the optimal classifier! (they typically don't)

• Input: pixel grids

• Output: a digit 0-9

Are the naïve Bayes assumptions realistic here?

What has to be learned?

The class prior P(Y) and the per-pixel conditionals, e.g.:

Y    P(Y)    P(Xi = on | Y)    P(Xj = on | Y)
1    0.1     0.01              0.05
2    0.1     0.05              0.01
3    0.1     0.05              0.90
4    0.1     0.30              0.80
5    0.1     0.80              0.90
6    0.1     0.90              0.90
7    0.1     0.05              0.25
8    0.1     0.60              0.85
9    0.1     0.50              0.60
0    0.1     0.80              0.80

(Xi and Xj here are two example pixel features.)

MLE for the parameters of NB

• Given dataset
  – Count(A = a, B = b) ≜ number of examples where A = a and B = b
• MLE for discrete NB, simply:
  – Prior:
    P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
  – Observation distribution:
    P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σ_{x'} Count(Xi = x', Y = y)

MLE for the parameters of NB

• Training amounts to, for each of the classes, averaging all of the examples together:

MAP estimation for NB

• Given dataset
  – Count(A = a, B = b) ≜ number of examples where A = a and B = b
• MAP estimation for discrete NB, simply:
  – Prior:
    P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
  – Observation distribution:
    P(Xi = x | Y = y) = (Count(Xi = x, Y = y) + a) / (Σ_{x'} Count(Xi = x', Y = y) + |Xi| · a)
• Called "smoothing". Corresponds to a Dirichlet prior!
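To make the counting concrete, here is a compact Python sketch of training and prediction for a discrete naïve Bayes classifier with this kind of smoothing (an illustration under assumed binary features, not code from the lecture):

from collections import Counter, defaultdict
from math import log

def train_nb(X, y, a=1.0, num_values=2):
    """MAP estimates with pseudo-count a (Laplace/Dirichlet smoothing)."""
    n = len(X[0])
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    feat_counts = defaultdict(float)   # (class, feature index, value) -> count
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            feat_counts[(c, i, v)] += 1
    def likelihood(c, i, v):
        return (feat_counts[(c, i, v)] + a) / (class_counts[c] + num_values * a)
    return prior, likelihood, n

def predict_nb(xs, prior, likelihood, n):
    """Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)."""
    scores = {c: log(p) + sum(log(likelihood(c, i, xs[i])) for i in range(n))
              for c, p in prior.items()}
    return max(scores, key=scores.get)

# Tiny hypothetical dataset with 3 binary features.
X = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
y = ["pos", "pos", "neg", "neg"]
prior, lik, n = train_nb(X, y)
print(predict_nb((1, 0, 0), prior, lik, n))   # -> "pos"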


What about if there is missing data?

• One of the key strengths of Bayesian approaches is that they can naturally handle missing data

Missing data

 Suppose we don't have a value for some attribute Xi
   applicant's credit history unknown
   some medical test not performed on the patient
 How to compute P(X1=x1 … Xj=? … Xd=xd | y)?
 Easy with Naïve Bayes
   ignore the attribute in instances where its value is missing
   compute the likelihood based on observed attributes
   no need to "fill in" or explicitly model missing values
   based on conditional independence between attributes

Missing data (2)

• Ex: three coin tosses: Event = {X1=H, X2=?, X3=T}
• event = head, unknown (either head or tail), tail
• event = {H,H,T} + {H,T,T}
• P(event) = P(H,H,T) + P(H,T,T)
• General case: Xj has missing value – sum P(x1, …, xj, …, xd | y) over all values of xj; under naïve Bayes this simply drops the P(Xj | y) factor, since Σ_xj P(xj | y) = 1

[Slide from Victor Lavrenko and Nigel Goddard]
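The sketch below (my illustration, not the lecture's code, with made-up parameters) shows how naïve Bayes handles a missing feature by simply skipping its factor, exactly as described above:

from math import log

# Hypothetical per-class parameters P(Xi = 1 | Y) for 3 binary features, plus priors.
prior = {"pos": 0.5, "neg": 0.5}
p_on = {"pos": [0.8, 0.6, 0.7], "neg": [0.2, 0.5, 0.3]}

def nb_score(xs, c):
    """log P(c) plus log-likelihood of the *observed* features only; None = missing."""
    score = log(prior[c])
    for i, v in enumerate(xs):
        if v is None:
            continue            # missing value: its factor sums to 1, so just skip it
        p = p_on[c][i] if v == 1 else 1 - p_on[c][i]
        score += log(p)
    return score

x = (1, None, 0)                # second feature unobserved
print(max(prior, key=lambda c: nb_score(x, c)))   # -> "pos"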


Summary

 Naïve Bayes
  - explicitly handles class priors
  - "normalizes" across observations: outliers comparable
  - assumption: all dependence is "explained" by class label
 Continuous example
  - unable to handle correlated data
 Discrete example
  - fooled by repetitions
  - must deal with zero-frequency problem
 Pros:
  - handles missing data
  - good computational complexity
  - incremental updates

Computational complexity

• One of the fastest learning methods
• O(nd + cd) training time complexity
  • c … number of classes
  • n … number of instances
  • d … number of dimensions (attributes)
  • both learning and prediction
  • no hidden constants (number of iterations, etc.)
• testing: O(ndc)
• O(dc) space complexity
  • only decision trees are more compact

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then the Naive Bayes classifier is defined by

  x ↦ sgn(w · x + b),

where

  wi = log( Pr[xi = 1 | +1] / Pr[xi = 1 | -1] ) - log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] )

and

  b = log( Pr[+1] / Pr[-1] ) + Σ_{i=1}^N log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ).

Proof: observe that for any i ∈ [1, N],

  log( Pr[xi | +1] / Pr[xi | -1] )
    = [ log( Pr[xi = 1 | +1] / Pr[xi = 1 | -1] ) - log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ) ] xi
      + log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ).
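A quick numerical check of this identity (an illustration added here, with made-up probabilities): the naive Bayes log-odds and the linear score w · x + b agree on every binary input.

from itertools import product
from math import log, isclose

# Hypothetical NB parameters: Pr[x_i = 1 | class] for classes +1 and -1, and class priors.
p_pos = [0.8, 0.3, 0.6]          # Pr[x_i = 1 | +1]
p_neg = [0.4, 0.5, 0.2]          # Pr[x_i = 1 | -1]
prior_pos, prior_neg = 0.7, 0.3

w = [log(p_pos[i] / p_neg[i]) - log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3)]
b = log(prior_pos / prior_neg) + sum(log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3))

for x in product([0, 1], repeat=3):
    log_odds = log(prior_pos / prior_neg) + sum(
        log((p_pos[i] if xi else 1 - p_pos[i]) / (p_neg[i] if xi else 1 - p_neg[i]))
        for i, xi in enumerate(x))
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    assert isclose(log_odds, linear)
print("NB log-odds equals w·x + b on all 8 binary inputs")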

[Slide from Mehryar Mohri]

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

[Next several slides adapted from: Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Logistic Regression

 Learn P(Y|X) directly!  Assume a particular functional form ✬ ? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0 ✬ But, this is not differentiable (hard to learn)… doesn’t allow for label noise...

[Figure: a hard threshold would give P(Y=1|X) = 1 on one side of a linear boundary and P(Y=1|X) = 0 on the other]

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (sigmoid)

  σ(z) = 1 / (1 + e^(-z))

• Sigmoid applied to a linear function of the data:

  P(Y = 1 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

  and, since the two probabilities must sum to 1,

  P(Y = 0 | X) = exp(w0 + Σ_{i=1}^n wi Xi) / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

• Features can be discrete or continuous!

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

[Plot: the sigmoid rising from 0 to 1 as its argument increases]

Features can be discrete or continuous!

Logistic Regression: decision boundary

• Prediction: Output the Y with highest P(Y|X)
  – For binary Y, output Y = 0 if

    1 < P(Y = 0 | X) / P(Y = 1 | X)

    Substituting the expressions above, this becomes

    1 < exp(w0 + Σ_{i=1}^n wi Xi),

    i.e. a linear decision rule: predict Y = 0 when w0 + Σ_i wi Xi > 0
3.1 Form of P(Y|X) for a Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the logistic form given above. In particular, consider a GNB based on the following modeling assumptions:

• Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

• X = ⟨X1, …, Xn⟩, where each Xi is a continuous random variable

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood

Discriminative (Logistic Regr.) maximizes the conditional data likelihood

Focuses only on learning P(Y|X), which is all that matters for classification

Maximizing Conditional Log Likelihood

l(w) = ln Π_j P(y^j | x^j, w) = Σ_j [ y^j (w0 + Σ_i wi xi^j) - ln(1 + exp(w0 + Σ_i wi xi^j)) ]

(each y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w! No local maxima
Concave functions are easy to optimize

Optimizing a concave function – Gradient ascent

• Conditional likelihood for Logistic Regression is concave!
• Gradient:

Learning rate, η > 0. Update rule:

  wi ← wi + η ∂l(w)/∂wi

Maximize Conditional Log Likelihood: Gradient ascent

  ∂l(w)/∂wi = ∂/∂wi Σ_j [ y^j (w0 + Σ_i wi xi^j) - ln(1 + exp(w0 + Σ_i wi xi^j)) ]
            = Σ_j xi^j [ y^j - exp(w0 + Σ_i wi xi^j) / (1 + exp(w0 + Σ_i wi xi^j)) ]
            = Σ_j xi^j [ y^j - P(Y = 1 | x^j, w) ]
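A compact Python sketch of this update (an added illustration with toy data and an assumed learning rate; it uses the convention P(Y=1|x,w) = 1/(1 + exp(-(w0 + Σi wi xi))), which matches the gradient above):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient_ascent(X, y, eta=0.1, iters=1000):
    """Gradient ascent on the conditional log likelihood of logistic regression."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(iters):
        # Gradient: sum_j x_i^j (y^j - P(Y=1|x^j, w)); the intercept uses x_0 = 1.
        g0, g = 0.0, [0.0] * n
        for xs, yj in zip(X, y):
            p = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xs)))
            g0 += yj - p
            for i in range(n):
                g[i] += xs[i] * (yj - p)
        w0 += eta * g0
        for i in range(n):
            w[i] += eta * g[i]
    return w0, w

# Toy data (hypothetical): y tends to be 1 when the first feature is large.
X = [(0.1, 1.0), (0.4, 0.8), (2.0, 0.2), (2.5, 0.1)]
y = [0, 0, 1, 1]
print(lr_gradient_ascent(X, y))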


Gradient Ascent for LR

Gradient ascent algorithm: (η > 0)
do:
  w0 ← w0 + η Σ_j [ y^j - P(Y = 1 | x^j, w) ]
  For i = 1 to n: (iterate over features)
    wi ← wi + η Σ_j xi^j [ y^j - P(Y = 1 | x^j, w) ]
until "change" < ε

Loop over training examples!

Naïve Bayes vs. Logistic Regression

Learning: h: X → Y
  X – features
  Y – target classes

Generative (Naïve Bayes):
• Assume functional form for
  – P(X|Y), assuming conditional independence
  – P(Y)
  – Est. params from training data
• Gaussian NB for continuous features
• Bayes rule to calc. P(Y|X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative (Logistic Regression):
• Assume functional form for
  – P(Y|X), no assumptions
  – Est. params from training data
• Handles discrete & continuous features
• Directly calculate P(Y|X = x)
  – Can't generate a data sample

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers
• Asymptotic comparison (# training examples → infinity)
  – when model correct
    • NB, Linear Discriminant Analysis (with class-independent variances), and Logistic Regression produce identical classifiers

  – when model incorrect
    • LR is less biased – does not assume conditional independence
    • therefore LR expected to outperform NB

Naïve Bayes vs. Logistic Regression

[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers
• Non-asymptotic analysis
  – convergence rate of parameter estimates (n = # of attributes in X)
    • Size of training data to get close to the infinite-data solution
    • Naïve Bayes needs O(log n) samples
    • Logistic Regression needs O(n) samples

  – Naïve Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates

[Plot legend: naïve Bayes vs. logistic regression]

Some experiments from UCI data sets
