Introduction to Bayesian methods (Lecture 10)

David Sontag New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate

Bayesian learning

• Bayesian learning uses probability to model data and quantify uncertainty of predictions
  – Facilitates incorporation of prior knowledge
  – Gives optimal predictions
• Allows for decision-theoretic reasoning

Your first consulting job

• A billionaire from the suburbs of Manhattan asks you a question:
  – He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  – You say: Please flip it a few times:

– You say: The probability is:
  • P(heads) = 3/5
– He says: Why???
– You say: Because…

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

Random Variables

• A random variable is some aspect of the world about which we (may) have uncertainty – R = Is it raining? – D = How long will it take to drive to work? – L = Where am I?

• We denote random variables with capital letters

• Random variables have domains – R in {true, false} (sometimes write as {+r, ¬r}) – D in [0, ∞) – L in possible locations, maybe {(0,0), (0,1), …} Probability Distributions

• Discrete random variables have distributions

P(T):               P(W):
T      P            W       P
warm   0.5          sun     0.6
cold   0.5          rain    0.1
                    fog     0.3
                    meteor  0.0

• A discrete distribution is a TABLE of probabilities of values • The probability of a state (lower case) is a single number

• Must have: P(x) ≥ 0 for every value x, and Σ_x P(x) = 1

Joint Distributions

• A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment: P(X1 = x1, …, Xn = xn)

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

– How many assignments if n variables with domain sizes d? (d^n)
– Must obey: P(x1, …, xn) ≥ 0 and Σ_{x1,…,xn} P(x1, …, xn) = 1

• For all but the smallest distributions, impractical to write out or estimate – Instead, we make additional assumptions about the distribution Marginal Distributions

• Marginal distributions are sub-tables which eliminate variables • Marginalization (summing out): Combine collapsed rows by adding

Joint distribution P(T, W):
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

P(t) = Σ_w P(t, w):      P(w) = Σ_t P(t, w):
T     P                  W     P
hot   0.5                sun   0.6
cold  0.5                rain  0.4

Conditional Probabilities
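To make marginalization concrete, here is a minimal Python sketch (an illustration added here, not part of the original slides) that sums out one variable of the joint table above; the dictionary representation is just an assumption for the example.

from collections import defaultdict

# Joint distribution P(T, W) from the table above.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginalize(joint, axis):
    """Sum out every variable except the one at position `axis`."""
    marginal = defaultdict(float)
    for assignment, p in joint.items():
        marginal[assignment[axis]] += p
    return dict(marginal)

print(marginalize(joint, axis=0))  # P(T): {'hot': 0.5, 'cold': 0.5}
print(marginalize(joint, axis=1))  # P(W): {'sun': 0.6, 'rain': 0.4}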

• A simple relation between joint and conditional probabilities
  – In fact, this is taken as the definition of a conditional probability:

    P(a | b) = P(a, b) / P(b)

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

Conditional Distributions

• Conditional distributions are probability distributions over some variables given fixed values of others

Conditional Distributions:

P(W | T = hot):       P(W | T = cold):
W     P               W     P
sun   0.8             sun   0.4
rain  0.2             rain  0.6

Joint Distribution:
T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3

The Product Rule

• Sometimes have conditional distributions but want the joint: P(x, y) = P(x | y) P(y)

• Example:

P(D | W):             P(W):            P(D, W) = P(D | W) P(W):
D    W     P          W     P          D    W     P
wet  sun   0.1        sun   0.8        wet  sun   0.08
dry  sun   0.9        rain  0.2        dry  sun   0.72
wet  rain  0.7                         wet  rain  0.14
dry  rain  0.3                         dry  rain  0.06

Bayes’ Rule

• Two ways to factor a joint distribution over two variables:

  P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

  P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
  – Lets us build one conditional from its reverse
  – Often one conditional is tricky but the other one is simple
  – Foundation of many practical systems (e.g. speech recognition, machine translation)
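As a quick worked illustration (added here, using the product-rule tables above), Bayes' rule recovers P(W | D = wet) from P(D | W) and P(W):

# P(D | W) and P(W) from the product-rule example above.
p_d_given_w = {("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
               ("wet", "rain"): 0.7, ("dry", "rain"): 0.3}
p_w = {"sun": 0.8, "rain": 0.2}

def bayes(d):
    """P(W = w | D = d) = P(D = d | W = w) P(W = w) / P(D = d)."""
    unnormalized = {w: p_d_given_w[(d, w)] * p_w[w] for w in p_w}
    z = sum(unnormalized.values())         # P(D = d), the normalization constant
    return {w: p / z for w, p in unnormalized.items()}

print(bayes("wet"))   # {'sun': 0.3636..., 'rain': 0.6363...}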

• In the running for most important ML equation! Returning to thumbtack example…

• P(Heads) = θ, P(Tails) = 1-θ …

• Flips are i.i.d.: D = {xi | i = 1…n}, P(D | θ) = Πi P(xi | θ)
  – Independent events
  – Identically distributed according to a Bernoulli distribution

• Sequence D of αH heads and αT tails: P(D | θ) = θ^αH (1 - θ)^αT

Called the “likelihood” of the data under the model

Maximum Likelihood Estimation

• Data: Observed set D of αH heads and αT tails
• Hypothesis: Bernoulli distribution
• Learning: finding θ is an optimization problem
  – What's the objective function?

• MLE: Choose θ to maximize probability of D

Your first parameter learning algorithm

  θ_MLE = argmax_θ ln P(D | θ) = argmax_θ ln [ θ^αH (1 - θ)^αT ]

• Set derivative to zero, and solve!

  d/dθ ln P(D | θ) = d/dθ [ αH ln θ + αT ln(1 - θ) ] = αH/θ - αT/(1 - θ) = 0

  ⟹ θ_MLE = αH / (αH + αT)
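A minimal Python sketch of this estimator (added as an illustration, not from the lecture), using the 3-heads-out-of-5 thumbtack data from the consulting example:

def bernoulli_mle(flips):
    """MLE for P(heads): theta_hat = alpha_H / (alpha_H + alpha_T)."""
    alpha_h = sum(1 for x in flips if x == "H")
    alpha_t = len(flips) - alpha_h
    return alpha_h / (alpha_h + alpha_t)

print(bernoulli_mle(["H", "H", "T", "H", "T"]))   # 0.6 = 3/5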


[Plot: the log-likelihood L(θ; D) = ln P(D | θ) of the data as a function of θ]

What if I have prior beliefs?

• Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: in the beginning, the prior Pr(θ); after observing flips (e.g. {tails, tails}), the posterior Pr(θ | D)]

Bayesian Learning

• Use Bayes' rule!

  P(θ | D) = P(D | θ) P(θ) / P(D)

  (Posterior = Data Likelihood × Prior / Normalization)

• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation!


Bayesian Learning for Thumbtacks

Likelihood: P(D | θ) = θ^αH (1 - θ)^αT

• What should the prior be? – Represent expert knowledge – Simple posterior form

• For binary variables, the commonly used prior is the Beta distribution:

  P(θ) = Beta(βH, βT) ∝ θ^(βH - 1) (1 - θ)^(βT - 1)

  [Figure: Beta prior distribution P(θ)]

• Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior distribution has a particularly simple form:

  P(θ | D) ∝ θ^αH (1 - θ)^αT · θ^(βH - 1) (1 - θ)^(βT - 1)
           = θ^(αH + βH - 1) (1 - θ)^(αT + βT - 1)
           = Beta(αH + βH, αT + βT)
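A small Python sketch of this conjugate update (an illustration added here; the prior pseudo-counts βH = βT = 5 are an assumed choice encoding the billionaire's "close to 50-50" belief):

def beta_bernoulli_update(beta_h, beta_t, flips):
    """Posterior is Beta(alpha_H + beta_H, alpha_T + beta_T) after observing flips."""
    alpha_h = sum(1 for x in flips if x == "H")
    alpha_t = len(flips) - alpha_h
    return beta_h + alpha_h, beta_t + alpha_t

post_h, post_t = beta_bernoulli_update(5, 5, ["H", "H", "T", "H", "T"])
print(post_h, post_t)                 # Beta(8, 7)
print(post_h / (post_h + post_t))     # posterior mean of theta = 8/15 ≈ 0.533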

Using for prediction

• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:

  E[f(θ)] = ∫ f(θ) P(θ | D) dθ

• Integral is often hard to compute
• As more data is observed, the posterior becomes more concentrated
• MAP (maximum a posteriori) approximation: use the most likely parameter, θ_MAP = argmax_θ P(θ | D), to approximate the expectation: E[f(θ)] ≈ f(θ_MAP)

What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…

Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b  ⟹  Y ~ N(aµ + b, a²σ²)

• Sum of independent Gaussians is Gaussian
  – X ~ N(µX, σ²X)
  – Y ~ N(µY, σ²Y)
  – Z = X + Y  ⟹  Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!

Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – µ ("mean")
  – σ ("standard deviation")

i      Exam score xi
0      85
1      95
2      100
3      12
…      …
99     89

MLE for Gaussian:

• Prob. of i.i.d. samples D = {x1, …, xN}:

  P(D | µ, σ) = (1 / (σ√(2π)))^N Π_i exp( -(xi - µ)² / (2σ²) )

• MLE:

  µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

• Log-likelihood of data:

  ln P(D | µ, σ) = -N ln(σ√(2π)) - Σ_{i=1}^N (xi - µ)² / (2σ²)

Your second learning algorithm: MLE for mean of a Gaussian

• What's the MLE for the mean?

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dµ ln P(D | µ, σ) = Σ_{i=1}^N (xi - µ) / σ² = 0

⟹ Σ_{i=1}^N xi - N µ = 0   ⟹   µ_MLE = (1/N) Σ_{i=1}^N xi



MLE for variance

• Again, set derivative to zero:

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dσ ln P(D | µ, σ) = -N/σ + Σ_{i=1}^N (xi - µ)² / σ³ = 0

⟹ σ²_MLE = (1/N) Σ_{i=1}^N (xi - µ_MLE)²


Learning Gaussian parameters

• MLE:

  µ_MLE = (1/N) Σ_{i=1}^N xi
  σ²_MLE = (1/N) Σ_{i=1}^N (xi - µ_MLE)²
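A brief Python sketch of these estimators (an added illustration, not the lecture's code), including the unbiased variance estimator discussed just below:

def gaussian_mle(xs):
    """Sample mean, biased (1/N) MLE variance, and unbiased (1/(N-1)) variance."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_mle, var_unbiased

scores = [85, 95, 100, 12, 89]   # a few of the exam scores from the table above
print(gaussian_mle(scores))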

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator: σ²_unbiased = (1/(N - 1)) Σ_{i=1}^N (xi - µ_MLE)²

Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for mean:

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

Bayesian Classification

• Problem statement:

– Given features X1,X2,…,Xn – Predict a label Y

[Next several slides adapted from: Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Example Application

• Digit Recognition

[Figure: an image of a handwritten digit is fed into the classifier, which outputs “5”]

• X1, …, Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {0,1,2,3,4,5,6,7,8,9}

The Bayes Classifier

• If we had the joint distribution on X1, …, Xn and Y, we could predict using:

  argmax_y P(Y = y | X1 = x1, …, Xn = xn)

– (for example: what is the probability that the image represents a 5 given its pixels?)

• So … How do we compute that? The Bayes Classifier

• Use Bayes' Rule!

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

  (likelihood × prior in the numerator; the denominator is the normalization constant)

• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label The Bayes Classifier

• Let's expand this for our digit recognition task:

  P(Y = y | X1, …, Xn) ∝ P(X1, …, Xn | Y = y) P(Y = y),  for y = 0, 1, …, 9

• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest

Model Parameters

• How many parameters are required to specify the likelihood,

P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 pixels, that is 2^900 - 1 parameters for each of the 10 classes)

• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)

Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given class:

    P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

– More generally: P(X1, …, Xn | Y) = Π_{i=1}^n P(Xi | Y)

• How many parameters now?
• Suppose X is composed of n binary features (then only O(n) parameters per class, instead of O(2^n))

The Naïve Bayes Classifier

• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have likelihood P(Xi | Y)

  [Graphical model: class Y with children X1, X2, …, Xn]

• Decision rule:

  y* = argmax_y P(y) Π_{i=1}^n P(xi | y)

If certain assumptions hold, NB is the optimal classifier! (they typically don't)

• Input: pixel grids

• Output: a digit 0-9

Are the naïve Bayes assumptions realistic here?

What has to be learned?

The class prior P(Y) and the per-pixel conditionals, e.g.:

Y    P(Y)    P(Xi = on | Y)    P(Xj = on | Y)
1    0.1     0.01              0.05
2    0.1     0.05              0.01
3    0.1     0.05              0.90
4    0.1     0.30              0.80
5    0.1     0.80              0.90
6    0.1     0.90              0.90
7    0.1     0.05              0.25
8    0.1     0.60              0.85
9    0.1     0.50              0.60
0    0.1     0.80              0.80

(Xi and Xj here are two example pixel features.)

MLE for the parameters of NB

• Given dataset
  – Count(A = a, B = b) ≜ number of examples where A = a and B = b
• MLE for discrete NB, simply:
  – Prior:
    P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
  – Observation distribution:
    P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σ_{x'} Count(Xi = x', Y = y)

MLE for the parameters of NB

• Training amounts to, for each of the classes, averaging all of the examples together:

MAP estimation for NB

• Given dataset
  – Count(A = a, B = b) ≜ number of examples where A = a and B = b
• MAP estimation for discrete NB, simply:
  – Prior:
    P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')
  – Observation distribution:
    P(Xi = x | Y = y) = (Count(Xi = x, Y = y) + a) / (Σ_{x'} Count(Xi = x', Y = y) + |Xi| · a)
• Called "smoothing". Corresponds to a Dirichlet prior!
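To make the counting concrete, here is a compact Python sketch of training and prediction for a discrete naïve Bayes classifier with this kind of smoothing (an illustration under assumed binary features, not code from the lecture):

from collections import Counter, defaultdict
from math import log

def train_nb(X, y, a=1.0, num_values=2):
    """MAP estimates with pseudo-count a (Laplace/Dirichlet smoothing)."""
    n = len(X[0])
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    feat_counts = defaultdict(float)   # (class, feature index, value) -> count
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            feat_counts[(c, i, v)] += 1
    def likelihood(c, i, v):
        return (feat_counts[(c, i, v)] + a) / (class_counts[c] + num_values * a)
    return prior, likelihood, n

def predict_nb(xs, prior, likelihood, n):
    """Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)."""
    scores = {c: log(p) + sum(log(likelihood(c, i, xs[i])) for i in range(n))
              for c, p in prior.items()}
    return max(scores, key=scores.get)

# Tiny hypothetical dataset with 3 binary features.
X = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
y = ["pos", "pos", "neg", "neg"]
prior, lik, n = train_nb(X, y)
print(predict_nb((1, 0, 0), prior, lik, n))   # -> "pos"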


What about if there is missing data?

• One of the key strengths of Bayesian approaches is that they can naturally handle missing data

Missing data

 Suppose we don't have a value for some attribute Xi
   applicant's credit history unknown
   some medical test not performed on the patient
 How to compute P(X1=x1 … Xj=? … Xd=xd | y)?
 Easy with Naïve Bayes
   ignore the attribute in instances where its value is missing
   compute the likelihood based on observed attributes
   no need to "fill in" or explicitly model missing values
   based on conditional independence between attributes

Missing data (2)

• Ex: three coin tosses: Event = {X1=H, X2=?, X3=T}
• event = head, unknown (either head or tail), tail
• event = {H,H,T} + {H,T,T}
• P(event) = P(H,H,T) + P(H,T,T)
• General case: Xj has missing value – sum P(x1, …, xj, …, xd | y) over all values of xj; under naïve Bayes this simply drops the P(Xj | y) factor, since Σ_xj P(xj | y) = 1

[Slide from Victor Lavrenko and Nigel Goddard]
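The sketch below (my illustration, not the lecture's code, with made-up parameters) shows how naïve Bayes handles a missing feature by simply skipping its factor, exactly as described above:

from math import log

# Hypothetical per-class parameters P(Xi = 1 | Y) for 3 binary features, plus priors.
prior = {"pos": 0.5, "neg": 0.5}
p_on = {"pos": [0.8, 0.6, 0.7], "neg": [0.2, 0.5, 0.3]}

def nb_score(xs, c):
    """log P(c) plus log-likelihood of the *observed* features only; None = missing."""
    score = log(prior[c])
    for i, v in enumerate(xs):
        if v is None:
            continue            # missing value: its factor sums to 1, so just skip it
        p = p_on[c][i] if v == 1 else 1 - p_on[c][i]
        score += log(p)
    return score

x = (1, None, 0)                # second feature unobserved
print(max(prior, key=lambda c: nb_score(x, c)))   # -> "pos"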


Summary

 Naïve Bayes
  - explicitly handles class priors
  - "normalizes" across observations: outliers comparable
  - assumption: all dependence is "explained" by class label
 Continuous example
  - unable to handle correlated data
 Discrete example
  - fooled by repetitions
  - must deal with zero-frequency problem
 Pros:
  - handles missing data
  - good computational complexity
  - incremental updates

Computational complexity

• One of the fastest learning methods
• O(nd + cd) training time complexity
  • c … number of classes
  • n … number of instances
  • d … number of dimensions (attributes)
  • both learning and prediction
  • no hidden constants (number of iterations, etc.)
• testing: O(ndc)
• O(dc) space complexity
  • only decision trees are more compact

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then the Naive Bayes classifier is defined by

  x ↦ sgn(w · x + b),

where

  wi = log( Pr[xi = 1 | +1] / Pr[xi = 1 | -1] ) - log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] )

and

  b = log( Pr[+1] / Pr[-1] ) + Σ_{i=1}^N log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ).

Proof: observe that for any i ∈ [1, N],

  log( Pr[xi | +1] / Pr[xi | -1] )
    = [ log( Pr[xi = 1 | +1] / Pr[xi = 1 | -1] ) - log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ) ] xi
      + log( Pr[xi = 0 | +1] / Pr[xi = 0 | -1] ).
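A quick numerical check of this identity (an illustration added here, with made-up probabilities): the naive Bayes log-odds and the linear score w · x + b agree on every binary input.

from itertools import product
from math import log, isclose

# Hypothetical NB parameters: Pr[x_i = 1 | class] for classes +1 and -1, and class priors.
p_pos = [0.8, 0.3, 0.6]          # Pr[x_i = 1 | +1]
p_neg = [0.4, 0.5, 0.2]          # Pr[x_i = 1 | -1]
prior_pos, prior_neg = 0.7, 0.3

w = [log(p_pos[i] / p_neg[i]) - log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3)]
b = log(prior_pos / prior_neg) + sum(log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3))

for x in product([0, 1], repeat=3):
    log_odds = log(prior_pos / prior_neg) + sum(
        log((p_pos[i] if xi else 1 - p_pos[i]) / (p_neg[i] if xi else 1 - p_neg[i]))
        for i, xi in enumerate(x))
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    assert isclose(log_odds, linear)
print("NB log-odds equals w·x + b on all 8 binary inputs")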

[Slide from Mehryar Mohri]

Outline of lecture

• Review of probability
• Maximum likelihood estimation

2 examples of Bayesian classifiers:
• Naïve Bayes
• Logistic regression

[Next several slides adapted from: Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]

Logistic Regression

 Learn P(Y|X) directly!  Assume a particular functional form ✬ ? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0 ✬ But, this is not differentiable (hard to learn)… doesn’t allow for label noise...

[Figure: a hard threshold would give P(Y=1|X) = 1 on one side of a linear boundary and P(Y=1|X) = 0 on the other]

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (sigmoid)

  σ(z) = 1 / (1 + e^(-z))

• Sigmoid applied to a linear function of the data:

  P(Y = 1 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

  and, since the two probabilities must sum to 1,

  P(Y = 0 | X) = exp(w0 + Σ_{i=1}^n wi Xi) / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

• Features can be discrete or continuous!

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

[Plot: the sigmoid rising from 0 to 1 as its argument increases]

Features can be discrete or continuous!

Logistic Regression: decision boundary

• Prediction: Output the Y with highest P(Y|X)
  – For binary Y, output Y = 0 if

    1 < P(Y = 0 | X) / P(Y = 1 | X)

    Substituting the expressions above, this becomes

    1 < exp(w0 + Σ_{i=1}^n wi Xi),

    i.e. a linear decision rule: predict Y = 0 when w0 + Σ_i wi Xi > 0
3.1 Form of P(Y|X) for a Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the logistic form given above. In particular, consider a GNB based on the following modeling assumptions:

• Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

• X = ⟨X1, …, Xn⟩, where each Xi is a continuous random variable

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood

Discriminative (Logistic Regr.) maximizes the conditional data likelihood

Focuses only on learning P(Y|X), which is all that matters for classification

Maximizing Conditional Log Likelihood

l(w) = ln Π_j P(y^j | x^j, w) = Σ_j [ y^j (w0 + Σ_i wi xi^j) - ln(1 + exp(w0 + Σ_i wi xi^j)) ]

(each y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w! No local maxima
Concave functions are easy to optimize

Optimizing a concave function – Gradient ascent

• Conditional likelihood for Logistic Regression is concave!
• Gradient:

Learning rate, η > 0. Update rule:

  wi ← wi + η ∂l(w)/∂wi

Maximize Conditional Log Likelihood: Gradient ascent

  ∂l(w)/∂wi = ∂/∂wi Σ_j [ y^j (w0 + Σ_i wi xi^j) - ln(1 + exp(w0 + Σ_i wi xi^j)) ]
            = Σ_j xi^j [ y^j - exp(w0 + Σ_i wi xi^j) / (1 + exp(w0 + Σ_i wi xi^j)) ]
            = Σ_j xi^j [ y^j - P(Y = 1 | x^j, w) ]
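A compact Python sketch of this update (an added illustration with toy data and an assumed learning rate; it uses the convention P(Y=1|x,w) = 1/(1 + exp(-(w0 + Σi wi xi))), which matches the gradient above):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient_ascent(X, y, eta=0.1, iters=1000):
    """Gradient ascent on the conditional log likelihood of logistic regression."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(iters):
        # Gradient: sum_j x_i^j (y^j - P(Y=1|x^j, w)); the intercept uses x_0 = 1.
        g0, g = 0.0, [0.0] * n
        for xs, yj in zip(X, y):
            p = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xs)))
            g0 += yj - p
            for i in range(n):
                g[i] += xs[i] * (yj - p)
        w0 += eta * g0
        for i in range(n):
            w[i] += eta * g[i]
    return w0, w

# Toy data (hypothetical): y tends to be 1 when the first feature is large.
X = [(0.1, 1.0), (0.4, 0.8), (2.0, 0.2), (2.5, 0.1)]
y = [0, 0, 1, 1]
print(lr_gradient_ascent(X, y))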


Gradient Ascent for LR

Gradient ascent algorithm: (η > 0)
do:
  w0 ← w0 + η Σ_j [ y^j - P(Y = 1 | x^j, w) ]
  For i = 1 to n: (iterate over features)
    wi ← wi + η Σ_j xi^j [ y^j - P(Y = 1 | x^j, w) ]
until "change" < ε

Loop over training examples!

Naïve Bayes vs. Logistic Regression

Learning: h: X → Y
  X – features
  Y – target classes

Generative (Naïve Bayes):
• Assume functional form for
  – P(X|Y), assuming conditional independence
  – P(Y)
  – Est. params from training data
• Gaussian NB for continuous features
• Bayes rule to calc. P(Y|X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative (Logistic Regression):
• Assume functional form for
  – P(Y|X), no assumptions
  – Est. params from training data
• Handles discrete & continuous features
• Directly calculate P(Y|X = x)
  – Can't generate a data sample

Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers
• Asymptotic comparison (# training examples → infinity)
  – when model correct
    • NB, Linear Discriminant Analysis (with class-independent variances), and Logistic Regression produce identical classifiers

  – when model incorrect
    • LR is less biased – does not assume conditional independence
    • therefore LR expected to outperform NB

Naïve Bayes vs. Logistic Regression

[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers
• Non-asymptotic analysis
  – convergence rate of parameter estimates (n = # of attributes in X)
    • Size of training data to get close to the infinite-data solution
    • Naïve Bayes needs O(log n) samples
    • Logistic Regression needs O(n) samples

  – Naïve Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates

[Plot legend: naïve Bayes vs. logistic regression]

Some experiments from UCI data sets
