
Machine Learning
10-701/15-781, Spring 2008

Naïve Bayes Classifiers

Eric Xing
Lecture 3, January 23, 2008

Reading: Chap. 4 CB and handouts

Classification

- Representing the data
- Choosing a hypothesis class
- Learning: h : X → Y
  - X – features
  - Y – target classes

Suppose you know the following …

- Class-conditional distribution: P(X|Y)

  p(X | Y = 1) = p1(X; μ1, Σ1)
  p(X | Y = 2) = p2(X; μ2, Σ2)

- Bayes classifier:

  h(X) = arg max_y P(Y = y | X)

- Class prior (i.e., "weight"): P(Y)
- This is a generative model of the data!

Optimal classification

- Theorem: the Bayes classifier is optimal!
- That is, for any other classifier h:

  P(h_Bayes(X) ≠ Y) ≤ P(h(X) ≠ Y)

- Proof:
- Recall last lecture:
- How to learn a Bayes classifier?

Recall Bayes Rule

P(Y | X) = P(X | Y) P(Y) / P(X)

Which is shorthand for:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

Equivalently:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)

Common abbreviation:

P(Y | X) ∝ P(X | Y) P(Y)

Learning Bayes Classifier

- Training data: labeled examples (X, Y)
- Learning = estimating P(X|Y) and P(Y)
- Classification = using Bayes rule to calculate P(Y | X_new)

How hard is it to learn the optimal classifier?

- How do we represent these? How many parameters?
  - Prior, P(Y):
    - Suppose Y is composed of k classes
  - Likelihood, P(X|Y):
    - Suppose X is composed of n binary features
- Complex model! High variance with limited data!!! (see the count below)
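As a quick worked count (my own addition, following the setup above): a k-class prior needs k − 1 free parameters, and a full joint distribution over n binary features needs 2^n − 1 per class, so even a modest problem is hopeless to estimate directly:

    \underbrace{(k-1)}_{P(Y)} \;+\; \underbrace{k\,(2^{n}-1)}_{P(X\mid Y)}
    \;\xrightarrow{\;k=2,\ n=30\;}\; 1 + 2\,(2^{30}-1) \approx 2.1 \times 10^{9} \text{ parameters.}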

Naïve Bayes:

assuming that Xi and Xj are conditionally independent given Y, for all i ≠ j

Conditional Independence

- X is conditionally independent of Y given Z, if the distribution governing X is independent of the value of Y, given the value of Z
- Which we often write:

  P(X | Y, Z) = P(X | Z)

- e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
- Equivalent to:

  P(X, Y | Z) = P(X | Z) P(Y | Z)

The Naïve Bayes assumption

- Naïve Bayes assumption:
  - Features are independent given the class:

    P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

  - More generally:

    P(X1, …, Xn | Y) = ∏_i P(Xi | Y)

- How many parameters now?
  - Suppose X is composed of n binary features (see the count below)
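Again as a worked count (my own addition): under the NB assumption each class needs only one parameter per binary feature, so the k(2^n − 1) likelihood parameters collapse to k·n:

    k\,(2^{n}-1) \;\longrightarrow\; k\,n \qquad\text{e.g. } k=2,\ n=30:\ \approx 2.1\times 10^{9} \;\to\; 60.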

The Naïve Bayes Classifier

- Given:
  - Prior P(Y)
  - n conditionally independent features X1, …, Xn given the class Y
  - For each Xi, we have the likelihood P(Xi | Y)
- Decision rule:

  y* = h_NB(x) = arg max_y P(y) ∏_i P(xi | y)

- If the assumption holds, NB is the optimal classifier! (a minimal implementation sketch follows)
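A minimal sketch of this decision rule in Python (my own illustration, not from the slides; it assumes the prior and the per-feature likelihood tables are already given, e.g. estimated as on the following slides):

    import math

    def nb_classify(x, prior, likelihood):
        """Return arg max_y P(y) * prod_i P(x_i | y), computed in log space.

        prior:      dict  y -> P(Y = y)
        likelihood: dict  (i, x_i, y) -> P(X_i = x_i | Y = y)
        x:          list of feature values x_1, ..., x_n
        """
        best_y, best_score = None, -math.inf
        for y, p_y in prior.items():
            # log P(y) + sum_i log P(x_i | y)
            score = math.log(p_y) + sum(
                math.log(likelihood[(i, xi, y)]) for i, xi in enumerate(x)
            )
            if score > best_score:
                best_y, best_score = y, score
        return best_y

Working in log space avoids numerical underflow when n is large (e.g. the 1000-word documents later in the lecture).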

Naïve Bayes Algorithm

- Train_Naïve_Bayes(examples):
  - for each* value y_k: estimate P(Y = y_k)
  - for each* value x_ij of each attribute Xi: estimate P(Xi = x_ij | Y = y_k)
- Classify(X_new):

  Y_new = arg max_{y_k} P(Y = y_k) ∏_i P(X_i^new | Y = y_k)

* each set of probabilities must sum to 1, so only n − 1 of the n parameters need to be estimated...

Learning NB: parameter estimation

- Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data D:

  θ_MLE = arg max_θ P(D | θ)

- Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior and the data:

  θ_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

MLE for the parameters of NB

- Discrete features
- Maximum likelihood estimates (MLEs):
  - Given a dataset, let Count(A = a, B = b) ≡ number of examples where A = a and B = b

  P̂(Y = b) = Count(Y = b) / N
  P̂(Xi = a | Y = b) = Count(Xi = a, Y = b) / Count(Y = b)
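A small sketch of these count-based MLEs (my own illustration, assuming examples are given as lists of discrete feature values plus a class label):

    from collections import Counter, defaultdict

    def train_nb_mle(X, Y):
        """MLE for a discrete Naive Bayes model.

        X: list of examples, each a list of n discrete feature values
        Y: list of class labels, same length as X
        Returns (prior, likelihood) with
          prior[y]              = Count(Y = y) / N
          likelihood[(i, a, y)] = Count(Xi = a, Y = y) / Count(Y = y)
        """
        N = len(Y)
        class_count = Counter(Y)                  # Count(Y = y)
        joint_count = defaultdict(int)            # Count(Xi = a, Y = y)
        for x, y in zip(X, Y):
            for i, a in enumerate(x):
                joint_count[(i, a, y)] += 1

        prior = {y: c / N for y, c in class_count.items()}
        likelihood = {
            (i, a, y): c / class_count[y] for (i, a, y), c in joint_count.items()
        }
        return prior, likelihood

The outputs plug directly into the nb_classify sketch above; note that any (feature, value, class) combination never seen in training gets probability zero, which is exactly the problem discussed two slides below.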

Subtleties of NB classifier 1 – Violating the NB assumption

- Often the Xi are not really conditionally independent
- We use Naïve Bayes in many cases anyway, and it often works pretty well
  - often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
- But the resulting probabilities P(Y | X_new) are biased toward 1 or 0 (why?)

Subtleties of NB classifier 2 – Insufficient training data

- What if you never see a training instance where X1000 = a when Y = b?
  - e.g., Y = {SpamEmail or not}, X1000 = {'Rolex'}
  - Then the MLE gives P(X1000 = T | Y = T) = 0
- Thus, no matter what values X2, …, Xn take:

  P(Y = T | X1, X2, …, X1000 = T, …, Xn) = 0

- What now???

MAP for the parameters of NB

- Discrete features
- Maximum a Posteriori (MAP) estimates:
  - Given a prior:
    - Consider a binary feature
    - θ is a Bernoulli rate, with a Beta prior:

      P(\theta; \alpha_T, \alpha_F) = \frac{\Gamma(\alpha_T + \alpha_F)}{\Gamma(\alpha_T)\,\Gamma(\alpha_F)}\,\theta^{\alpha_T - 1}(1-\theta)^{\alpha_F - 1} = \frac{\theta^{\alpha_T - 1}(1-\theta)^{\alpha_F - 1}}{B(\alpha_T, \alpha_F)}

  - Let β_a = Count(X = a) ≡ number of examples where X = a

Bayesian learning for NB parameters – a.k.a. smoothing

- Posterior distribution of θ:
  - Bernoulli:

    P(θ | D) ∝ θ^(β_T + α_T − 1) (1 − θ)^(β_F + α_F − 1),   i.e. Beta(β_T + α_T, β_F + α_F)

  - Multinomial: with a Dirichlet(α_1, …, α_K) prior, the posterior is Dirichlet(β_1 + α_1, …, β_K + α_K)
- MAP estimate:

  θ̂_MAP = (β_T + α_T − 1) / (β_T + β_F + α_T + α_F − 2)

- The Beta prior is equivalent to extra thumbtack flips
- As N → ∞, the prior is "forgotten"
- But for small sample sizes, the prior is important!
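A quick numeric illustration of the MAP estimate (my own, assuming a symmetric Beta(2, 2) prior, i.e. one virtual flip of each outcome, and data with β_T = 0, β_F = 5):

    \hat{\theta}_{MAP} = \frac{\beta_T + \alpha_T - 1}{\beta_T + \beta_F + \alpha_T + \alpha_F - 2}
    = \frac{0 + 2 - 1}{0 + 5 + 2 + 2 - 2} = \frac{1}{7} \approx 0.14,
    \qquad\text{whereas}\qquad \hat{\theta}_{MLE} = \frac{0}{5} = 0.

So the zero-count problem of the previous slide disappears.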

MAP for the parameters of NB

- Dataset of N examples
  - Let β_iab = Count(Xi = a, Y = b) ≡ number of examples where Xi = a and Y = b
  - Let γ_b = Count(Y = b)
- Prior:
  - Q(Xi | Y) ∝ Multinomial(α_i1, …, α_iK) or Multinomial(α/K)
  - Q(Y) ∝ Multinomial(τ_1, …, τ_M) or Multinomial(τ/M)
  - m "virtual" examples

- MAP estimate: plug the real counts plus the virtual counts into the estimates above
- Now, even if you never observe some feature value with some class, the estimate is never zero (a smoothed version of the earlier training sketch follows)
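A sketch of such smoothed estimates, extending the MLE sketch above (my own illustration; the single symmetric pseudo-count `alpha` is an assumed simplification of the per-value prior counts, with alpha = 1 corresponding to Laplace "add-one" smoothing):

    from collections import Counter, defaultdict

    def train_nb_smoothed(X, Y, feature_values, alpha=1.0):
        """Naive Bayes estimates with `alpha` virtual examples per value.

        feature_values[i] is the set of possible values of feature Xi,
        so even unseen (value, class) pairs get nonzero probability.
        """
        N = len(Y)
        classes = set(Y)
        class_count = Counter(Y)
        joint_count = defaultdict(int)
        for x, y in zip(X, Y):
            for i, a in enumerate(x):
                joint_count[(i, a, y)] += 1

        # smoothed prior over the classes
        prior = {
            y: (class_count[y] + alpha) / (N + alpha * len(classes)) for y in classes
        }
        # smoothed likelihoods over the possible values of each feature
        likelihood = {}
        for i, values in enumerate(feature_values):
            for a in values:
                for y in classes:
                    likelihood[(i, a, y)] = (joint_count[(i, a, y)] + alpha) / (
                        class_count[y] + alpha * len(values)
                    )
        return prior, likelihood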

Text classification

- Classify e-mails
  - Y = {Spam, NotSpam}
- Classify news articles
  - Y = {what is the topic of the article?}
- Classify webpages
  - Y = {Student, professor, project, …}
- What about the features X?
  - The text!
  - Features X are the entire document – Xi for the i-th word in the article

NB for Text classification

- P(X|Y) is huge!!!
  - An article has at least 1000 words: X = {X1, …, X1000}
  - Xi represents the i-th word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc.
- The NB assumption helps a lot!!!
  - P(Xi = xi | Y = y) is just the probability of observing word xi in a document on topic y

Bag of words model

- Typical additional assumption – position in the document doesn't matter:

  P(Xi = xi | Y = y) = P(Xk = xi | Y = y)

- "Bag of words" model – order of words on the page ignored
- Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

Bag of words model

The same sentence as a bag of words – order discarded, only the word counts kept:

in is lecture lecture next over person remember room sitting the the the to to up wake when you
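A two-line illustration of this representation (my own, using Python's collections.Counter on a lower-cased, unpunctuated version of the sentence above):

    from collections import Counter

    sentence = ("when the lecture is over remember to wake up the "
                "person sitting next to you in the lecture room")
    bag = Counter(sentence.split())     # word -> count; word order is discarded
    print(bag["the"], bag["lecture"])   # 3 2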

Bag of Words Approach

aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
...
Zaire     0

NB with Bag of Words for text classification

- Learning phase:
  - Prior P(Y): count how many documents you have from each topic (+ prior)
  - P(Xi|Y): for each topic, count how many times you saw each word in documents of this topic (+ prior)
- Test phase:
  - For each document, use the naïve Bayes decision rule (a sketch of both phases follows)
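A compact sketch of both phases under the bag-of-words model (my own illustration; documents are lists of word tokens, and `alpha` is again an assumed smoothing pseudo-count):

    import math
    from collections import Counter, defaultdict

    def train_text_nb(docs, labels, alpha=1.0):
        """docs: list of documents, each a list of word tokens; labels: topics."""
        vocab = {w for doc in docs for w in doc}
        doc_count = Counter(labels)                    # documents per topic
        word_count = defaultdict(Counter)              # topic -> word -> count
        for doc, y in zip(docs, labels):
            word_count[y].update(doc)

        log_prior = {y: math.log(c / len(docs)) for y, c in doc_count.items()}
        log_like = {}
        for y in doc_count:
            total = sum(word_count[y].values()) + alpha * len(vocab)
            for w in vocab:
                log_like[(w, y)] = math.log((word_count[y][w] + alpha) / total)
        return log_prior, log_like, vocab

    def classify_text_nb(doc, log_prior, log_like, vocab):
        """arg max_y  log P(y) + sum over the document's words of log P(word | y)."""
        return max(
            log_prior,
            key=lambda y: log_prior[y]
            + sum(log_like[(w, y)] for w in doc if w in vocab),
        )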

Twenty News Groups results

Learning curve for Twenty News Groups

What if we have continuous Xi?

- E.g., character recognition: Xi is the i-th pixel
- Gaussian Naïve Bayes (GNB):

  P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)

- Sometimes we assume the variance
  - is independent of Y (i.e., σ_i),
  - or independent of Xi (i.e., σ_k),
  - or both (i.e., σ)

Estimating Parameters: Y discrete, Xi continuous

- Maximum likelihood estimates (j indexes the j-th training example):

  \hat{\mu}_{ik} = \frac{\sum_j X_i^j\,\delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}
  \qquad
  \hat{\sigma}_{ik}^2 = \frac{\sum_j (X_i^j - \hat{\mu}_{ik})^2\,\delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}

  where δ(x) = 1 if x is true, else 0
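A short sketch of these estimates and the resulting classifier (my own illustration; it assumes a separate mean and variance per class and per feature, i.e. the σ_ik case with no variance sharing, and nonzero empirical variances):

    import math
    from collections import defaultdict

    def train_gnb(X, Y):
        """Per-class, per-feature mean and variance (MLE) for Gaussian Naive Bayes.

        X: list of examples, each a list of n real-valued features; Y: class labels.
        """
        groups = defaultdict(list)
        for x, y in zip(X, Y):
            groups[y].append(x)

        params = {}
        for y, rows in groups.items():
            n_k = len(rows)
            means = [sum(col) / n_k for col in zip(*rows)]
            variances = [
                sum((v - m) ** 2 for v in col) / n_k
                for col, m in zip(zip(*rows), means)
            ]
            # store log prior, mu_ik, sigma^2_ik for this class
            params[y] = (math.log(n_k / len(X)), means, variances)
        return params

    def classify_gnb(x, params):
        """arg max_y  log P(y) + sum_i log N(x_i; mu_ik, sigma^2_ik)."""
        def log_gauss(v, m, s2):
            return -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return max(
            params,
            key=lambda y: params[y][0]
            + sum(log_gauss(v, m, s2)
                  for v, m, s2 in zip(x, params[y][1], params[y][2])),
        )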

Gaussian Naïve Bayes

Example: GNB for classifying mental states [Mitchell et al.]

fMRI:
- ~1 mm resolution
- ~2 images per sec.
- 15,000 voxels/image
- non-invasive, safe
- measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response lasts ~10 sec)

Brain scans can track activation with precision and sensitivity.

Gaussian Naïve Bayes: learned μ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.]

Learned Bayes models for P(BrainActivity | WordCategory) [Mitchell et al.]
Pairwise classification accuracy (People words vs. Animal words): 85%

What you need to know about Naïve Bayes

- Optimal classification using the Bayes classifier
- Naïve Bayes classifier
  - What's the assumption
  - Why we use it
  - How do we learn it
  - Why Bayesian estimation is important
- Text classification
  - Bag of words model
- Gaussian NB
  - Features are still conditionally independent
  - Each feature has a Gaussian distribution given the class
