Gaussian Naïve Bayes & Logistic Regression
CSE 446, Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer

Last Time
• Learning Gaussians
• Naïve Bayes

Today
• Gaussian Naïve Bayes
• Logistic Regression

Bayesian Learning

Use Bayes rule:
  Posterior:  P(Y | X) = P(X | Y) P(Y) / P(X)
  (data likelihood × prior, divided by the normalization P(X))
Or equivalently:  P(Y | X) ∝ P(X | Y) P(Y)

Text Classification: Bag of Words Representation

Each document is represented by its word counts, e.g.:
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0

Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given class:  P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
  – More generally:  P(X1, …, Xn | Y) = Πi P(Xi | Y)
• How many parameters now?
• Suppose X is composed of n binary features
  (a quick count is sketched after the table below)

The Distributions We Love

                                Discrete: Binary {0, 1}    Discrete: k values    Continuous
  Single event                  Bernoulli
  Sequence (N trials, N = H+T)  Binomial                   Multinomial
  Conjugate prior               Beta                       Dirichlet
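A quick parameter count under the binary-feature setup above (my arithmetic, not spelled out on the slide): for boolean Y and n binary features,

```latex
\underbrace{2\,(2^{n}-1)}_{\text{full joint } P(X_1,\dots,X_n \mid Y)\text{, one table per class}}\;+\;1
\qquad\text{vs.}\qquad
\underbrace{2n}_{\text{Naïve Bayes: one } P(X_i{=}1 \mid Y{=}y_k)\text{ per feature, per class}}\;+\;1
```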

NB with Bag of Words for Text Classification

• Learning phase (a code sketch follows below):
  – Prior P(Ym): count how many documents are from topic m / total # of docs
  – P(Xi | Ym):
    • Let Bm be the bag of words formed from all the docs in topic m
    • Let #(i, B) be the number of times word i appears in bag B
    • P(Xi | Ym) = (#(i, Bm) + 1) / (W + Σj #(j, Bm)),  where W = # of unique words
• Test phase:
  – For each document, use the naïve Bayes decision rule

Easy to Implement

• But…
• If you do… it probably won't work…

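A minimal sketch of the learning phase above (illustrative code, not from the slides; the input format `docs_by_topic` is an assumption):

```python
from collections import Counter

def train_nb(docs_by_topic):
    """docs_by_topic: dict mapping topic m -> list of documents (each a list of words).
    Returns the priors P(Y_m) and the add-one-smoothed word likelihoods P(X_i | Y_m)."""
    total_docs = sum(len(docs) for docs in docs_by_topic.values())
    vocab = {w for docs in docs_by_topic.values() for doc in docs for w in doc}
    W = len(vocab)                                     # number of unique words
    priors, likelihoods = {}, {}
    for m, docs in docs_by_topic.items():
        priors[m] = len(docs) / total_docs             # #docs in topic m / total #docs
        bag = Counter(w for doc in docs for w in doc)  # B_m: bag of words from all docs in topic m
        denom = W + sum(bag.values())                  # W + sum_j #(j, B_m)
        likelihoods[m] = {w: (bag[w] + 1) / denom for w in vocab}  # (#(i,B_m)+1) / (W + sum_j #(j,B_m))
    return priors, likelihoods
```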

Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes
  – i.e., the class with maximum posterior probability
  – usually fairly accurate (?!?!?)
• However, due to the inadequacy of the conditional-independence assumption…
  – Actual posterior-probability estimates are not accurate.
  – Output probabilities are generally very close to 0 or 1.

Probabilities: Important Detail!

• P(spam | X1 … Xn) = Πi P(spam | Xi)
• Any more potential problems here?
  – We are multiplying lots of small numbers → danger of underflow!
    e.g., 0.5^57 ≈ 7 × 10^-18
  – Solution? Use logs and add:  p1 · p2 = e^(log p1 + log p2)
  – Always keep probabilities in log form
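A sketch of the decision rule computed in log space (illustrative; it reuses the `priors` and `likelihoods` from the training sketch above and simply skips words never seen in training):

```python
import math

def classify(doc, priors, likelihoods):
    """Return the topic m maximizing log P(Y_m) + sum_i log P(X_i | Y_m)."""
    best_topic, best_score = None, -math.inf
    for m in priors:
        score = math.log(priors[m])
        for w in doc:
            if w in likelihoods[m]:              # ignore out-of-vocabulary words
                score += math.log(likelihoods[m][w])
        if score > best_score:
            best_topic, best_score = m, score
    return best_topic
```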

Twenty News Groups results (figure)

Learning curve for Twenty News Groups (figure)

Bayesian Learning: What if Features are Continuous?

E.g., character recognition: Xi is the i-th pixel

Prior:            P(Y)
Posterior:        P(Y | X) ∝ P(X | Y) P(Y)
Data likelihood:  P(Xi = x | Y = yk) = N(μik, σik)

N(μik, σik) = (1 / (σik √(2π))) · exp( −(x − μik)² / (2 σik²) )

Gaussian Naïve Bayes

P(Y | X) ∝ P(X | Y) P(Y)
P(Xi = x | Y = yk) = N(μik, σik),  where N(μik, σik) = (1 / (σik √(2π))) · exp( −(x − μik)² / (2 σik²) )

Sometimes assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)

Learning Gaussian Parameters

Maximum Likelihood Estimates (j indexes training examples; δ(z) = 1 if z is true, else 0):

• Mean:      μ̂ik = Σj Xi^j δ(Y^j = yk)  /  Σj δ(Y^j = yk)
• Variance:  σ̂ik² = Σj (Xi^j − μ̂ik)² δ(Y^j = yk)  /  Σj δ(Y^j = yk)

Example: GNB for classifying mental states [Mitchell et al.]

fMRI data [Mitchell et al.]: ~1 mm resolution, ~2 images per sec, 15,000 voxels/image;
non-invasive and safe; measures the Blood Oxygen Level Dependent (BOLD) response
(typical impulse response lasts ~10 sec). Brain scans can track activation with precision and sensitivity.

Gaussian Naïve Bayes: Learned voxel-word models
P(BrainActivity | WordCategory ∈ {People, Animal})  [Mitchell et al.]
(figure: learned activation maps for People words vs. Animal words)
Pairwise classification accuracy: 85%

What You Need to Know about Naïve Bayes

• Optimal decision using the Bayes classifier
• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How we learn it
• Text classification
  – Bag of words model
• Gaussian NB
  – Features still conditionally independent
  – Features have a Gaussian distribution given the class

What's (supervised) learning, more formally

• Given:
  – Dataset: instances {⟨x1; t(x1)⟩, …, ⟨xN; t(xN)⟩}
    • e.g., ⟨xi; t(xi)⟩ = ⟨(GPA=3.9, IQ=120, MLscore=99); 150K⟩
  – Hypothesis space: H
    • e.g., polynomials of degree 8
  – Loss function: measures quality of hypothesis h ∈ H
    • e.g., squared error for regression
• Obtain:
  – Learning: obtain h ∈ H that minimizes the loss function
    • e.g., using a closed-form solution if available
    • or greedy search if not
  – Want to minimize prediction error, but can only minimize error on the dataset


Types of (supervised) learning problems, revisited

• Decision trees, e.g.,
  – dataset: votes; party
  – hypothesis space:
  – loss function:
• NB classification, e.g.,
  – dataset: brain image; {verb v. noun}
  – hypothesis space:
  – loss function:
• Density estimation, e.g.,
  – dataset: grades
  – hypothesis space:
  – loss function:

Learning is (simply) function approximation!

• The general (supervised) learning problem:
  – Given some data (including features), a hypothesis space, and a loss function
  – Learning is no magic!
  – Simply trying to find a function that fits the data
• Regression
• Density estimation
• Classification
• (Not surprisingly) Seemingly different problems, very similar solutions…


What you need to know

• Learning is function approximation
• What functions are being optimized?

Generative vs. Discriminative Classifiers

• Want to learn h: X → Y
  – X: features
  – Y: target classes
• Bayes optimal classifier: P(Y | X)
• Generative classifier, e.g., Naïve Bayes:
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y | X = x)
  – This is a 'generative' model
    • Indirect computation of P(Y|X) through Bayes rule
    • As a result, can also generate a sample of the data:  P(X) = Σy P(y) P(X|y)
• Discriminative classifiers, e.g., Logistic Regression:
  – Assume some functional form for P(Y|X)
  – Estimate parameters of P(Y|X) directly from training data
  – This is the 'discriminative' model
    • Directly learn P(Y|X)
    • But cannot obtain a sample of the data, because P(X) is not available


Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form
  – A hard 0/1 threshold (P(Y)=1 on one side, P(Y)=0 on the other) is not differentiable…
  – Instead use a smooth, S-shaped function, aka the sigmoid:
      P(Y=1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))


Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:
  P(Y=1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
Features can be discrete or continuous!
(figure: sigmoid surface over two features x1, x2)

Understanding Sigmoids

(figures: the 1-D sigmoid 1 / (1 + exp(−(w0 + w1 x))) plotted for w0=−2, w1=−1; w0=0, w1=−1; and w0=0, w1=−0.5)
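A tiny sketch of the sigmoid applied to a linear function of the data, evaluated at the weight settings plotted above (illustrative code):

```python
import numpy as np

def sigmoid_linear(x, w0, w):
    """P(Y=1 | x) = 1 / (1 + exp(-(w0 + w . x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

# 1-D examples matching the plots: changing w0 shifts the curve, |w1| controls its steepness.
for w0, w1 in [(-2.0, -1.0), (0.0, -1.0), (0.0, -0.5)]:
    values = [round(float(sigmoid_linear(np.array([x]), w0, np.array([w1]))), 3) for x in (-4, 0, 4)]
    print(f"w0={w0}, w1={w1}: P(Y=1|x) at x=-4,0,4 -> {values}")
```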

Very convenient!

  P(Y=1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
implies
  P(Y=0 | X) = exp(−(w0 + Σi wi Xi)) / (1 + exp(−(w0 + Σi wi Xi)))
implies
  P(Y=1 | X) / P(Y=0 | X) = exp(w0 + Σi wi Xi)
implies
  ln [ P(Y=1 | X) / P(Y=0 | X) ] = w0 + Σi wi Xi   →   linear classification rule!

Loss functions: Likelihood vs. Conditional Likelihood

• Generative (Naïve Bayes) loss function: data likelihood
    ln P(D | w) = Σj ln P(xj, yj | w)
• Discriminative models cannot compute P(xj | w)!
• But the discriminative (logistic regression) loss function is the conditional data likelihood:
    ln P(DY | DX, w) = Σj ln P(yj | xj, w)
  – Doesn't waste effort learning P(X) – focuses on P(Y|X), all that matters for classification


Expressing Conditional Log Likelihood

(the standard expression for l(w) is given below)

Maximizing Conditional Log Likelihood

Good news: l(w) is a concave function of w → no locally optimal solutions
Bad news: no closed-form solution to maximize l(w)
Good news: concave functions are easy to optimize
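For reference, the textbook form of the conditional log likelihood for Y ∈ {0, 1} (the slide's own equation is not reproduced in the text, so this is the standard expression under the sigmoid model above):

```latex
l(\mathbf{w}) \;=\; \sum_j \ln P\!\left(y^j \mid \mathbf{x}^j, \mathbf{w}\right)
\;=\; \sum_j \Big[\, y^j \big(w_0 + \textstyle\sum_i w_i x_i^j\big)
      \;-\; \ln\!\big(1 + \exp(w_0 + \textstyle\sum_i w_i x_i^j)\big) \Big]
```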


Optimizing a concave function: gradient ascent

• The conditional likelihood for logistic regression is concave → find the optimum with gradient ascent

Gradient:     ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]
Update rule:  w ← w + η ∇w l(w),  with learning rate η > 0

• Gradient ascent is the simplest of optimization approaches
  – e.g., conjugate gradient ascent is much better (see reading)

Maximize Conditional Log Likelihood: Gradient

∂l(w)/∂wi = Σj xi^j ( yj − P(Y=1 | xj, w) )

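A minimal sketch of batch gradient ascent on this objective (illustrative; a fixed learning rate `eta` and a fixed iteration count stand in for a real convergence test):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.01, iters=1000):
    """X: m-by-n feature matrix, y: length-m array of 0/1 labels.
    Maximizes the conditional log likelihood by gradient ascent."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s so w[0] plays the role of w0
    w = np.zeros(n + 1)
    for _ in range(iters):
        p = sigmoid(Xb @ w)                # P(Y=1 | x^j, w) for every training example
        grad = Xb.T @ (y - p)              # d l(w) / d w_i = sum_j x_i^j (y^j - p^j)
        w = w + eta * grad                 # ascent step: w <- w + eta * gradient
    return w
```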

Gradient Ascent for LR

Gradient ascent algorithm: iterate until change < ε
  For i = 1, …, n:  wi ← wi + η Σj xi^j ( yj − P(Y=1 | xj, w) )
  repeat

That's all M(C)LE. How about MAP?

• One common approach is to define priors on w
  – Normal distribution, zero mean, identity covariance
  – "Pushes" parameters towards zero
• Corresponds to regularization
  – Helps avoid very large weights and overfitting
  – More on this later in the semester
• MAP estimate:  w* = arg maxw  ln [ P(w) Πj P(yj | xj, w) ]


Large parameters → Overfitting

• If the data is linearly separable, the weights go to infinity
• Leads to overfitting
• Penalizing high weights can prevent overfitting…
  – again, more on this later in the semester

M(C)AP as Regularization

Adding a zero-mean Gaussian prior on w penalizes high weights; the same idea is also applicable in linear regression.

Gradient of M(C)AP

MLE vs MAP

• Maximum conditional likelihood estimate
• Maximum conditional a posteriori estimate
  (the standard forms and the regularized gradient are given below)
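The usual textbook versions of these quantities, assuming a zero-mean Gaussian prior on w with precision λ (the slide's own equations appear only as images):

```latex
\text{MLE: } \mathbf{w}^{*} = \arg\max_{\mathbf{w}} \sum_j \ln P\!\left(y^j \mid \mathbf{x}^j, \mathbf{w}\right)
\qquad
\text{MAP: } \mathbf{w}^{*} = \arg\max_{\mathbf{w}} \;\ln P(\mathbf{w}) + \sum_j \ln P\!\left(y^j \mid \mathbf{x}^j, \mathbf{w}\right)
```

which changes the gradient of the objective by a weight-decay term:

```latex
\frac{\partial l_{\mathrm{MAP}}(\mathbf{w})}{\partial w_i}
  \;=\; -\lambda\, w_i \;+\; \sum_j x_i^j \left( y^j - P(Y{=}1 \mid \mathbf{x}^j, \mathbf{w}) \right)
```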


Logistic Regression vs. Naïve Bayes: Derive the form of P(Y|X) for continuous Xi

• Consider learning f: X → Y, where
  – X is a vector of real-valued features ⟨X1 … Xn⟩
  – Y is boolean
• Could use a Gaussian Naïve Bayes classifier
  – assume all Xi are conditionally independent given Y
  – model P(Xi | Y = yk) as Gaussian N(μik, σi)
  – model P(Y) as Bernoulli(θ, 1−θ)
• What does that imply about the form of P(Y|X)?


Ratio of class-conditional probabilities / Deriving the form of P(Y|X) for continuous Xi (continued)

(the derivation steps appear on the slides as equations; the end result is sketched below)
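Where the derivation lands, in its standard form (assuming class-independent variances σi and θ = P(Y=1); this matches the sigmoid convention used earlier):

```latex
P(Y{=}1 \mid X) \;=\; \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i X_i)\big)},
\qquad
w_i = \frac{\mu_{i1} - \mu_{i0}}{\sigma_i^{2}},
\qquad
w_0 = \ln\frac{\theta}{1-\theta} \;+\; \sum_i \frac{\mu_{i0}^{2} - \mu_{i1}^{2}}{2\sigma_i^{2}}
```

So GNB with class-independent variances yields exactly the logistic-regression functional form for P(Y|X).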


Gaussian Naïve Bayes vs. Logistic Regression

Consider Y boolean, Xi continuous, X = ⟨X1 … Xn⟩.
Compare the set of Gaussian Naïve Bayes parameters (feature variance independent of class label) with the set of Logistic Regression parameters.
Number of parameters:
• NB: 4n + 1
• LR: n + 1
Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled

Naïve Bayes vs. Logistic Regression

• Representation equivalence
  – But only in a special case!!! (GNB with class-independent σ)
• But what's the difference???
• LR makes no assumptions about P(X|Y) in learning!!!
• Loss function!!!
  – Optimize different functions → obtain different solutions


G. Naïve Bayes vs. Logistic Regression 1  [Ng & Jordan, 2002]

• Generative and Discriminative classifiers
• Asymptotic comparison (# training examples → infinity)
  – when the model is correct: GNB and LR produce identical classifiers
  – when the model is incorrect:
    • LR is less biased – does not assume conditional independence
    • therefore LR is expected to outperform GNB

G. Naïve Bayes vs. Logistic Regression 2  [Ng & Jordan, 2002]

• Generative and Discriminative classifiers
• Non-asymptotic analysis
  – convergence rate of parameter estimates, n = # of attributes in X
  – size of training data needed to get close to the infinite-data solution:
    • GNB needs O(log n) samples
    • LR needs O(n) samples
  – GNB converges more quickly to its (perhaps less helpful) asymptotic estimates

Naïve Bayes vs. Logistic Regression: some experiments from UCI data sets (figures)

What you should know about Logistic Regression (LR)

• Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
  – The solutions differ because of the objective (loss) function
• In general, NB and LR make different assumptions
  – NB: features independent given class → assumption on P(X|Y)
  – LR: functional form of P(Y|X), no assumption on P(X|Y)
• LR is a linear classifier
  – the decision rule is a hyperplane
• LR is optimized by conditional likelihood
  – no closed-form solution
  – concave → global optimum with gradient ascent
  – maximum conditional a posteriori corresponds to regularization
• Convergence rates
  – GNB (usually) needs less data
  – LR (usually) gets to better solutions in the limit

