CSE 446: Gaussian Naïve Bayes & Logistic Regression
Winter 2012, Dan Weld
Some slides from Carlos Guestrin and Luke Zettlemoyer

Last Time
• Learning Gaussians
• Naïve Bayes

Today
• Gaussian Naïve Bayes
• Logistic Regression

Text Classification: Bag of Words Representation
• Represent each document by its word counts, e.g.:
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0

Bayesian Learning
• Use Bayes rule:
  P(Y | X) = P(X | Y) P(Y) / P(X)
  (posterior = data likelihood × prior / normalization)
• Or equivalently: P(Y | X) ∝ P(X | Y) P(Y)

Naïve Bayes
• Naïve Bayes assumption: features are independent given the class:
  P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)
  – More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)
• How many parameters now? Suppose X is composed of n binary features.

The Distributions We Love
                         Discrete: binary {0,1}     Discrete: k values     Continuous
  Single event           Bernoulli
  Sequence (N trials)    Binomial (N = #H + #T)     Multinomial
  Conjugate prior        Beta                       Dirichlet

NB with Bag of Words for Text Classification
• Learning phase:
  – Prior P(Ym): count how many documents are from topic m / total # of docs
  – P(Xi | Ym):
    • Let Bm be the bag of words formed from all the docs in topic m
    • Let #(i, B) be the number of times word i is in bag B
    • P(Xi | Ym) = (#(i, Bm) + 1) / (W + Σj #(j, Bm)), where W = # of unique words
• Test phase:
  – For each document, use the naïve Bayes decision rule

Easy to Implement
• But… if you do… it probably won’t work…

Probabilities: Important Detail!
• P(spam | X1 … Xn) = ∏i P(spam | Xi)
• Any more potential problems here?
  – We are multiplying lots of small numbers: danger of underflow!
    e.g., 0.5^57 ≈ 7 × 10^−18
  – Solution? Use logs and add: p1 · p2 = e^(log(p1) + log(p2))
  – Always keep probabilities in log form

Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (i.e., the class with maximum posterior probability) are usually fairly accurate (?!?!?)
• However, due to the inadequacy of the conditional independence assumption:
  – The actual posterior-probability estimates are not accurate
  – Output probabilities are generally very close to 0 or 1

[Figure: Twenty Newsgroups results]
[Figure: learning curve for Twenty Newsgroups]

Bayesian Learning: What if Features are Continuous?
• E.g., character recognition: Xi is the i-th pixel
• P(Y | X) ∝ P(X | Y) P(Y), with a Gaussian for each class-conditional feature:
  P(Xi = x | Y = yk) = N(x; μik, σik), where N(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)^2 / (2σ^2))
• This is Gaussian Naïve Bayes

Gaussian Naïve Bayes
• Sometimes assume the variance
  – is independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)

Learning Gaussian Parameters
• Maximum likelihood estimates, where j indexes training examples and δ(x) = 1 if x is true, else 0:
  – Mean:     μik = Σj δ(t^j = yk) x_i^j / Σj δ(t^j = yk)
  – Variance: σik^2 = Σj δ(t^j = yk) (x_i^j − μik)^2 / Σj δ(t^j = yk)

Example: GNB for Classifying Mental States [Mitchell et al.]
• fMRI brain scans can track activation with precision and sensitivity:
  – ~1 mm resolution
  – ~2 images per second
  – 15,000 voxels per image
  – non-invasive and safe
  – measures the Blood Oxygen Level Dependent (BOLD) response, whose typical impulse response lasts ~10 sec
[Figure: learned μ_(voxel,word) for P(BrainActivity | WordCategory ∈ {People, Animal}) [Mitchell et al.]]
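Before looking at the fMRI results below, here is a minimal NumPy sketch tying together the pieces above: the Gaussian MLE formulas, the naïve Bayes decision rule, and the keep-everything-in-log-form advice from the underflow slide. It is an illustration rather than the course's reference code; the array names (X, y), the small variance floor, and the synthetic two-blob data are my own choices.

```python
import numpy as np

def fit_gnb(X, y):
    """MLE for Gaussian Naive Bayes: class priors plus a per-class,
    per-feature mean and variance (X: [m, n] reals, y: [m] class labels)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # small floor keeps every variance strictly positive (my own choice)
    variances = np.array([X[y == k].var(axis=0) + 1e-9 for k in classes])
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Naive Bayes decision rule: pick the class with the largest posterior,
    working in log space to avoid the underflow discussed earlier."""
    # log N(x; mu, sigma^2) for every example/class/feature, then sum over features
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)  # up to the constant log P(X)
    return classes[np.argmax(log_post, axis=1)]

# Tiny usage example on synthetic data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gnb(X, y)
print("training accuracy:", (predict_gnb(X, *model) == y).mean())
```

Note that prediction adds log densities and a log prior rather than multiplying probabilities, which is exactly the "use logs and add" point made earlier.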
• Pairwise classification accuracy: 85% (People words vs. Animal words) [Mitchell et al.]

What You Need to Know about Naïve Bayes
• Optimal decision using the Bayes classifier
• Naïve Bayes classifier
  – What’s the assumption
  – Why we use it
  – How do we learn it
• Text classification
  – Bag of words model
• Gaussian NB
  – Features still conditionally independent
  – Features have a Gaussian distribution given the class

What’s (Supervised) Learning, More Formally
• Given:
  – Dataset: instances {⟨x1, t(x1)⟩, …, ⟨xN, t(xN)⟩}
    • e.g., ⟨xi, t(xi)⟩ = ⟨(GPA=3.9, IQ=120, MLscore=99), 150K⟩
  – Hypothesis space: H
    • e.g., polynomials of degree 8
  – Loss function: measures the quality of a hypothesis h ∈ H
    • e.g., squared error for regression
• Obtain:
  – Learning algorithm: obtain the h ∈ H that minimizes the loss function
    • e.g., using a closed-form solution if available, or greedy search if not
• We want to minimize prediction error, but can only minimize error on the dataset

Types of (Supervised) Learning Problems, Revisited
• Decision trees, e.g.,
  – dataset: votes; party
  – hypothesis space:
  – loss function:
• NB classification, e.g.,
  – dataset: brain image; {verb vs. noun}
  – hypothesis space:
  – loss function:
• Density estimation, e.g.,
  – dataset: grades
  – hypothesis space:
  – loss function:

Learning is (Simply) Function Approximation!
• The general (supervised) learning problem:
  – Given some data (including features), a hypothesis space, and a loss function
  – Learning is no magic!
  – We are simply trying to find a function that fits the data
• Regression
• Density estimation
• Classification
• (Not surprisingly) seemingly different problems have very similar solutions…

What You Need to Know about Supervised Learning
• Learning is function approximation
• What functions are being optimized?

Generative vs. Discriminative Classifiers
• Want to learn h: X → Y
  – X – features
  – Y – target classes
• Bayes optimal classifier: P(Y | X)
• Generative classifier, e.g., Naïve Bayes:
  – Assume some functional form for P(X | Y), P(Y)
  – Estimate the parameters of P(X | Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y | X = x)
  – This is a ‘generative’ model
    • Indirect computation of P(Y | X) through Bayes rule
    • As a result, can also generate a sample of the data: P(X) = Σy P(y) P(X | y)
• Discriminative classifier, e.g., Logistic Regression:
  – Assume some functional form for P(Y | X)
  – Estimate the parameters of P(Y | X) directly from training data
  – This is the ‘discriminative’ model
    • Directly learn P(Y | X)
    • But cannot obtain a sample of the data, because P(X) is not available

Logistic Regression
• Learn P(Y | X) directly!
• Assume a particular functional form
  – A hard threshold on a linear function (P(Y)=0 on one side, P(Y)=1 on the other) is not differentiable…
  – Instead, use the logistic function, aka sigmoid:
    P(Y = 1 | X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))

Logistic Function in n Dimensions
• The sigmoid applied to a linear function of the data
• Features can be discrete or continuous!
[Figure: 2-D surface plot of the sigmoid of a linear function of x1, x2]

Understanding Sigmoids
[Figure: 1-D sigmoid plots for weight settings (w0=−2, w1=−1), (w0=0, w1=−1), and (w0=0, w1=−0.5)]

Very Convenient!
• The logistic form implies a linear classification rule: the log odds
  ln [ P(Y = 1 | X) / P(Y = 0 | X) ] = w0 + Σi wi Xi,
  so we predict Y = 1 exactly when w0 + Σi wi Xi > 0

Loss Functions: Likelihood vs. Conditional Likelihood
• Generative (naïve Bayes) loss function: data likelihood
  ln P(D | w) = Σj ln P(x^j, y^j | w) = Σj ln P(y^j | x^j, w) + Σj ln P(x^j | w)
• Discriminative models cannot compute P(x^j | w)!
• But the discriminative (logistic regression) loss function, the conditional data likelihood,
  ln P(D_Y | D_X, w) = Σj ln P(y^j | x^j, w)
  – doesn’t waste effort learning P(X); it focuses on P(Y | X), all that matters for classification
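As a small illustration of the conditional data likelihood above, the sketch below evaluates Σj ln P(y^j | x^j, w) under the convention P(Y = 1 | x) = 1 / (1 + exp(−(w0 + Σi wi xi))) used in these notes. The weights and data points are made up purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: P(Y=1 | x, w) when z = w0 + sum_i w_i * x_i."""
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(w0, w, X, y):
    """Discriminative loss: sum_j ln P(y^j | x^j, w).  P(x^j | w) never appears."""
    z = w0 + X @ w                        # linear function of the features
    # numerically stable form of sum_j [ y^j ln sigmoid(z^j) + (1 - y^j) ln(1 - sigmoid(z^j)) ]
    return np.sum(y * z - np.logaddexp(0.0, z))

# Made-up weights and data, just to show the calls.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
y = np.array([1, 0, 0])
w0, w = -0.5, np.array([1.0, 0.5])
print(sigmoid(w0 + X @ w))                       # P(Y=1 | x^j, w) per example
print(conditional_log_likelihood(w0, w, X, y))   # the discriminative loss
```

Nothing in this loss touches P(x^j | w), which is the point of the generative vs. discriminative contrast above.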
Expressing Conditional Log Likelihood
• l(w) ≡ ln ∏j P(y^j | x^j, w) = Σj [ y^j (w0 + Σi wi x_i^j) − ln(1 + exp(w0 + Σi wi x_i^j)) ]

Maximizing Conditional Log Likelihood
• Good news: l(w) is a concave function of w, so there are no locally optimal solutions
• Bad news: there is no closed-form solution that maximizes l(w)
• Good news: concave functions are easy to optimize

Optimizing a Concave Function: Gradient Ascent
• The conditional likelihood for logistic regression is concave, so find the optimum with gradient ascent
• Gradient: ∇w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]
• Learning rate: η > 0
• Update rule: w ← w + η ∇w l(w)
• Gradient ascent is the simplest of optimization approaches
  – e.g., conjugate gradient ascent is much better (see reading)

Gradient Ascent for LR
• Gradient ascent algorithm: iterate until the change is < ε
  – For i = 1, …, n, repeat:
    wi ← wi + η Σj x_i^j [ y^j − P̂(Y = 1 | x^j, w) ]

That’s All M(C)LE. How about MAP?
• One common approach is to define priors on w
  – Normal distribution, zero mean, identity covariance
  – “Pushes” parameters towards zero
• Corresponds to regularization
  – Helps avoid very large weights and overfitting
  – More on this later in the semester
• MAP estimate: w* = arg maxw ln [ P(w) ∏j P(y^j | x^j, w) ]

M(C)AP as Regularization
• The zero-mean Gaussian prior penalizes high weights; the same idea is also applicable in linear regression

Large Parameters → Overfitting
• If the data is linearly separable, the weights go to infinity
• This leads to overfitting
• Penalizing high weights can prevent overfitting…
  – again, more on this later in the semester

Gradient of M(C)AP
• The MAP gradient is the MLE gradient plus the gradient of the log prior (with λ set by the prior variance):
  ∂/∂wi ln [ P(w) ∏j P(y^j | x^j, w) ] = −λ wi + Σj x_i^j [ y^j − P̂(Y = 1 | x^j, w) ]

MLE vs MAP
• Maximum conditional likelihood estimate:
  wi ← wi + η Σj x_i^j [ y^j − P̂(Y = 1 | x^j, w) ]
• Maximum conditional a posteriori estimate:
  wi ← wi + η { −λ wi + Σj x_i^j [ y^j − P̂(Y = 1 | x^j, w) ] }

Logistic Regression vs. Naïve Bayes
• Consider learning f: X → Y, where
  – X is a vector of real-valued features <X1 … Xn>
  – Y is boolean
• Could use a Gaussian Naïve Bayes classifier
  – Assume all Xi are conditionally independent given Y
  – Model P(Xi | Y = yk) as Gaussian N(μik, σi)
  – Model P(Y) as Bernoulli(θ, 1 − θ)
• What does that imply about the form of P(Y | X)?

Derive the Form of P(Y | X) for Continuous Xi
• Taking the ratio of the class-conditional probabilities and simplifying shows that P(Y = 1 | X) has exactly the logistic (sigmoid) form, with weights that are functions of the GNB parameters μik, σi, θ. Cool!!!!

Gaussian Naïve Bayes vs. Logistic Regression
• Consider Y boolean, Xi continuous, X = <X1 … Xn>
• Set of Gaussian Naïve Bayes parameters (feature variance independent of the class label) vs. set of Logistic Regression parameters
• Number of parameters:
  – NB: 4n + 1
  – LR: n + 1
• Representation equivalence
  – But only in a special case!!! (GNB with class-independent variances)
• But what’s the difference???
  – LR makes no assumptions about P(X | Y) in learning!!!
  – Loss function!!! They optimize different functions, so they obtain different solutions
• Estimation method:
  – NB parameter estimates are uncoupled
  – LR parameter estimates are coupled
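Putting the logistic-regression half of the comparison together, here is a minimal gradient-ascent sketch of the M(C)AP update wi ← wi + η(−λ wi + Σj x_i^j [y^j − P̂(Y=1 | x^j, w)]). It is only a sketch under stated assumptions: the step size eta, the prior strength lam, the iteration count, and the choice to leave the intercept unpenalized are my own, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_map(X, y, lam=0.1, eta=0.05, iters=2000):
    """Gradient ascent on the regularized conditional log likelihood:
    w_i <- w_i + eta * ( -lam * w_i + sum_j x_i^j [ y^j - P(Y=1 | x^j, w) ] ).
    The intercept w0 is left unpenalized (a common, but arbitrary, choice)."""
    m, n = X.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(iters):
        p = sigmoid(w0 + X @ w)            # P(Y=1 | x^j, w) for every example
        err = y - p                        # y^j - P(Y=1 | x^j, w)
        w0 += eta * np.sum(err)            # MLE-style step for the intercept
        w += eta * (-lam * w + X.T @ err)  # MAP step: prior term plus data term
    return w0, w

# Toy linearly separable data: with lam = 0 and enough iterations the weights
# would keep growing, which is the overfitting issue the slides warn about.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (30, 2)), rng.normal(2.0, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
w0, w = train_logistic_map(X, y)
print("training accuracy:", np.mean((sigmoid(w0 + X @ w) > 0.5) == y))
```

Setting lam > 0 keeps the weights finite even on separable data, illustrating how the Gaussian prior acts as regularization, while the data term alone reproduces the M(C)LE update from the gradient-ascent slide.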