CSE 446: Gaussian Naïve Bayes & Logistic Regression
Winter 2012, Dan Weld
(Some slides from Carlos Guestrin and Luke Zettlemoyer)

Last Time
• Learning Gaussians
• Naïve Bayes

Today
• Gaussian Naïve Bayes
• Logistic Regression

Text Classification: Bag-of-Words Representation
A document is represented by how often each vocabulary word occurs in it, e.g.:
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0

Bayesian Learning
Use Bayes rule:
  P(Y | X) = P(X | Y) P(Y) / P(X)
where P(Y | X) is the posterior, P(X | Y) the data likelihood, P(Y) the prior, and P(X) the normalization term. Or equivalently:
  P(Y | X) ∝ P(X | Y) P(Y)

Naïve Bayes
• Naïve Bayes assumption: features are independent given the class,
    P(X1, ..., Xn | Y) = Πi P(Xi | Y)
• How many parameters now? Suppose X is composed of n binary features: with binary Y, naïve Bayes needs only 2n conditional parameters plus the prior, instead of the 2(2^n − 1) needed to model the full joint P(X | Y).

The Distributions We Love
                                         Binary {0, 1}    k values
  Single event                           Bernoulli
  Sequence of N trials (N = #H + #T)     Binomial         Multinomial
  Conjugate prior                        Beta             Dirichlet

NB with Bag of Words for Text Classification
• Learning phase:
  – Prior P(Ym): count how many documents are from topic m, divided by the total number of docs.
  – P(Xi | Ym): let Bm be the bag of words formed from all the docs in topic m, and let #(i, B) be the number of times word i appears in bag B. Then
      P(Xi | Ym) = (#(i, Bm) + 1) / (W + Σj #(j, Bm)),   where W = number of unique words.
• Test phase:
  – For each document, use the naïve Bayes decision rule.

Easy to Implement
• But if you implement it naïvely, it probably won't work...

Probabilities: Important Detail!
• P(spam | X1 ... Xn) = Πi P(spam | Xi)
• Any more potential problems here? We are multiplying lots of small numbers: danger of underflow!
    0.5^57 ≈ 7 × 10^-18
• Solution? Use logs and add:
    p1 · p2 = e^(log p1 + log p2)
• Always keep probabilities in log form.

Naïve Bayes Posterior Probabilities
• Classification results of naïve Bayes (i.e., the class with maximum posterior probability) are usually fairly accurate (?!?!?).
• However, due to the inadequacy of the conditional independence assumption:
  – the actual posterior-probability estimates are not accurate;
  – the output probabilities are generally very close to 0 or 1.

[Figures: Twenty Newsgroups results; learning curve for Twenty Newsgroups]

What if Features are Continuous?
E.g., character recognition: Xi is the i-th pixel. Use Gaussian Naïve Bayes:
  P(Y | X) ∝ P(X | Y) P(Y)
  P(Xi = x | Y = yk) = N(x; μik, σik),  where  N(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))

Gaussian Naïve Bayes
Sometimes we assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ).

Learning Gaussian Parameters
Maximum likelihood estimates, where x_i^j is feature i of the j-th training example and δ(z) = 1 if z is true, else 0:
• Mean:      μik = Σj δ(y^j = yk) x_i^j / Σj δ(y^j = yk)
• Variance:  σik² = Σj δ(y^j = yk) (x_i^j − μik)² / Σj δ(y^j = yk)

Example: GNB for Classifying Mental States [Mitchell et al.]
• fMRI brain scans: ~1 mm resolution, ~2 images per second, 15,000 voxels per image; non-invasive and safe.
• Measures the Blood Oxygen Level Dependent (BOLD) response; the typical impulse response lasts about 10 seconds.
• Brain scans can track activation with precision and sensitivity.
• Gaussian Naïve Bayes: learn μ(voxel, word) to model P(BrainActivity | WordCategory ∈ {People, Animal}).
• Pairwise classification accuracy: 85% (People words vs. Animal words).
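To make the bag-of-words recipe above concrete, here is a minimal sketch (not from the original slides; the function names and toy documents are illustrative) of naïve Bayes text classification with add-one smoothing and log-space scoring:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, topic) pairs. Returns log priors, log word probs, vocab."""
    topic_counts = Counter(topic for _, topic in docs)
    bags = defaultdict(Counter)                 # B_m: bag of words for topic m
    vocab = set()
    for words, topic in docs:
        bags[topic].update(words)
        vocab.update(words)
    W = len(vocab)                              # number of unique words
    log_prior = {m: math.log(c / len(docs)) for m, c in topic_counts.items()}
    log_pw = {}                                 # log P(X_i | Y_m), add-one smoothed
    for m, bag in bags.items():
        total = sum(bag.values())               # Σ_j #(j, B_m)
        log_pw[m] = {w: math.log((bag[w] + 1) / (W + total)) for w in vocab}
    return log_prior, log_pw, vocab

def classify(words, log_prior, log_pw, vocab):
    """Naive Bayes decision rule, computed as sums of logs to avoid underflow."""
    return max(log_prior,
               key=lambda m: log_prior[m] + sum(log_pw[m][w] for w in words if w in vocab))

# Toy usage with made-up documents:
docs = [("oil gas Africa".split(), "energy"),
        ("apple banana fruit".split(), "food"),
        ("gas prices oil".split(), "energy")]
model = train_nb(docs)
print(classify("oil Africa anxious".split(), *model))   # -> 'energy'
```

Keeping every score as a sum of logs is exactly the "use logs and add" fix for underflow described above.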
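Similarly, a minimal sketch of the Gaussian naïve Bayes maximum likelihood estimates (per-class, per-feature mean and variance) and the corresponding decision rule; the data here are random and purely illustrative:

```python
import numpy as np

def fit_gnb(X, y):
    """MLE of per-class, per-feature Gaussians: mu[k, i], var[k, i], and class priors."""
    classes = np.unique(y)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.array([X[y == k].var(axis=0) for k in classes])   # MLE variance (divide by count)
    prior = np.array([(y == k).mean() for k in classes])
    return classes, mu, var, prior

def predict_gnb(X, classes, mu, var, prior):
    """Pick argmax_k of log P(y_k) + sum_i log N(x_i; mu_ik, var_ik)."""
    log_post = []
    for k in range(len(classes)):
        ll = -0.5 * (np.log(2 * np.pi * var[k]) + (X - mu[k]) ** 2 / var[k]).sum(axis=1)
        log_post.append(np.log(prior[k]) + ll)
    return classes[np.argmax(np.column_stack(log_post), axis=1)]

# Illustrative usage with two random Gaussian clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
print(predict_gnb(X[:2], *fit_gnb(X, y)))   # predictions for two class-0 points
```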
What You Need to Know about Naïve Bayes
• Optimal decision using the Bayes classifier.
• Naïve Bayes classifier:
  – what the assumption is,
  – why we use it,
  – how we learn it.
• Text classification:
  – bag-of-words model.
• Gaussian NB:
  – features are still conditionally independent given the class,
  – each feature has a Gaussian distribution given the class.

What Is (Supervised) Learning, More Formally?
• Given:
  – Dataset: instances {⟨x1, t(x1)⟩, ..., ⟨xN, t(xN)⟩},
    e.g., ⟨xi, t(xi)⟩ = ⟨(GPA = 3.9, IQ = 120, MLscore = 99), 150K⟩.
  – Hypothesis space H, e.g., polynomials of degree 8.
  – Loss function: measures the quality of a hypothesis h ∈ H, e.g., squared error for regression.
• Obtain:
  – Learning algorithm: find the h ∈ H that minimizes the loss function,
    e.g., using a closed-form solution if available, or greedy search if not.
• We want to minimize prediction error, but can only minimize error on the dataset.

Types of (Supervised) Learning Problems, Revisited
• Decision trees, e.g.: dataset ⟨votes; party⟩; hypothesis space; loss function.
• NB classification, e.g.: dataset ⟨brain image; {verb vs. noun}⟩; hypothesis space; loss function.
• Density estimation, e.g.: dataset of grades; hypothesis space; loss function.

Learning Is (Simply) Function Approximation!
• The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function.
• Learning is no magic! We are simply trying to find a function that fits the data.
• Regression, density estimation, classification: (not surprisingly) seemingly different problems, very similar solutions.

What You Need to Know about Supervised Learning
• Learning is function approximation.
• What functions are being optimized?

Generative vs. Discriminative Classifiers
• Want to learn h: X → Y, where X are the features and Y the target classes.
• Bayes optimal classifier: P(Y | X).
• Generative classifier, e.g., naïve Bayes:
  – assume some functional form for P(X | Y) and P(Y);
  – estimate the parameters of P(X | Y) and P(Y) directly from training data;
  – use Bayes rule to calculate P(Y | X = x).
  – This is a 'generative' model: P(Y | X) is computed indirectly through Bayes rule, and as a result we can also generate a sample of the data, since P(X) = Σy P(y) P(X | y).
• Discriminative classifier, e.g., logistic regression:
  – assume some functional form for P(Y | X);
  – estimate the parameters of P(Y | X) directly from training data.
  – This is the 'discriminative' model: it learns P(Y | X) directly, but cannot generate a sample of the data, because P(X) is not available.

Logistic Regression
• Learn P(Y | X) directly!
• Assume a particular functional form. A hard threshold (P(Y) = 0 on one side of a boundary, P(Y) = 1 on the other) is not differentiable, so use the logistic function, aka sigmoid.

Logistic Function in n Dimensions
• Sigmoid applied to a linear function of the data:
    P(Y = 1 | X, w) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
• [Figure: 1-D sigmoid plots for w0 = −2, w1 = −1; w0 = 0, w1 = −1; w0 = 0, w1 = −0.5; and a 2-D sigmoid surface.]
• Features can be discrete or continuous! Very convenient: this functional form also implies a linear classification rule, since we predict Y = 1 exactly when w0 + Σi wi Xi ≥ 0.

Loss Functions: Likelihood vs. Conditional Likelihood
• Generative (naïve Bayes) loss function: the data likelihood,
    ln P(D | w) = Σj ln P(x^j, y^j | w).
• Discriminative models cannot compute P(x^j | w)!
• But the discriminative (logistic regression) loss function is the conditional data likelihood,
    ln P(D_Y | D_X, w) = Σj ln P(y^j | x^j, w).
  – It doesn't waste effort learning P(X); it focuses on P(Y | X), which is all that matters for classification.
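As a small illustration of this functional form (again a sketch with illustrative names, not code from the slides), here is the sigmoid applied to a linear function of the features, together with the conditional log likelihood that logistic regression maximizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(X, w, w0):
    """P(Y=1 | x, w) = sigmoid(w0 + w . x) for each row of X."""
    return sigmoid(w0 + X @ w)

def conditional_log_likelihood(X, y, w, w0):
    """Sum over examples of ln P(y^j | x^j, w); labels y must be 0/1."""
    p = p_y1_given_x(X, w, w0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative 1-D example using the plotted weights w0 = -2, w1 = -1:
xs = np.linspace(-6, 6, 5).reshape(-1, 1)
ys = (xs.ravel() < -2).astype(float)            # toy labels consistent with that boundary
print(p_y1_given_x(xs, np.array([-1.0]), -2.0))
print(conditional_log_likelihood(xs, ys, np.array([-1.0]), -2.0))
```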
Expressing and Maximizing the Conditional Log Likelihood
• For binary Y ∈ {0, 1}:
    l(w) = Σj [ y^j (w0 + Σi wi x_i^j) − ln(1 + exp(w0 + Σi wi x_i^j)) ]
• Good news: l(w) is a concave function of w, so there are no locally optimal solutions.
• Bad news: there is no closed-form solution that maximizes l(w).
• Good news: concave functions are easy to optimize.

Optimizing a Concave Function: Gradient Ascent
• The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent.
• Update rule, with learning rate η > 0:
    w ← w + η ∇w l(w)
• Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see the reading).

Gradient Ascent for LR
• Iterate until the change is smaller than ε. For i = 1, ..., n, repeat:
    w0 ← w0 + η Σj [ y^j − P(Y = 1 | x^j, w) ]
    wi ← wi + η Σj x_i^j [ y^j − P(Y = 1 | x^j, w) ]

That's All M(C)LE. How about MAP?
• One common approach is to define a prior on w:
  – a Normal distribution with zero mean and identity covariance,
  – which "pushes" the parameters towards zero.
• This corresponds to regularization: it helps avoid very large weights and overfitting (more on this later in the semester).
• MAP estimate:
    w* = arg max_w [ ln P(w) + Σj ln P(y^j | x^j, w) ]

M(C)AP as Regularization
• The zero-mean Gaussian prior adds a penalty −(λ/2) Σi wi² to the objective, which penalizes high weights; the same idea is also applicable in linear regression.

Large Parameters → Overfitting
• If the data are linearly separable, the weights go to infinity, which leads to overfitting.
• Penalizing high weights can prevent overfitting (again, more on this later in the semester).

Gradient of M(C)AP: MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE):
    wi ← wi + η Σj x_i^j [ y^j − P(Y = 1 | x^j, w) ]
• Maximum conditional a posteriori estimate (MCAP):
    wi ← wi + η { −λ wi + Σj x_i^j [ y^j − P(Y = 1 | x^j, w) ] }

Logistic Regression vs. Naïve Bayes: Deriving P(Y | X) for Continuous Xi
• Consider learning f: X → Y, where X is a vector of real-valued features ⟨X1 ... Xn⟩ and Y is boolean.
• We could use a Gaussian Naïve Bayes classifier:
  – assume all Xi are conditionally independent given Y,
  – model P(Xi | Y = yk) as Gaussian N(μik, σi),
  – model P(Y) as Bernoulli(θ, 1 − θ).
• What does that imply about the form of P(Y | X)? Taking the ratio of class-conditional probabilities and simplifying shows that P(Y = 1 | X) has exactly the logistic (sigmoid-of-linear) form. Cool!!!!

Gaussian Naïve Bayes vs. Logistic Regression
Consider Y boolean and Xi continuous, X = ⟨X1 ... Xn⟩: a set of Gaussian Naïve Bayes parameters (feature variance independent of the class label) vs. a set of logistic regression parameters.
• Number of parameters: NB: 4n + 1; LR: n + 1.
• Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.
• Representation equivalence, but only in a special case: GNB with class-independent variances.
• But what's the difference? LR makes no assumptions about P(X | Y) in learning. The loss function! They optimize different functions, and so obtain different solutions.
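Putting the pieces together, here is a minimal sketch (illustrative, not the course's code) of logistic regression trained by gradient ascent on the conditional log likelihood, with an optional L2 penalty corresponding to the zero-mean Gaussian prior in the MAP view above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.01, lam=0.0, iters=500):
    """Gradient ascent on the conditional log likelihood.
    X: (m, n) features, y: (m,) labels in {0, 1}.
    lam > 0 adds the MAP / L2 penalty -(lam/2) * ||w||^2 (bias not penalized here)."""
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(iters):
        p = sigmoid(w0 + X @ w)             # P(Y=1 | x^j, w) for every example
        err = y - p                         # y^j - P(Y=1 | x^j, w)
        w0 += eta * err.sum()
        w += eta * (X.T @ err - lam * w)    # MCLE gradient plus -lam*w from the Gaussian prior
    return w0, w

# Illustrative usage on separable toy data; with lam = 0 the weights keep growing,
# with lam > 0 they stay bounded (the overfitting point made above).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
w0, w = train_logistic_regression(X, y, eta=0.01, lam=0.1)
print(w0, w, ((sigmoid(w0 + X @ w) > 0.5).astype(int) == y).mean())  # weights and train accuracy
```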
