Generative vs. Discriminative Classifiers: Naïve Bayes vs. Logistic Regression

Generative vs. Discriminative Classifiers

• Goal: we wish to learn f: X → Y, e.g., P(Y|X).
• Generative classifiers (e.g., Naïve Bayes); graphical model Y_i → X_i:
  - Assume some functional form for P(X|Y) and P(Y); this is a 'generative' model of the data.
  - Estimate the parameters of P(X|Y) and P(Y) directly from the training data.
  - Use Bayes rule to calculate P(Y | X = x).
• Discriminative classifiers; graphical model X_i → Y_i:
  - Directly assume some functional form for P(Y|X); this is a 'discriminative' model of the data.
  - Estimate the parameters of P(Y|X) directly from the training data.

Naïve Bayes vs. Logistic Regression

• Consider Y boolean and X continuous, X = <X_1, ..., X_m>.
• Number of parameters to estimate:

  NB:
  $$ p(y = k \mid x) = \frac{\pi_k \exp\!\left\{-\sum_j \left(\tfrac{1}{2\sigma_{k,j}^2}(x_j - \mu_{k,j})^2 + \log \sigma_{k,j} + C\right)\right\}}{\sum_{k'} \pi_{k'} \exp\!\left\{-\sum_j \left(\tfrac{1}{2\sigma_{k',j}^2}(x_j - \mu_{k',j})^2 + \log \sigma_{k',j} + C\right)\right\}} $$

  LR:
  $$ \mu(x) = \frac{1}{1 + e^{-\theta^T x}} $$

• Estimation method:
  - NB parameter estimates are uncoupled.
  - LR parameter estimates are coupled.

Naïve Bayes vs. Logistic Regression: asymptotic comparison (# training examples → ∞)

• When the model assumptions are correct:
  - NB and LR produce identical classifiers.
• When the model assumptions are incorrect:
  - LR is less biased: it does not assume conditional independence.
  - LR is therefore expected to outperform NB.

Naïve Bayes vs. Logistic Regression: non-asymptotic analysis (see [Ng & Jordan, 2002])

• Convergence rate of the parameter estimates: how many training examples are needed to assure good estimates?
  - NB: order log m (where m = # of attributes in X)
  - LR: order m
• NB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Rate of convergence: logistic regression

• Let h_{Dis,n} be logistic regression trained on n examples in m dimensions. Then, with high probability, its error is within a small additive term of that of the asymptotic (infinite-data) logistic classifier.
• Implication: to come within some small constant ε_0 of the asymptotic error, it suffices to pick order m examples; that is, logistic regression converges to its asymptotic classifier in order m examples.
• The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m.

Rate of convergence: naïve Bayes parameters

• Let any ε_1, δ > 0 and any n ≥ 0 be fixed, and assume that for some fixed ρ_0 > 0 the class prior is bounded away from 0 and 1 by ρ_0.
• Let n = O((1/ε_1^2) log(m/δ)).
• Then, with probability at least 1 − δ, after n examples:
  1. For discrete inputs, for all i and b, the estimates of P(X_i = b | Y) are within ε_1 of their asymptotic values.
  2. For continuous inputs, for all i and b, the estimated class-conditional means and variances are within ε_1 of their asymptotic values.

Some experiments from UCI data sets

[Figure: empirical comparison of NB and LR on several UCI data sets.]

Summary

• Naïve Bayes classifier
  - What is the assumption?
  - Why do we use it?
  - How do we learn it?
• Logistic regression
  - Its functional form follows from the Naïve Bayes assumptions: for Gaussian Naïve Bayes with class-independent variances, and for discrete-valued Naïve Bayes too.
  - But the training procedure picks the parameters without the conditional-independence assumption.
• Gradient ascent/descent
  - A general approach when closed-form solutions are unavailable.
• Generative vs. discriminative classifiers
  - Bias vs. variance tradeoff.
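
To make the comparison above concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture): the Gaussian Naïve Bayes parameters (class priors, per-class means and variances) are estimated independently in closed form and combined by Bayes rule, while the logistic-regression weights are estimated jointly by gradient ascent on the conditional log-likelihood. The synthetic data, the function names (fit_gaussian_nb, nb_posterior, fit_logistic), and the step size and iteration count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy two-class data: two Gaussian blobs in 2-D (synthetic, for illustration only).
    X = np.vstack([rng.normal([-1.0, -1.0], 1.0, size=(50, 2)),
                   rng.normal([+1.0, +1.0], 1.0, size=(50, 2))])
    y = np.r_[np.zeros(50), np.ones(50)]

    # Generative: Gaussian naive Bayes.  P(Y) and P(X|Y) are estimated separately
    # (uncoupled, closed form); Bayes rule then gives P(Y|X).
    def fit_gaussian_nb(X, y):
        params = {}
        for k in (0, 1):
            Xk = X[y == k]
            params[k] = (len(Xk) / len(X),        # prior   pi_k
                         Xk.mean(axis=0),         # means   mu_{k,j}
                         Xk.var(axis=0) + 1e-9)   # vars    sigma^2_{k,j}
        return params

    def nb_posterior(params, x):
        # log pi_k + sum_j log N(x_j; mu_{k,j}, sigma^2_{k,j}), then normalize.
        logp = np.array([np.log(pi) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                                   + (x - mu) ** 2 / var)
                         for pi, mu, var in (params[0], params[1])])
        p = np.exp(logp - logp.max())
        return p[1] / p.sum()                     # P(y = 1 | x)

    # Discriminative: logistic regression.  All weights are fitted jointly
    # (coupled) by gradient ascent on the conditional log-likelihood; no
    # conditional-independence assumption is used.
    def fit_logistic(X, y, lr=0.1, steps=2000):
        Xb = np.hstack([np.ones((len(X), 1)), X])     # intercept feature x0 = 1
        theta = np.zeros(Xb.shape[1])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-Xb @ theta))
            theta += lr * Xb.T @ (y - p) / len(y)     # gradient of log-likelihood
        return theta

    nb = fit_gaussian_nb(X, y)
    theta = fit_logistic(X, y)
    x_new = np.array([0.5, 0.2])
    print("NB  P(y=1|x) =", nb_posterior(nb, x_new))
    print("LR  P(y=1|x) =", 1.0 / (1.0 + np.exp(-(theta[0] + theta[1:] @ x_new))))

On well-specified Gaussian data such as this, the two posteriors should be close; when the conditional-independence assumption is violated, the jointly fitted logistic weights generally give the better-calibrated posterior, as discussed above.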
Machine Learning 10-701/15-781, Fall 2011
Linear Regression and Sparsity
Eric Xing
Lecture 4, September 21, 2011
Reading:

Machine learning for apartment hunting

• Now you've moved to Pittsburgh!!
• And you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus, ...

  Living area (ft^2)   # bedrooms   Rent ($)
  230                  1            600
  506                  2            1000
  433                  2            1100
  109                  1            500
  ...                  ...          ...
  150                  1            ?
  270                  1.5          ?

The learning problem

• Features: living area, distance to campus, # of bedrooms, ...; denote them as x = [x^1, x^2, ..., x^k].
• Target: rent; denote it as y.
  [Figures: scatter plots of rent against living area, and against living area and location.]
• Training set:

  $$ X = \begin{bmatrix} -\; x_1 \;- \\ -\; x_2 \;- \\ \vdots \\ -\; x_n \;- \end{bmatrix}
       = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^k \\ x_2^1 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^k \end{bmatrix},
     \qquad
     Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
     \;\text{ or }\;
     \begin{bmatrix} -\; y_1 \;- \\ -\; y_2 \;- \\ \vdots \\ -\; y_n \;- \end{bmatrix} $$

Linear Regression

• Assume that Y (the target) is a linear function of X (the features), e.g.:
  $$ \hat{y} = \theta_0 + \theta_1 x^1 + \theta_2 x^2 $$
• Let's assume a vacuous "feature" x^0 = 1 (this is the intercept term; why?), and define the feature vector to be x = [x^0, x^1, ..., x^k].
• Then we have the following general representation of the linear function:
  $$ \hat{y} = \theta^T x $$
• Our goal is to pick the optimal θ. How?
• We seek the θ that minimizes the following cost function:
  $$ J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \big(\hat{y}(x_i) - y_i\big)^2 $$

The Least-Mean-Square (LMS) method

• The cost function:
  $$ J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta - y_i)^2 $$
• Consider a gradient descent algorithm:
  $$ \theta_j^{t+1} = \theta_j^t - \alpha \,\frac{\partial}{\partial \theta_j} J(\theta^t) $$
• Working out the derivative, we have the following descent rule:
  $$ \theta_j^{t+1} = \theta_j^t + \alpha \sum_{i=1}^{n} (y_i - x_i^T \theta^t)\, x_i^j $$
• For a single training point, we have:
  $$ \theta_j^{t+1} = \theta_j^t + \alpha\, (y_i - x_i^T \theta^t)\, x_i^j $$
  - This is known as the LMS update rule, or the Widrow-Hoff learning rule.
  - It is actually a "stochastic", "coordinate" descent algorithm.
  - It can be used as an on-line algorithm.

Geometric interpretation and convergence of LMS

  [Figure: panels for N = 1, 2, 3 illustrating the LMS update.]
  $$ \theta^{t+1} = \theta^t + \alpha\, (y_i - x_i^T \theta^t)\, x_i $$
• Claim: when the step size α satisfies certain conditions, and when certain other technical conditions are satisfied, LMS converges to an "optimal region".

Steepest Descent and LMS

• Steepest descent. Note that
  $$ \nabla_\theta J = \left[\frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_k}\right]^T = -\sum_{i=1}^{n} (y_i - x_i^T \theta)\, x_i $$
  $$ \theta^{t+1} = \theta^t + \alpha \sum_{i=1}^{n} (y_i - x_i^T \theta^t)\, x_i $$
• This is a batch gradient descent algorithm.
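
The two update rules above are easy to state in code. The following NumPy sketch (my own illustration, not code from the lecture) applies the single-point LMS rule, one epoch at a time, and the batch steepest-descent rule to the apartment data from the table above. Standardizing the features and the particular step size α are practical assumptions of the sketch, not prescriptions from the slides.

    import numpy as np

    # Training data from the apartment table: living area (ft^2), # bedrooms -> rent ($).
    A = np.array([[230.0, 1.0], [506.0, 2.0], [433.0, 2.0], [109.0, 1.0]])
    y = np.array([600.0, 1000.0, 1100.0, 500.0])

    A = (A - A.mean(axis=0)) / A.std(axis=0)   # standardize so one step size works
    X = np.hstack([np.ones((len(A), 1)), A])   # prepend intercept feature x0 = 1

    def lms_epoch(theta, X, y, alpha):
        # One pass of the LMS / Widrow-Hoff rule: update after every single example.
        for x_i, y_i in zip(X, y):
            theta = theta + alpha * (y_i - x_i @ theta) * x_i
        return theta

    def steepest_descent_step(theta, X, y, alpha):
        # One batch step: sum the gradient contributions of all examples.
        return theta + alpha * X.T @ (y - X @ theta)

    theta_lms = np.zeros(X.shape[1])
    theta_gd = np.zeros(X.shape[1])
    for _ in range(2000):
        theta_lms = lms_epoch(theta_lms, X, y, alpha=0.05)
        theta_gd = steepest_descent_step(theta_gd, X, y, alpha=0.05)

    # With a suitable alpha, both end up close to the least-squares solution.
    print("LMS              theta =", theta_lms)
    print("steepest descent theta =", theta_gd)

The stochastic rule touches one example per update (cheap, usable on-line), while the batch rule sums over all examples per update, which is exactly the trade-off summarized at the end of this lecture.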
The normal equations

• Write the cost function in matrix form:
  $$ J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta - y_i)^2
               = \frac{1}{2} (X\theta - y)^T (X\theta - y)
               = \frac{1}{2} \big(\theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y\big), $$
  where X is the matrix with rows x_i^T and y = [y_1, ..., y_n]^T.
• To minimize J(θ), take the derivative and set it to zero:
  $$ \nabla_\theta J = \frac{1}{2} \nabla_\theta \,\mathrm{tr}\big(\theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y\big)
                     = \frac{1}{2} \big(\nabla_\theta \,\mathrm{tr}\,\theta^T X^T X \theta - 2 \nabla_\theta \,\mathrm{tr}\, y^T X \theta + \nabla_\theta \,\mathrm{tr}\, y^T y\big)
                     = \frac{1}{2} \big(X^T X \theta + X^T X \theta - 2 X^T y\big)
                     = X^T X \theta - X^T y = 0 $$
  $$ \Rightarrow\; X^T X \theta = X^T y \;\;\text{(the normal equations)} \qquad \Rightarrow\; \theta^* = (X^T X)^{-1} X^T y $$

Some matrix derivatives

• For f: R^{m×n} → R, define
  $$ \nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix} $$
• Trace:
  $$ \mathrm{tr}\, a = a \;(a \text{ scalar}), \qquad \mathrm{tr}\, A = \sum_{i=1}^{n} A_{ii}, \qquad \mathrm{tr}\, ABC = \mathrm{tr}\, CAB = \mathrm{tr}\, BCA $$
• Some facts about matrix derivatives (without proof):
  $$ \nabla_A \,\mathrm{tr}\, AB = B^T, \qquad \nabla_A \,\mathrm{tr}\, A B A^T C = CAB + C^T A B^T, \qquad \nabla_A |A| = |A|\,(A^{-1})^T $$

Comments on the normal equation

• In most situations of practical interest, the number of data points n is larger than the dimensionality k of the input space and the matrix X is of full column rank. If this condition holds, it is easy to verify that X^T X is necessarily invertible.
• The assumption that X^T X is invertible implies that it is positive definite, so the critical point we have found is a minimum.
• What if X has less than full column rank? → regularization (later).

Direct and Iterative methods

• Direct methods: we can reach the solution in a single step by solving the normal equation.
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps.
  - It can be infeasible when data arrive as a real-time stream, or when the data set is very large.
• Iterative methods: stochastic or steepest gradient descent.
  - They converge only in a limiting sense.
  - But they are more attractive in large practical problems.
  - Caution is needed in choosing the learning rate α.

Convergence rate

• Theorem: the steepest-descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α satisfies a suitable bound (on the order of the inverse of the largest eigenvalue of X^T X).
• A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α, or gradually decrease α.

A Summary

• LMS update rule:
  $$ \theta_j^{t+1} = \theta_j^t + \alpha\, (y_i - x_i^T \theta^t)\, x_i^j $$
  - Pros: on-line, low per-step cost, fast convergence, and perhaps less prone to local optima.
  - Cons: convergence to the optimum is not always guaranteed.
• Steepest descent:
  $$ \theta^{t+1} = \theta^t + \alpha \sum_{i=1}^{n} (y_i - x_i^T \theta^t)\, x_i $$
  - Pros: easy to implement, conceptually clean, guaranteed convergence.
  - Cons: batch; often slow to converge.
• Normal equations:
  $$ \theta^* = (X^T X)^{-1} X^T y $$
  - Pros: a single-shot algorithm! Easiest to implement.
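
For completeness, here is a one-step ("direct") solution of the normal equations for the same apartment data, again a sketch of my own rather than code from the lecture. It uses np.linalg.solve on X^T X θ = X^T y instead of forming the inverse explicitly; np.linalg.lstsq would be the more robust choice when X is close to rank-deficient (the regularization case mentioned above).

    import numpy as np

    # Apartment data again: living area (ft^2), # bedrooms -> rent ($).
    A = np.array([[230.0, 1.0], [506.0, 2.0], [433.0, 2.0], [109.0, 1.0]])
    y = np.array([600.0, 1000.0, 1100.0, 500.0])
    X = np.hstack([np.ones((len(A), 1)), A])        # intercept feature x0 = 1

    # Solve the normal equations X^T X theta = X^T y in one shot.
    theta_star = np.linalg.solve(X.T @ X, X.T @ y)
    print("theta* =", theta_star)

    # Predicted rents for the two query rows of the table (150 ft^2 / 1 bedroom,
    # 270 ft^2 / 1.5 bedrooms); purely illustrative, fit on only four points.
    queries = np.array([[1.0, 150.0, 1.0], [1.0, 270.0, 1.5]])
    print("predicted rents:", queries @ theta_star)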
