Generative vs. Discriminative Classifiers: Naïve Bayes vs. Logistic Regression
Generative vs. Discriminative Classifiers
- Goal: we wish to learn f: X → Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y); this is a 'generative' model of the data (graphical model: $Y_i \to X_i$)
  - Estimate the parameters of P(X|Y) and P(Y) directly from the training data
  - Use Bayes rule to calculate P(Y|X = x)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X); this is a 'discriminative' model of the data (graphical model: $X_i \to Y_i$)
  - Estimate the parameters of P(Y|X) directly from the training data

Naïve Bayes vs. Logistic Regression
- Consider Y boolean and X continuous, with $X = \langle X_1, \ldots, X_m \rangle$
- Number of parameters to estimate:
  - NB:
$$p(y = k \mid x) = \frac{\pi_k \exp\Big\{\sum_j \Big(-\frac{1}{2\sigma_{k,j}^2}(x_j - \mu_{k,j})^2 - \log\sigma_{k,j} - C\Big)\Big\}}{\sum_{k'} \pi_{k'} \exp\Big\{\sum_j \Big(-\frac{1}{2\sigma_{k',j}^2}(x_j - \mu_{k',j})^2 - \log\sigma_{k',j} - C\Big)\Big\}}$$
  - LR:
$$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled

Naïve Bayes vs. Logistic Regression
- Asymptotic comparison (# training examples → infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs. Logistic Regression
- Non-asymptotic analysis (see [Ng & Jordan, 2002])
- Convergence rate of the parameter estimates: how many training examples are needed to assure good estimates?
  - NB: order log m (where m = # of attributes in X)
  - LR: order m
- NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression
- Let $h_{Dis,n}$ be logistic regression trained on n examples in m dimensions. Then, with high probability, its error is within a small additive term of the error of the asymptotic classifier $h_{Dis,\infty}$.
- Implication: if we want that additive term to be below some small constant $\epsilon_0$, it suffices to pick order m examples → LR converges to its asymptotic classifier in order m examples
- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m

Rate of convergence: Naïve Bayes parameters
- Fix any $\epsilon_1, \delta > 0$ and any $n \ge 0$, and assume that for some fixed $\rho_0 > 0$ the class priors are bounded below by $\rho_0$.
- Let n be of order $(1/\epsilon_1^2)\log(m/\delta)$.
- Then, with probability at least $1 - \delta$, after n examples:
  1. For discrete inputs, for all i and b, the NB estimates of $P(x_i = b \mid y)$ are within $\epsilon_1$ of their asymptotic values
  2. For continuous inputs, for all i and b, the NB estimates of the class-conditional means and variances are within $\epsilon_1$ of their asymptotic values

Some experiments from UCI data sets

Summary
- Naïve Bayes classifier:
  - What is the assumption?
  - Why do we use it?
  - How do we learn it?
- Logistic regression:
  - The functional form follows from the Naïve Bayes assumptions
    - For Gaussian Naïve Bayes with class-independent variances ($\sigma_{k,j} = \sigma_j$)
    - For discrete-valued Naïve Bayes too
  - But the training procedure picks the parameters without the conditional-independence assumption
- Gradient ascent/descent:
  - A general approach when closed-form solutions are unavailable
- Generative vs. discriminative classifiers:
  - Bias vs. variance tradeoff
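As a concrete illustration of the uncoupled versus coupled estimation discussed above, here is a minimal NumPy sketch (not from the slides): Gaussian Naïve Bayes with shared per-feature variances is fit in closed form, class by class, while logistic regression fits coupled weights by gradient ascent on the conditional log-likelihood. The synthetic data, learning rate, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data with conditionally independent Gaussian features (illustrative)
n, m = 500, 3
y = rng.integers(0, 2, size=n)                         # boolean labels
mu_true = np.array([[0.0, 0.0, 0.0], [1.5, -1.0, 0.5]])
X = mu_true[y] + rng.normal(size=(n, m))               # unit-variance noise per feature

# --- Gaussian Naive Bayes: uncoupled, closed-form estimates ---
pi = np.array([np.mean(y == k) for k in (0, 1)])           # class priors
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])    # per-class means
var = X.var(axis=0)                                        # shared per-feature variance (sigma_{k,j} = sigma_j)

def nb_posterior(x):
    # log pi_k + sum_j log N(x_j | mu_{k,j}, sigma_j^2), then normalize via Bayes rule
    log_joint = np.log(pi) - 0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    log_joint -= log_joint.max()
    p = np.exp(log_joint)
    return p[1] / p.sum()                                  # P(y = 1 | x)

# --- Logistic regression: coupled estimates via gradient ascent on the conditional log-likelihood ---
Xb = np.hstack([np.ones((n, 1)), X])                       # prepend intercept feature x^0 = 1
theta = np.zeros(m + 1)
alpha = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))                  # mu(x) = 1 / (1 + exp(-theta^T x))
    theta += alpha * Xb.T @ (y - p) / n                    # ascend the log-likelihood gradient

x_new = np.array([1.0, -0.5, 0.2])
print("NB  P(y=1|x) =", nb_posterior(x_new))
print("LR  P(y=1|x) =", 1.0 / (1.0 + np.exp(-np.r_[1.0, x_new] @ theta)))
```

With the shared-variance assumption, both models describe the same family of linear decision boundaries, which is why the asymptotic comparison above can make them coincide when the Naïve Bayes assumptions hold.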
Machine Learning 10-701/15-781, Fall 2011
Linear Regression and Sparsity
Eric Xing
Lecture 4, September 21, 2011
Reading:

Machine learning for apartment hunting
- Now you've moved to Pittsburgh!!
- And you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus, ...

  Living area (ft²) | # bedrooms | Rent ($)
  230               | 1          | 600
  506               | 2          | 1000
  433               | 2          | 1100
  109               | 1          | 500
  ...               | ...        | ...
  150               | 1          | ?
  270               | 1.5        | ?

The learning problem
- Features: living area, distance to campus, # bedrooms, ...; denote them as $x = [x^1, x^2, \ldots, x^k]$
- Target: rent; denoted as y
- Training set:
$$X = \begin{bmatrix} -\; x_1 \;- \\ -\; x_2 \;- \\ \vdots \\ -\; x_n \;- \end{bmatrix} = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^k \\ x_2^1 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^k \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \;\text{ or }\; \begin{bmatrix} -\; y_1 \;- \\ -\; y_2 \;- \\ \vdots \\ -\; y_n \;- \end{bmatrix}$$
  (figures: scatter plots of rent vs. living area, and of rent vs. living area and location)

Linear Regression
- Assume that Y (the target) is a linear function of X (the features):
  - e.g., $\hat{y} = \theta_0 + \theta_1 x^1 + \theta_2 x^2$
- Let's assume a vacuous "feature" $x^0 = 1$ (this is the intercept term; why?), and define the feature vector to be $x = [x^0, x^1, \ldots, x^k]$
- Then we have the following general representation of the linear function: $\hat{y} = \theta^T x$
- Our goal is to pick the optimal $\theta$. How?
- We seek the $\theta$ that minimizes the following cost function:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(\hat{y}_i(\vec{x}_i) - y_i\big)^2$$

The Least-Mean-Square (LMS) method
- The cost function:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i^T\theta - y_i\big)^2$$
- Consider a gradient descent algorithm:
$$\theta_j^{t+1} = \theta_j^{t} - \alpha \frac{\partial}{\partial\theta_j} J(\theta^t)$$

The Least-Mean-Square (LMS) method
- Now we have the following descent rule:
$$\theta_j^{t+1} = \theta_j^{t} + \alpha \sum_{i=1}^{n}\big(y_i - x_i^T\theta^t\big)x_i^j$$
- For a single training point, we have:
$$\theta_j^{t+1} = \theta_j^{t} + \alpha \big(y_i - x_i^T\theta^t\big)x_i^j$$
- This is known as the LMS update rule, or the Widrow-Hoff learning rule
- It is actually a "stochastic", "coordinate" descent algorithm
- It can be used as an on-line algorithm

Geometric Interpretation and Convergence of LMS
(figures: the fitted solution after N = 1, 2, and 3 updates)
$$\theta^{t+1} = \theta^{t} + \alpha \big(y_i - x_i^T\theta^t\big)\vec{x}_i$$
- Claim: when the step size α satisfies certain conditions, and when certain other technical conditions are satisfied, LMS will converge to an "optimal region".

Steepest Descent and LMS
- Steepest descent. Note that:
$$\nabla_\theta J = \left[\frac{\partial J}{\partial\theta_1}, \ldots, \frac{\partial J}{\partial\theta_k}\right]^T = -\sum_{i=1}^{n}\big(y_i - x_i^T\theta\big)x_i$$
$$\theta^{t+1} = \theta^{t} + \alpha \sum_{i=1}^{n}\big(y_i - x_i^T\theta^t\big)x_i$$
- This is a batch gradient descent algorithm

The normal equations
- Write the cost function in matrix form:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\big(x_i^T\theta - y_i\big)^2 = \frac{1}{2}\big(X\theta - \vec{y}\big)^T\big(X\theta - \vec{y}\big) = \frac{1}{2}\big(\theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y}\big)$$
  where
$$X = \begin{bmatrix} -\; x_1 \;- \\ -\; x_2 \;- \\ \vdots \\ -\; x_n \;- \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
- To minimize J(θ), take the derivative and set it to zero:
$$\nabla_\theta J = \frac{1}{2}\nabla_\theta \operatorname{tr}\big(\theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y}\big) = \frac{1}{2}\big(\nabla_\theta \operatorname{tr}\,\theta^T X^T X\theta - 2\nabla_\theta \operatorname{tr}\,\vec{y}^T X\theta + \nabla_\theta \operatorname{tr}\,\vec{y}^T\vec{y}\big)$$
$$= \frac{1}{2}\big(X^T X\theta + X^T X\theta - 2X^T\vec{y}\big) = X^T X\theta - X^T\vec{y} = 0$$
$$\Rightarrow\; X^T X\theta = X^T\vec{y} \quad\text{(the normal equations)} \qquad\Rightarrow\; \theta^* = \big(X^T X\big)^{-1} X^T\vec{y}$$

Some matrix derivatives
- For $f: \mathbb{R}^{m\times n} \to \mathbb{R}$, define:
$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
- Trace: $\operatorname{tr} a = a$ for a scalar $a$, $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$, $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
- Some facts about matrix derivatives (without proof):
$$\nabla_A \operatorname{tr} AB = B^T, \qquad \nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T, \qquad \nabla_A |A| = |A|\big(A^{-1}\big)^T$$
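To tie these update rules back to the apartment example, here is a small NumPy sketch (not from the slides) that runs the stochastic LMS rule and the batch steepest-descent rule on the four complete rows of the table above and compares them with the normal-equation solution. The feature standardization, step sizes, iteration counts, random seed, and function names are illustrative assumptions.

```python
import numpy as np

# Apartment examples from the table above: living area (ft^2), # bedrooms -> rent ($)
area = np.array([230.0, 506.0, 433.0, 109.0])
beds = np.array([1.0, 2.0, 2.0, 1.0])
y = np.array([600.0, 1000.0, 1100.0, 500.0])

# Standardize the features (illustrative preprocessing so one step size alpha suits every
# coordinate) and prepend the vacuous intercept feature x^0 = 1.
feats = np.column_stack([area, beds])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
X = np.column_stack([np.ones(len(y)), feats])

def lms(X, y, alpha=0.02, steps=20000, seed=0):
    """Stochastic LMS / Widrow-Hoff rule: update theta from one training point at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))                       # visit a single example
        theta += alpha * (y[i] - X[i] @ theta) * X[i]  # theta <- theta + alpha (y_i - x_i^T theta) x_i
    return theta

def steepest_descent(X, y, alpha=0.1, iters=2000):
    """Batch rule: sum every example's gradient contribution before each update."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)         # theta <- theta + alpha sum_i (y_i - x_i^T theta) x_i
    return theta

print("LMS (stochastic):", lms(X, y))                  # hovers in a region around the optimum
print("steepest descent:", steepest_descent(X, y))
print("normal equations:", np.linalg.solve(X.T @ X, X.T @ y))
```

With a constant step size the stochastic LMS estimate only settles into a neighborhood of the least-squares solution, which matches the "optimal region" claim on the convergence slide, while the batch rule and the normal equations agree closely.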
Comments on the normal equation
- In most situations of practical interest, the number of data points N is larger than the dimensionality k of the input space and the matrix X has full column rank. If this condition holds, it is easy to verify that $X^T X$ is necessarily invertible.
- The assumption that $X^T X$ is invertible implies that it is positive definite, so the critical point we have found is a minimum.
- What if X has less than full column rank? → regularization (later).

Direct and Iterative methods
- Direct methods: we can obtain the solution in a single step by solving the normal equation
  - Using Gaussian elimination or QR decomposition, we converge in a finite number of steps
  - This can be infeasible when the data arrive as a real-time stream, or are very large in volume
- Iterative methods: stochastic or steepest gradient descent
  - They converge in a limiting sense
  - But they are more attractive in large practical problems
  - Caution is needed in choosing the learning rate α

Convergence rate
- Theorem: the steepest descent algorithm converges to the minimum of the cost characterized by the normal equations, provided the step size α is small enough (the admissible range depends on $X^T X$).
- A formal analysis of LMS needs more mathematical muscle; in practice, one can use a small α or gradually decrease α.

A Summary
- LMS update rule:
$$\theta_j^{t+1} = \theta_j^{t} + \alpha\big(y_n - x_n^T\theta^t\big)x_n^j$$
  - Pros: on-line, low per-step cost, fast convergence, and perhaps less prone to local optima
  - Cons: convergence to the optimum is not always guaranteed
- Steepest descent:
$$\theta^{t+1} = \theta^{t} + \alpha\sum_{i=1}^{n}\big(y_i - x_i^T\theta^t\big)x_i$$
  - Pros: easy to implement, conceptually clean, guaranteed convergence
  - Cons: batch update, often slow to converge
- Normal equations:
$$\theta^* = \big(X^T X\big)^{-1} X^T\vec{y}$$
  - Pros: a single-shot algorithm! Easiest to implement.
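The precise step-size condition in the convergence theorem did not survive extraction; a standard sufficient condition for the batch rule on this quadratic cost is $0 < \alpha < 2/\lambda_{\max}(X^T X)$. The following NumPy sketch illustrates that threshold and contrasts the direct and iterative routes; the random problem, its sizes, and the specific α values are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random well-posed least-squares problem (N > k, full column rank) -- illustrative data
N, k = 200, 5
X = rng.normal(size=(N, k))
theta_true = rng.normal(size=k)
y = X @ theta_true + 0.1 * rng.normal(size=N)

# Direct method: one shot via the normal equations X^T X theta = X^T y
theta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative method: batch steepest descent; the updates contract when 0 < alpha < 2 / lambda_max(X^T X)
lam_max = np.linalg.eigvalsh(X.T @ X).max()

def steepest_descent(alpha, iters=500):
    theta = np.zeros(k)
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)   # batch gradient step on J(theta)
    return theta

good = steepest_descent(alpha=1.0 / lam_max)     # inside the admissible range: converges
bad = steepest_descent(alpha=2.5 / lam_max)      # step size too large: the iterates blow up

print("direct vs. iterative gap:", np.max(np.abs(theta_direct - good)))
print("diverged estimate norm:  ", np.linalg.norm(bad))
```

Scaling the step size by an estimate of $\lambda_{\max}(X^T X)$ is one simple way to exercise the "caution in choosing the learning rate α" that the slides call for.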