
Machine Learning (CS 567)
Lecture 5, Fall 2008
Time: T-Th 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy ([email protected]), Office: SAL 216, Office hours: by appointment
Teaching assistant: Cheol Han ([email protected]), Office: SAL 229, Office hours: M 2-3pm, W 11-12
Class web page: http://www-scf.usc.edu/~csci567/index.html

Administrative: Regrading
• This regrading policy holds for quizzes, homeworks, the midterm, and the final exam.
• Grading complaint policy: if students have problems with how an assignment was graded, they should feel free to talk to the instructor/TA about it within one week of getting the graded work back. After one week, no regrading will be done.
• Even if students request regrading of only one of the answers, the entire assignment will be checked and graded again (so they take the risk of possibly losing points overall).

Lecture 5 Outline
• Logistic Regression (from last week)
• Linear Discriminant Analysis
• Off-the-shelf Classifiers

Linear Threshold Units

$$h(x) = \begin{cases} +1 & \text{if } w_1 x_1 + \cdots + w_n x_n \ge w_0 \\ -1 & \text{otherwise} \end{cases}$$

• We assume that each feature x_j and each weight w_j is a real number (we will relax this later)
• We will study three different algorithms for learning linear threshold units:
  – Perceptron: classifier
  – Logistic Regression: conditional distribution
  – Linear Discriminant Analysis: joint distribution

Linear Discriminant Analysis
• Learn P(x, y). This is sometimes called the generative approach, because we can think of P(x, y) as a model of how the data is generated.
  – For example, if we factor the joint distribution into the form P(x, y) = P(y) P(x | y),
  – we can think of P(y) as "generating" a value for y according to P(y). Then we can think of P(x | y) as generating a value for x given the previously-generated value for y.
  – This can be described as a Bayesian network.

Linear Discriminant Analysis (2)
• P(y) is a discrete multinomial distribution
  – example: P(y = 0) = 0.31, P(y = 1) = 0.69 will generate 31% negative examples and 69% positive examples
• For LDA, we assume that P(x | y) is a multivariate normal distribution with mean \mu_k and covariance matrix \Sigma:

$$P(x \mid y = k) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_k]^T \Sigma^{-1} [x - \mu_k] \right)$$

Multivariate Normal Distributions: A tutorial
• Recall that the univariate normal (Gaussian) distribution has the formula

$$p(x) = \frac{1}{(2\pi)^{1/2} \sigma} \exp\left[ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right]$$

• where \mu is the mean and \sigma^2 is the variance
• Graphically, it looks like this: (figure)

The Multivariate Gaussian
• A 2-dimensional Gaussian is defined by a mean vector \mu = (\mu_1, \mu_2) and a covariance matrix

$$\Sigma = \begin{bmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{1,2} & \sigma^2_{2,2} \end{bmatrix}$$

• where \sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)] is the variance (if i = j) or covariance (if i \ne j).

The Multivariate Gaussian (2)
• If \Sigma is the identity matrix $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and \mu = (0, 0), we get the standard normal distribution: (figure)

The Multivariate Gaussian (3)
• If \Sigma is a diagonal matrix, then x_1 and x_2 are independent random variables, and lines of equal probability are ellipses parallel to the coordinate axes. For example, when

$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3)$$

we obtain: (figure)
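As a concrete companion to the density formula above, here is a minimal NumPy sketch (not part of the original slides; the function name gaussian_density is ours) that evaluates the multivariate Gaussian density for the diagonal-covariance example \mu = (2, 3), \Sigma = diag(2, 1).

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density, following the formula on the slide:
    (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * quad)

# The diagonal-covariance example from the slide: mu = (2, 3), Sigma = diag(2, 1).
mu = np.array([2.0, 3.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0]])

print(gaussian_density(mu, mu, Sigma))                     # density at the mean (the peak)
print(gaussian_density(np.array([3.0, 3.0]), mu, Sigma))   # one unit away along x1
```

Because \Sigma is diagonal here, the density factors into a product of two univariate Gaussians, which is why the equal-probability contours are axis-aligned ellipses.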
The Multivariate Gaussian (4)
• Finally, if \Sigma is an arbitrary matrix, then x_1 and x_2 are dependent, and lines of equal probability are ellipses tilted relative to the coordinate axes. For example, when

$$\Sigma = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \quad \text{and} \quad \mu = (2, 3)$$

we obtain: (figure)

Estimating a Multivariate Gaussian
• Given a set of N data points {x_1, ..., x_N}, we can compute the maximum likelihood estimate for the multivariate Gaussian distribution as follows:

$$\hat{\mu} = \frac{1}{N} \sum_i x_i$$
$$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu}) \cdot (x_i - \hat{\mu})^T$$

• Note that the dot product in the second equation is an outer product. The outer product of two vectors is a matrix:

$$x \cdot y^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \cdot [\, y_1 \;\; y_2 \;\; y_3 \,] = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$$

• For comparison, the usual dot product is written as $x^T \cdot y$

The LDA Model
• Linear discriminant analysis assumes that the joint distribution has the form

$$P(x, y) = P(y) \, \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y] \right)$$

where each \mu_y is the mean of a multivariate Gaussian for examples belonging to class y and \Sigma is a single covariance matrix shared by all classes.

Fitting the LDA Model
• It is easy to learn the LDA model in two passes through the data:
  – Let \hat{\pi}_k be our estimate of P(y = k)
  – Let N_k be the number of training examples belonging to class k

$$\hat{\pi}_k = \frac{N_k}{N}$$
$$\hat{\mu}_k = \frac{1}{N_k} \sum_{\{i : y_i = k\}} x_i$$
$$\hat{\Sigma} = \frac{1}{N} \sum_i (x_i - \hat{\mu}_{y_i}) \cdot (x_i - \hat{\mu}_{y_i})^T$$

• Note that from each x_i we subtract its corresponding class mean \hat{\mu}_{y_i} before taking the outer product. This gives us the "pooled" estimate of \Sigma.

LDA learns an LTU
• Consider the 2-class case with a 0/1 loss function. Recall that

$$P(y = 0 \mid x) = \frac{P(x, y = 0)}{P(x, y = 0) + P(x, y = 1)}$$
$$P(y = 1 \mid x) = \frac{P(x, y = 1)}{P(x, y = 0) + P(x, y = 1)}$$

• Also recall from our derivation of the Logistic Regression classifier that we should classify into class ŷ = 1 if

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} > 0$$

• Hence, for LDA, we should classify into ŷ = 1 if

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} > 0$$

because the denominators cancel.

LDA learns an LTU (2)

$$P(x, y) = P(y) \, \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_y]^T \Sigma^{-1} [x - \mu_y] \right)$$

$$\frac{P(x, y = 1)}{P(x, y = 0)} = \frac{P(y = 1) \, \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] \right)}{P(y = 0) \, \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)}$$

$$\frac{P(x, y = 1)}{P(x, y = 0)} = \frac{P(y = 1) \exp\left( -\frac{1}{2} [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] \right)}{P(y = 0) \exp\left( -\frac{1}{2} [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)}$$

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \left( [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)$$

LDA learns an LTU (3)
• Let's focus on the term in brackets:

$$\left( [x - \mu_1]^T \Sigma^{-1} [x - \mu_1] - [x - \mu_0]^T \Sigma^{-1} [x - \mu_0] \right)$$

• Expand the quadratic forms as follows:

$$[x - \mu_1]^T \Sigma^{-1} [x - \mu_1] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1$$
$$[x - \mu_0]^T \Sigma^{-1} [x - \mu_0] = x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$

• Subtract the lower from the upper line and collect similar terms. Note that the quadratic terms cancel! This leaves only terms linear in x:

$$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$$
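The claim that the quadratic terms cancel is easy to check numerically. The snippet below (our own sanity check, not from the slides) compares the difference of quadratic forms against the linear-plus-constant expression just derived, for a random x, random means, and a random positive-definite \Sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random problem instance: two means and a random positive-definite covariance.
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)          # positive definite by construction
Sinv = np.linalg.inv(Sigma)

x = rng.normal(size=n)

# Left side: difference of the two quadratic forms from the slide.
lhs = (x - mu1) @ Sinv @ (x - mu1) - (x - mu0) @ Sinv @ (x - mu0)

# Right side: the expression left after the quadratic terms cancel.
rhs = (x @ Sinv @ (mu0 - mu1) + (mu0 - mu1) @ Sinv @ x
       + mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)

print(np.isclose(lhs, rhs))   # True: only terms linear in x (plus constants) remain
```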
LDA learns an LTU (4)

$$x^T \Sigma^{-1} (\mu_0 - \mu_1) + (\mu_0 - \mu_1)^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0$$

• Note that since \Sigma^{-1} is symmetric, $a^T \Sigma^{-1} b = b^T \Sigma^{-1} a$ for any two vectors a and b. Hence, the first two terms can be combined to give

$$2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0.$$

• Now plug this back in...

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \left[ 2 x^T \Sigma^{-1} (\mu_0 - \mu_1) + \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right]$$

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

LDA learns an LTU (5)

$$\log \frac{P(x, y = 1)}{P(x, y = 0)} = \log \frac{P(y = 1)}{P(y = 0)} + x^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

Let

$$w = \Sigma^{-1} (\mu_1 - \mu_0)$$
$$c = \log \frac{P(y = 1)}{P(y = 0)} - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

Then we will classify into class ŷ = 1 if $w \cdot x + c > 0$. This is an LTU.

Two Geometric Views of LDA
View 1: Mahalanobis Distance
• The quantity $D_M^2(x, u) = (x - u)^T \Sigma^{-1} (x - u)$ is known as the (squared) Mahalanobis distance between x and u.
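Putting the pieces together, the following NumPy sketch (our own illustration on synthetic data; all variable names are invented) fits the parameters exactly as on the "Fitting the LDA Model" slide and then classifies with the resulting LTU $w \cdot x + c > 0$.

```python
import numpy as np

def fit_lda(X, y):
    """Two-pass LDA fit: class priors, class means, and the pooled covariance."""
    N, n = X.shape
    classes = np.unique(y)
    pi = {k: np.mean(y == k) for k in classes}               # pi_hat_k = N_k / N
    mu = {k: X[y == k].mean(axis=0) for k in classes}        # mu_hat_k
    centered = X - np.stack([mu[k] for k in y])              # subtract each x_i's class mean
    Sigma = centered.T @ centered / N                        # pooled covariance estimate
    return pi, mu, Sigma

def lda_ltu(pi, mu, Sigma):
    """Turn the fitted parameters into the linear threshold unit (w, c)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu[1] - mu[0])
    c = (np.log(pi[1] / pi[0])
         - 0.5 * mu[1] @ Sinv @ mu[1]
         + 0.5 * mu[0] @ Sinv @ mu[0])
    return w, c

# Synthetic data drawn from the LDA model itself (shared covariance, two class means).
rng = np.random.default_rng(1)
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], Sigma_true, size=310)
X1 = rng.multivariate_normal([2.0, 3.0], Sigma_true, size=690)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(len(X0), dtype=int), np.ones(len(X1), dtype=int)])

pi, mu, Sigma = fit_lda(X, y)
w, c = lda_ltu(pi, mu, Sigma)
y_hat = (X @ w + c > 0).astype(int)      # classify y_hat = 1 iff w . x + c > 0
print("training accuracy:", np.mean(y_hat == y))
```

The same decision rule can be read in the Mahalanobis view above: the log-odds equals $\log \frac{P(y=1)}{P(y=0)}$ minus half the difference of the squared Mahalanobis distances from x to the two class means.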