CS 567) Lecture 5

Machine Learning (CS 567) Lecture 5 Fall 2008 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy ([email protected]) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol Han ([email protected]) Office: SAL 229 Office hours: M 2-3pm, W 11-12 Class web page: http://www-scf.usc.edu/~csci567/index.html Fall 2008 1 Lecture 5 - Sofus A. Macskassy Administrative: Regrading • This regrading policy holds for quizzes, homeworks, midterm and final exam. • Grading complaint policy: if students have problems at assignment grading, feel free to talk to the instructor/TA for it within one week of getting the graded work back. After one week no regrading will be done. • Even if students only request re-grading one of the answers, all this assignment will be checked and graded again. (take the risk of possible losing total points.) Fall 2008 2 Lecture 4 - Sofus A. Macskassy Lecture 5 Outline • Logistic Regression (from last week) • Linear Discriminant Analysis • Off-the-shelf Classifiers Fall 2008 3 Lecture 5 - Sofus A. Macskassy Linear Threshold Units +1 if w x + : : : + w x w h(x) = 1 1 n n 0 1 otherwise ¸ ( ¡ • We assume that each feature xj and each weight wj is a real number (we will relax this later) • We will study three different algorithms for learning linear threshold units: – Perceptron: classifier – Logistic Regression: conditional distribution – Linear Discriminant Analysis: joint distribution Fall 2008 4 Lecture 5 - Sofus A. Macskassy Linear Discriminant Analysis • Learn P(x,y). This is sometimes called the generative approach, because we can think of P(x,y) as a model of how the data is generated. y – For example, if we factor the joint distribution into the form P(x,y) = P(y) P(x | y) – we can think of P(y) as “generating” a value for y according to P(y). Then we can think of P(x | y) as generating a x value for x given the previously- generated value for y. – This can be described as a Bayesian network Fall 2008 5 Lecture 5 - Sofus A. Macskassy Linear Discriminant Analysis (2) • P(y) is a discrete multinomial distribution – example: P(y = 0) = 0.31, P(y = 1) = 0.69 will y generate 31% negative examples and 69% positive examples • For LDA, we assume that P(x | y) is a multivariate normal distribution with mean x k and covariance matrix 1 1 T 1 P (x y = k) = exp [x ¹k] §¡ [x ¹k] j (2¼)n=2 § 1=2 ¡2 ¡ ¡ j j µ ¶ Fall 2008 6 Lecture 5 - Sofus A. Macskassy Multivariate Normal Distributions: A tutorial • Recall that the univariate normal (Gaussian) distribution has the formula 1 1(x ¹)2 ( ) = exp p x ¡2 (2¼)1=2¾ "¡2 ¾ # • where is the mean and 2 is the variance • Graphically, it looks like this: Fall 2008 7 Lecture 5 - Sofus A. Macskassy The Multivariate Gaussian • A 2-dimensional Gaussian is defined by a mean vector = (1,2) and a covariance matrix 2 2 ¾1;1 ¾1;2 § = 2 2 " ¾1;2 ¾2;2 # 2 • where i,j = E[(xi – i)(xj - j)] is the variance (if i = j) or co-variance (if i j). Fall 2008 8 Lecture 5 - Sofus A. Macskassy The Multivariate Gaussian (2) 1 0 • If is the identity matrix § = and " 0 1 # = (0, 0), we get the standard normal distribution: Fall 2008 9 Lecture 5 - Sofus A. Macskassy The Multivariate Gaussian (3) • If is a diagonal matrix, then x1, and x2 are independent random variables, and lines of equal probability are ellipses parallel to the coordinate axes. For example, when 2 0 § = " 0 1 # and ¹ = (2; 3) we obtain Fall 2008 10 Lecture 5 - Sofus A. Macskassy The Multivariate Gaussian (4) • Finally, if is an arbitrary matrix, then x1 and x2 are dependent, and lines of equal probability are ellipses tilted relative to the coordinate axes. For example, when 2 0:5 § = and " 0:5 1 # ¹ = (2; 3) we obtain Fall 2008 11 Lecture 5 - Sofus A. Macskassy Estimating a Multivariate Gaussian • Given a set of N data points {x1, …, xN}, we can compute the maximum likelihood estimate for the multivariate Gaussian distribution as follows: 1 ¹^ = x N i Xi 1 §^ = (x ¹^) (x ¹^)T N i ¡ ¢ i ¡ Xi • Note that the dot product in the second equation is an outer product. The outer product of two vectors is a matrix: x1 x1y1 x1y2 x1y3 T x y = x2 [y1 y2 y3] = x2y1 x2y2 x2y3 ¢ 2 3¢ 2 3 x3 x3y1 x3y2 x3y3 6 7 6 7 4 5 4 5 • For comparison, the usual dot product is written as xT¢ y Fall 2008 12 Lecture 5 - Sofus A. Macskassy The LDA Model • Linear discriminant analysis assumes that the joint distribution has the form P(x|y) 1 1 T 1 P (x; y) = P (y) exp [x ¹y] §¡ [x ¹y] (2¼)n=2 § 1=2 ¡2 ¡ ¡ j j µ ¶ where each y is the mean of a multivariate Gaussian for examples belonging to class y and is a single covariance matrix shared by all classes. Fall 2008 13 Lecture 5 - Sofus A. Macskassy Fitting the LDA Model • It is easy to learn the LDA model in two passes through the data: – Let ¼^ k be our estimate of P(y = k) – Let Nk be the number of training examples belonging to class k. N ¼^ = k k N 1 ¹^ = x k N i k i:y =k f Xi g 1 T §^ = (x ¹^y ) (x ¹^y ) N i ¡ i ¢ i ¡ i Xi • Note that each xi is subtracted from its corresponding ¹^ yi prior to taking the outer product. This gives us the “pooled” estimate of Fall 2008 14 Lecture 5 - Sofus A. Macskassy LDA learns an LTU • Consider the 2-class case with a 0/1 loss function. Recall that P (x; y = 0) P (y = 0 x) = j P (x; y = 0) + P (x; y = 1) P (x; y = 1) P (y = 1 x) = j P (x; y = 0) + P (x; y = 1) • Also recall from our derivation of the Logistic Regression classifier that we should classify into class ŷ = 1 if P (y = 1 x) log j > 0 P (y = 0 x) j • Hence, for LDA, we should classify into ŷ = 1 if P (x; y = 1) log > 0 P (x; y = 0) because the denominators cancel Fall 2008 15 Lecture 5 - Sofus A. Macskassy LDA learns an LTU (2) 1 1 T 1 P (x; y) = P (y) exp [x ¹y] §¡ [x ¹y] (2¼)n=2 § 1=2 ¡2 ¡ ¡ j j µ ¶ 1 1 T 1 P (y = 1) exp [x ¹1] § [x ¹1] P (x; y = 1) (2¼)n=2 § 1=2 ¡2 ¡ ¡ ¡ = j j 1 ³ 1 T 1 ´ P (x; y = 0) P (y = 0) exp [x ¹0] § [x ¹0] (2¼)n=2 § 1=2 ¡2 ¡ ¡ ¡ j j ³ ´ 1 T 1 P (x; y = 1) P (y = 1) exp [x ¹1] §¡ [x ¹1] = ¡2 ¡ ¡ ³ 1 T 1 ´ P (x; y = 0) P (y = 0) exp [x ¹0] § [x ¹0] ¡2 ¡ ¡ ¡ ³ ´ P (x; y = 1) P (y = 1) 1 T 1 T 1 log = log [x ¹1] § [x ¹1] [x ¹0] § [x ¹0] P (x; y = 0) P (y = 0) ¡ 2 ¡ ¡ ¡ ¡ ¡ ¡ ¡ ³ ´ Fall 2008 16 Lecture 5 - Sofus A. Macskassy LDA learns an LTU (3) • Let’s focus on the term in brackets: T 1 T 1 [x ¹1] § [x ¹1] [x ¹0] § [x ¹0] ¡ ¡ ¡ ¡ ¡ ¡ ¡ ³ ´ • Expand the quadratic forms as follows: T 1 T 1 T 1 T 1 T 1 [x ¹1] § [x ¹1] = x § x x § ¹1 ¹ § x + ¹ § ¹1 ¡ ¡ ¡ ¡ ¡ ¡ ¡ 1 ¡ 1 ¡ T 1 T 1 T 1 T 1 T 1 [x ¹0] § [x ¹0] = x § x x § ¹0 ¹ § x + ¹ § ¹0 ¡ ¡ ¡ ¡ ¡ ¡ ¡ 0 ¡ 0 ¡ • Subtract the lower from the upper line and collect similar terms. Note that the quadratic terms cancel! This leaves only terms linear in x. T 1 1 T 1 T 1 x § (¹0 ¹1)+(¹0 ¹1)§ x+¹ § ¹1 ¹ § ¹0 ¡ ¡ ¡ ¡ 1 ¡ ¡ 0 ¡ Fall 2008 17 Lecture 5 - Sofus A. Macskassy LDA learns an LTU (4) T 1 1 T 1 T 1 x § (¹0 ¹1)+(¹0 ¹1)§ x+¹ § ¹1 ¹ § ¹0 ¡ ¡ ¡ ¡ 1 ¡ ¡ 0 ¡ -1 T 1 T 1 • Note that since is symmetric a § ¡ b = b § ¡ a for any two vectors a and b. Hence, the first two terms can be combined to give T 1 T 1 T 1 2x § (¹0 ¹1) + ¹ § ¹1 ¹ § ¹0: ¡ ¡ 1 ¡ ¡ 0 ¡ • Now plug this back in… P (x; y = 1) P (y = 1) 1 T 1 T 1 T 1 log = log 2x § (¹0 ¹1) + ¹ § ¹1 ¹ § ¹0 P (x; y = 0) P (y = 0) ¡ 2 ¡ ¡ 1 ¡ ¡ 0 ¡ h i P (x; y = 1) P (y = 1) T 1 1 T 1 1 T 1 log = log + x § (¹1 ¹0) ¹ § ¹1 + ¹ § ¹0 P (x; y = 0) P (y = 0) ¡ ¡ ¡ 2 1 ¡ 2 0 ¡ Fall 2008 18 Lecture 5 - Sofus A. Macskassy LDA learns an LTU (5) P (x; y = 1) P (y = 1) T 1 1 T 1 1 T 1 log = log + x § (¹1 ¹0) ¹ § ¹1 + ¹ § ¹0 P (x; y = 0) P (y = 0) ¡ ¡ ¡ 2 1 ¡ 2 0 ¡ Let 1 w = § (¹1 ¹0) ¡ ¡ P (y = 1) 1 T 1 1 T 1 c = log ¹ § ¹1 + ¹ § ¹0 P (y = 0) ¡ 2 1 ¡ 2 0 ¡ Then we will classify into class y^ = 1 if w x + c > 0: ¢ This is an LTU. Fall 2008 19 Lecture 5 - Sofus A. Macskassy Two Geometric Views of LDA View 1: Mahalanobis Distance 2 T 1 • The quantity D M ( x ; u ) = ( x u ) § ¡ ( x u ) is known as the (squared) Mahalanobis distance¡ between x and¡ u.

Load more