COMS 4721: Machine Learning for Data Science, Lecture 8, 2/14/2017

Prof. John Paisley

Department of Electrical Engineering & Data Science Institute, Columbia University

LINEAR CLASSIFICATION

BINARY CLASSIFICATION

We focus on binary classification, with input $x_i \in \mathbb{R}^d$ and output $y_i \in \{\pm 1\}$.

▶ We define a classifier $f$, which makes prediction $y_i = f(x_i, \Theta)$ based on a function of $x_i$ and parameters $\Theta$. In other words, $f : \mathbb{R}^d \to \{-1, +1\}$.

Last lecture, we discussed the Bayes classification framework.

▶ Here, $\Theta$ contains: (1) class prior probabilities on $y$, (2) parameters for the class-dependent distribution on $x$.

This lecture we’ll introduce the linear classification framework.

▶ In this approach the prediction is linear in the parameters $\Theta$.

▶ In fact, there is an intersection between the two that we discuss next.

A BAYES CLASSIFIER

Bayes decisions

With the Bayes classifier we predict the class of a new $x$ to be the most probable label given the model and training data $(x_1, y_1), \ldots, (x_n, y_n)$.

In the binary case, we declare class y = 1 if

$$p(x \mid y = 1)\,\underbrace{P(y = 1)}_{\pi_1} \;>\; p(x \mid y = 0)\,\underbrace{P(y = 0)}_{\pi_0}$$

$$\Updownarrow$$
$$\ln \frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} > 0$$

This second line is referred to as the log odds.

A BAYES CLASSIFIER

Gaussian with shared covariance

Let's look at the log odds for the special case where

$$p(x \mid y) = N(x \mid \mu_y, \Sigma)$$
(i.e., a single Gaussian with a shared covariance matrix)

$$\ln \frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} \;=\; \underbrace{\ln\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}_{\text{a constant, call it } w_0} \;+\; x^T \underbrace{\Sigma^{-1}(\mu_1 - \mu_0)}_{\text{a vector, call it } w}$$

This is also called "linear discriminant analysis" (LDA).

A BAYES CLASSIFIER

So we can write the decision rule for this Bayes classifier as a linear one:
$$f(x) = \text{sign}(x^T w + w_0).$$

▶ This is what we saw last lecture (but now class 0 is called $-1$).

▶ The Bayes classifier produced a linear decision boundary in the data space when $\Sigma_1 = \Sigma_0$.

▶ $w$ and $w_0$ are obtained through a specific equation.

[Figure 2.10: If the covariances of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d-1$ dimensions, perpendicular to the line separating the means.]

LINEAR CLASSIFIERS

This Bayes classifier is one instance of a linear classifier

$$f(x) = \text{sign}(x^T w + w_0) \quad \text{where} \quad w_0 = \ln\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0), \qquad w = \Sigma^{-1}(\mu_1 - \mu_0)$$

MLE is used to find values for $\pi_y$, $\mu_y$ and $\Sigma$.
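To make the plug-in recipe concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that estimates $\pi_y$, $\mu_y$, $\Sigma$ by MLE and forms $w_0$ and $w$ as above; the function names `fit_lda`, `predict_lda` and the toy data are illustrative choices.

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in Bayes classifier with shared covariance (labels in {-1, +1}).
    Returns (w, w0) so that the prediction is sign(x^T w + w0)."""
    X1, X0 = X[y == 1], X[y == -1]
    pi1, pi0 = len(X1) / len(X), len(X0) / len(X)   # MLE of class priors
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)     # MLE of class means
    # MLE of the shared covariance: pooled over both classes
    S = ((X1 - mu1).T @ (X1 - mu1) + (X0 - mu0).T @ (X0 - mu0)) / len(X)
    Sinv = np.linalg.inv(S)
    w = Sinv @ (mu1 - mu0)
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sinv @ (mu1 - mu0)
    return w, w0

def predict_lda(X, w, w0):
    return np.sign(X @ w + w0)

# Toy example: two Gaussian classes with a shared covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
y = np.r_[-np.ones(100), np.ones(100)]
w, w0 = fit_lda(X, y)
print("training accuracy:", np.mean(predict_lda(X, w, w0) == y))
```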

Setting w0 and w this way may be too restrictive:

▶ This Bayes classifier assumes a single Gaussian per class with a shared covariance.

▶ Maybe if we relax what values $w_0$ and $w$ can take, we can do better.

LINEAR CLASSIFIERS (BINARY CASE)

Definition: Binary linear classifier

A binary linear classifier is a function of the form

$$f(x) = \text{sign}(x^T w + w_0),$$

d where w R and w0 R. Since the goal is to learn w, w0 from data, we are assuming∈ that ∈ in x is an accurate property of the classes.

Definition: Linear separability

Two sets $A, B \subset \mathbb{R}^d$ are called linearly separable if
$$x^T w + w_0 \;\begin{cases} > 0 & \text{if } x \in A \ \text{(e.g., class } +1) \\ < 0 & \text{if } x \in B \ \text{(e.g., class } -1) \end{cases}$$

The pair $(w, w_0)$ defines an affine hyperplane. It is important to develop the right geometric understanding about what this is doing.

HYPERPLANES

Geometric interpretation of linear classifiers:

A hyperplane $H$ in $\mathbb{R}^d$ is a linear subspace of dimension $(d-1)$.

▶ An $\mathbb{R}^2$-hyperplane is a line.
▶ An $\mathbb{R}^3$-hyperplane is a plane.
▶ As a linear subspace, a hyperplane always contains the origin.

A hyperplane H can be represented by a vector w as follows:

$$H = \left\{ x \in \mathbb{R}^d \;\middle|\; x^T w = 0 \right\}.$$

WHICH SIDE OF THE PLANE ARE WE ON?

Distance from the plane

▶ How close is a point $x$ to $H$?
▶ Cosine rule: $x^T w = \|x\|_2 \, \|w\|_2 \cos\theta$
▶ The distance of $x$ to the hyperplane is $\|x\|_2 \cdot |\cos\theta| = |x^T w| \,/\, \|w\|_2$.

So $|x^T w|$ gives a sense of distance.

Which side of the hyperplane?

▶ The cosine satisfies $\cos\theta > 0$ if $\theta \in \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$.
▶ So the sign of $\cos(\cdot)$ tells us the side of $H$, and by the cosine rule, $\text{sign}(\cos\theta) = \text{sign}(x^T w)$.

AFFINE HYPERPLANES

Affine Hyperplanes

▶ An affine hyperplane $H$ is a hyperplane translated (shifted) using a scalar $w_0$.
▶ Think of: $H = \{x : x^T w + w_0 = 0\}$.
▶ Setting $w_0 > 0$ moves the hyperplane in the opposite direction of $w$. ($w_0 < 0$ in the figure.)


Which side of the hyperplane now?

▶ The plane has been shifted by distance $\frac{-w_0}{\|w\|_2}$ in the direction $w$.
▶ For a given $w$, $w_0$ and input $x$, the inequality $x^T w + w_0 > 0$ says that $x$ is on the far side of an affine hyperplane $H$ in the direction $w$ points.
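To make this geometry concrete, here is a small NumPy sketch (mine, not from the slides) that evaluates $x^T w + w_0$, reports which side of the affine hyperplane $H$ a point falls on, and gives its distance $|x^T w + w_0| / \|w\|_2$ to $H$.

```python
import numpy as np

def hyperplane_side_and_distance(x, w, w0):
    """For the affine hyperplane H = {x : x^T w + w0 = 0}, return
    (side, distance): side = sign(x^T w + w0), distance = |x^T w + w0| / ||w||_2."""
    s = x @ w + w0
    return np.sign(s), abs(s) / np.linalg.norm(w)

# Example: the line x1 + x2 - 1 = 0 in R^2, i.e., w = (1, 1), w0 = -1
w, w0 = np.array([1.0, 1.0]), -1.0
for x in [np.array([2.0, 2.0]), np.array([0.0, 0.0]), np.array([0.5, 0.5])]:
    side, dist = hyperplane_side_and_distance(x, w, w0)
    print(x, "side:", side, "distance:", round(dist, 3))
# (2,2) lies on the side w points toward, (0,0) on the other side,
# and (0.5,0.5) lies exactly on H (side 0, distance 0).
```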

CLASSIFICATION WITH AFFINE HYPERPLANES

[Figure: the affine hyperplane $H$, shifted from the origin by $-w_0/\|w\|_2$ along $w$; points with $\text{sign}(x^T w + w_0) > 0$ lie on the side $w$ points toward, and points with $\text{sign}(x^T w + w_0) < 0$ lie on the other side.]

POLYNOMIAL GENERALIZATIONS

The same generalizations from regression also hold for classification:

▶ (left) A linear classifier using $x = (x_1, x_2)$.
▶ (right) A linear classifier using $x = (x_1, x_2, x_1^2, x_2^2)$. The decision boundary is linear in $\mathbb{R}^4$, but isn't when plotted in $\mathbb{R}^2$.

ANOTHER BAYES CLASSIFIER

Gaussian with different covariance

Let's look at the log odds for the general case where $p(x \mid y) = N(x \mid \mu_y, \Sigma_y)$ (i.e., now each class has its own covariance).

$$\ln \frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} \;=\; \underbrace{(\text{something complicated not involving } x)}_{\text{a constant}} \;+\; \underbrace{x^T \left(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0\right)}_{\text{a part that's linear in } x} \;+\; \underbrace{x^T \left(\Sigma_0^{-1}/2 - \Sigma_1^{-1}/2\right) x}_{\text{a part that's quadratic in } x}$$

Also called "quadratic discriminant analysis," but it's linear in the weights.


ANOTHER BAYES CLASSIFIER

▶ We also saw this last lecture.
▶ Notice that
$$f(x) = \text{sign}(x^T A x + x^T b + c)$$
is linear in $A$, $b$, $c$.
▶ When $x \in \mathbb{R}^2$, rewrite as
$$x \leftarrow (x_1,\, x_2,\, 2x_1x_2,\, x_1^2,\, x_2^2)$$
and do linear classification in $\mathbb{R}^5$.

Whereas the Bayes classifier with shared covariance is a version of linear classification, using different covariances is like polynomial classification.
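A brief sketch (my own, not from the lecture) verifying this rewrite numerically: with $x \in \mathbb{R}^2$ expanded as above, the quadratic classifier's argument becomes a linear function of a single coefficient vector.

```python
import numpy as np

def quad_features(x):
    """Map x in R^2 to (x1, x2, 2*x1*x2, x1^2, x2^2) in R^5."""
    x1, x2 = x
    return np.array([x1, x2, 2 * x1 * x2, x1**2, x2**2])

# A quadratic classifier sign(x^T A x + x^T b + c) ...
A = np.array([[1.0, 0.5],
              [0.5, -2.0]])   # symmetric A
b = np.array([0.3, -1.0])
c = 0.25

# ... is linear in the expanded features: theta collects (b1, b2, a12, a11, a22)
theta = np.array([b[0], b[1], A[0, 1], A[0, 0], A[1, 1]])

x = np.array([0.7, -1.2])
quadratic = x @ A @ x + x @ b + c
linear = quad_features(x) @ theta + c
print(np.isclose(quadratic, linear))  # True: same decision function
```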

[Figure 2.14: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.]

LEAST SQUARES ON $\{-1, +1\}$

How do we define more general classifiers of the form

$$f(x) = \text{sign}(x^T w + w_0)?$$

▶ One simple idea is to treat classification as a regression problem (a code sketch follows this list):
  1. Let $y = (y_1, \ldots, y_n)^T$, where $y_i \in \{-1, +1\}$ is the class of $x_i$.
  2. Add a dimension equal to 1 to each $x_i$ and construct the matrix $X = [x_1, \ldots, x_n]^T$.
  3. Learn the least squares weight vector $w = (X^T X)^{-1} X^T y$.
  4. For a new point $x_0$, declare $y_0 = \text{sign}(x_0^T w)$  ($w_0$ is included in $w$).

▶ Another option: instead of LS, use $\ell_p$ regularization.

▶ These are "baseline" options. We can use them, along with $k$-NN, to get a quick sense of what performance we're aiming to beat.
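A minimal NumPy sketch of the least squares baseline described above (helper names and toy data are my own, not from the lecture); `np.linalg.lstsq` is used instead of forming $(X^TX)^{-1}$ explicitly, for numerical stability.

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least squares on labels in {-1, +1}; a column of ones absorbs w0."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict_ls(X, w):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(X1 @ w)

# Toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w = fit_ls_classifier(X, y)
print("training accuracy:", np.mean(predict_ls(X, w) == y))
```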

SENSITIVITY TO OUTLIERS

Least squares can do well, but it is sensitive to outliers. In general we can find better classifiers that focus more on the decision boundary.

▶ (left) Least squares (purple) does well compared with another method.

▶ (right) Least squares does poorly because of outliers.

THE PERCEPTRON ALGORITHM

EASY CASE: LINEARLY SEPARABLE DATA

(Assume data $x_i$ has a 1 attached.)

Suppose there is a linear classifier with zero training error:

$$y_i = \text{sign}(x_i^T w), \quad \text{for all } i.$$

Then the data is “linearly separable”

Left: Can separate classes with a line. (Can find an infinite number of lines.)

PERCEPTRON (ROSENBLATT, 1958)

Using the linear classifier $y = f(x) = \text{sign}(x^T w)$, the Perceptron seeks to minimize
$$\mathcal{L} = -\sum_{i=1}^{n} \left( y_i \cdot x_i^T w \right) \mathbb{1}\{ y_i \neq \text{sign}(x_i^T w) \}.$$
Because $y \in \{-1, +1\}$,
$$y_i \cdot x_i^T w \;\text{ is }\; \begin{cases} > 0 & \text{if } y_i = \text{sign}(x_i^T w) \\ < 0 & \text{if } y_i \neq \text{sign}(x_i^T w) \end{cases}$$
By minimizing $\mathcal{L}$ we're trying to always predict the correct label.
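As a quick aside (my own sketch, not part of the lecture), the objective $\mathcal{L}$ is easy to write in NumPy: only misclassified points contribute, each by the magnitude of $y_i x_i^T w$.

```python
import numpy as np

def perceptron_loss(X, y, w):
    """L = -sum_i (y_i * x_i^T w) * 1{y_i != sign(x_i^T w)}, with labels in {-1, +1}."""
    margins = y * (X @ w)        # y_i * x_i^T w for every point
    misclassified = margins < 0  # negative margin <=> wrong sign
    # (a point exactly on the boundary has margin 0 and contributes 0 either way)
    return -np.sum(margins[misclassified])
```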

LEARNING THE PERCEPTRON

▶ Unlike other techniques we've talked about, we can't find the minimum of $\mathcal{L}$ by taking a derivative and setting it to zero:

$$\nabla_w \mathcal{L} = 0 \quad \text{cannot be solved for } w \text{ analytically.}$$

However, $\nabla_w \mathcal{L}$ does tell us the direction in which $\mathcal{L}$ is increasing in $w$.

▶ Therefore, for a sufficiently small $\eta$, if we update

$$w' \leftarrow w - \eta \nabla_w \mathcal{L},$$
then $\mathcal{L}(w') < \mathcal{L}(w)$, i.e., we have a better value for $w$.

▶ This is a very general method for optimizing an objective function, called gradient descent. The Perceptron uses a "stochastic" version of this.
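To illustrate the general recipe (a toy example of my own, not from the slides), here is plain gradient descent applied to a smooth objective, the least squares loss $\mathcal{L}(w) = \|Xw - y\|_2^2$, using the update $w \leftarrow w - \eta \nabla_w \mathcal{L}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w, eta = np.zeros(3), 0.001
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y)   # gradient of ||Xw - y||^2
    w = w - eta * grad             # small step against the gradient
print(w)  # close to the least squares solution, roughly (1, -2, 0.5)
```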

LEARNING THE PERCEPTRON

Input: Training data $(x_1, y_1), \ldots, (x_n, y_n)$ and a positive step size $\eta$

1. Set $w^{(1)} = \vec{0}$
2. For step $t = 1, 2, \ldots$ do
   a) Search for all examples $(x_i, y_i) \in \mathcal{D}$ such that $y_i \neq \text{sign}(x_i^T w^{(t)})$
   b) If such a $(x_i, y_i)$ exists, randomly pick one and update
      $$w^{(t+1)} = w^{(t)} + \eta y_i x_i,$$
      Else: Return $w^{(t)}$ as the solution since everything is classified correctly.

If $\mathcal{M}_t$ indexes the misclassified observations at step $t$, then we have
$$\mathcal{L} = -\sum_{i=1}^{n} \left( y_i \cdot x_i^T w \right) \mathbb{1}\{ y_i \neq \text{sign}(x_i^T w) \}, \qquad \nabla_w \mathcal{L} = -\sum_{i \in \mathcal{M}_t} y_i x_i .$$

The full gradient step is $w^{(t+1)} = w^{(t)} - \eta \nabla_w \mathcal{L}$. Stochastic optimization just picks out one element in $\nabla_w \mathcal{L}$; we could have also used the full summation.
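Putting the pieces together, here is a minimal NumPy sketch of the perceptron update loop described above (my own illustrative implementation, not the lecture's code; the iteration cap `max_steps` is an addition so the loop also terminates on non-separable data).

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_steps=10_000, seed=0):
    """Perceptron with labels y in {-1, +1}. A column of ones is appended
    to X so that w0 is part of w. Returns the learned weight vector."""
    rng = np.random.default_rng(seed)
    X1 = np.hstack([X, np.ones((len(X), 1))])        # attach the 1 to each x_i
    w = np.zeros(X1.shape[1])                         # step 1: w^(1) = 0
    for _ in range(max_steps):                        # cap not in the slide's algorithm
        misclassified = np.where(y * (X1 @ w) <= 0)[0]  # y_i != sign(x_i^T w)
        if len(misclassified) == 0:
            break                                     # everything classified correctly
        i = rng.choice(misclassified)                 # randomly pick one
        w = w + eta * y[i] * X1[i]                    # w <- w + eta * y_i * x_i
    return w

# Linearly separable toy data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)), rng.normal(2, 0.5, size=(50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w = perceptron(X, y)
X1 = np.hstack([X, np.ones((len(X), 1))])
print("training accuracy:", np.mean(np.sign(X1 @ w) == y))
```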

LEARNING THE PERCEPTRON

[Figure: scatter plot of the data; red = +1, blue = −1, $\eta = 1$.]

1. Pick a misclassified $(x_i, y_i)$
2. Set $w \leftarrow w + \eta y_i x_i$

LEARNING THE PERCEPTRON

[Figure: red = +1, blue = −1, $\eta = 1$.]

The update to $w$ defines a new decision boundary (hyperplane).

LEARNING THE PERCEPTRON

[Figure: red = +1, blue = −1, $\eta = 1$.]

1. Pick another misclassified $(x_j, y_j)$
2. Set $w \leftarrow w + \eta y_j x_j$

LEARNING THE PERCEPTRON

[Figure: red = +1, blue = −1, $\eta = 1$.]

Again update $w$, i.e., the hyperplane.

This time we're done.

DRAWBACKS OF PERCEPTRON

The perceptron represents a first attempt at linear classification by directly learning the hyperplane defined by w. It has some drawbacks:

1. When the data is separable, there are an infinite # of hyperplanes.

▶ We may think some are better than others, but this algorithm doesn't take "quality" into consideration. It converges to the first one it finds.

2. When the data isn't separable, the algorithm doesn't converge. The hyperplane of $w$ is always moving around.

▶ It's hard to detect this, since it can take a long time for the algorithm to converge when the data is separable.

Later, we will discuss algorithms that use the same idea of directly learning the hyperplane $w$, but alter the objective function to fix these problems.