Lecture 4 – Discriminant Analysis, k-Nearest Neighbors

Johan Wågberg, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: [email protected]

The mini project has now started!

Task: Classify songs according to whether Andreas likes them or not (based on features from the Spotify Web-API).

Training data: 750 songs labeled as like or dislike.

Test data: 200 unlabeled songs. Upload your predictions to the leaderboard!

Examination: A written report (max 6 pages).

Instructions and data are available at the course homepage: http://www.it.uu.se/edu/course/homepage/sml/project/

You will receive your group password by e-mail tomorrow.

Deadline for submitting your final solution: February 21.

Summary of Lecture 3 (I/VI)

The classification problem amounts to modeling the relationship between the input x and a categorical output y, i.e., the output belongs to one out of M distinct classes.

A classifier is a prediction model $\hat{y}(x)$ that maps any input x to a predicted class $\hat{y} \in \{1, \dots, M\}$. A common classifier predicts each input as belonging to the most likely class according to the conditional probabilities

p(y = m | x) for m ∈ {1,...,M}.

Summary of Lecture 3 (II/VI)

For binary (two-class) classification, y ∈ {−1, 1}, the model is

\[
p(y = 1 \mid x) \approx g(x) = \frac{e^{\theta^{\mathsf{T}} x}}{1 + e^{\theta^{\mathsf{T}} x}}.
\]
The model parameters θ are found by maximum likelihood, by (numerically) maximizing the log-likelihood function,

\[
\log \ell(\theta) = -\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \theta^{\mathsf{T}} x_i}\right).
\]
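As a minimal illustration (an added sketch, not part of the lecture material), the negative of this log-likelihood can be minimized numerically, e.g. with scipy.optimize.minimize; X (n × p) and y (labels in {−1, +1}) denote a hypothetical training set:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    # -log l(theta) = sum_i log(1 + exp(-y_i * theta^T x_i)), computed stably
    margins = y * (X @ theta)
    return np.sum(np.logaddexp(0.0, -margins))

def fit_logistic_regression(X, y):
    """Numerically maximize the log-likelihood (X: n-by-p array, y in {-1, +1})."""
    theta0 = np.zeros(X.shape[1])
    result = minimize(neg_log_likelihood, theta0, args=(X, y), method="BFGS")
    return result.x
```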

Summary of Lecture 3 (III/VI)

Using the common classifier, we get the prediction model

 T 1 if g(x) > 0.5 ⇔ θb x > 0 yb(x) = −1 otherwise

This attempts to minimize the total misclassification error.

More generally, we can use

\[
\hat{y}(x) =
\begin{cases}
1 & \text{if } g(x) > r, \\
-1 & \text{otherwise,}
\end{cases}
\]

where 0 ≤ r ≤ 1 is a user-chosen threshold.

Summary of Lecture 3 (IV/VI)

Confusion matrix:

                          Predicted condition
                      ŷ = 0      ŷ = 1      Total
True        y = 0      TN         FP          N
condition   y = 1      FN         TP          P
            Total      N*         P*

For the classifier
\[
\hat{y}(x) =
\begin{cases}
1 & \text{if } g(x) > r, \\
-1 & \text{otherwise,}
\end{cases}
\]
the numbers in the confusion matrix will depend on the threshold r.
• Decreasing r ⇒ TN, FN decrease and FP, TP increase.
• Increasing r ⇒ TN, FN increase and FP, TP decrease.

Summary of Lecture 3 (V/VI)

Some commonly used performance measures are:

• True positive rate: TPR = TP/P = TP/(FN + TP) ∈ [0, 1]

• False positive rate: FPR = FP/N = FP/(FP + TN) ∈ [0, 1]

• Precision: Prec = TP/P* = TP/(FP + TP) ∈ [0, 1]
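For concreteness, a small helper (an added sketch, not from the slides) that computes these three measures from the confusion-matrix counts TN, FP, FN, TP:

```python
def performance_measures(TN, FP, FN, TP):
    """True positive rate, false positive rate and precision from confusion-matrix counts."""
    tpr = TP / (FN + TP)    # TP / P
    fpr = FP / (FP + TN)    # FP / N
    prec = TP / (FP + TP)   # TP / P*
    return tpr, fpr, prec
```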

Summary of Lecture 3 (VI/VI)

[Figure: ROC curve for logistic regression – true positive rate vs. false positive rate.]

• ROC: plot of TPR vs. FPR as r ranges from 0 to 1.¹
• Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account.

¹ Possible to draw ROC curves based on other tuning parameters as well.
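A rough sketch of how such a curve can be traced out (an illustration under assumed inputs, not the lecture's own code): sweep the threshold r over [0, 1], classify according to g(x) > r, record (FPR, TPR), and approximate the AUC with the trapezoidal rule. Here g_test holds hypothetical predicted probabilities and y_test the corresponding {−1, +1} labels.

```python
import numpy as np

def roc_curve_points(g_test, y_test, thresholds=np.linspace(0.0, 1.0, 101)):
    """Trace out (FPR, TPR) pairs for a grid of thresholds r (assumes both classes occur in y_test)."""
    fpr, tpr = [], []
    for r in thresholds:
        y_hat = np.where(g_test > r, 1, -1)
        tp = np.sum((y_hat == 1) & (y_test == 1))
        fp = np.sum((y_hat == 1) & (y_test == -1))
        fn = np.sum((y_hat == -1) & (y_test == 1))
        tn = np.sum((y_hat == -1) & (y_test == -1))
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    # Sort by FPR and approximate the area under the curve with the trapezoidal rule.
    order = np.argsort(fpr)
    auc = np.trapz(np.array(tpr)[order], np.array(fpr)[order])
    return np.array(fpr), np.array(tpr), auc
```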

Contents – Lecture 4

1. Summary of lecture 3
2. Generative models
3. Linear Discriminant Analysis (LDA)
4. Quadratic Discriminant Analysis (QDA)
5. A nonparametric classifier – k-Nearest Neighbors (kNN)

Generative models

A discriminative model describes how the output y is generated for a given input x, i.e. it models p(y | x).

A generative model describes how both the output y and the input x are generated, via the joint distribution p(y, x).

Predictions can be derived using the laws of probability:

\[
p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(y)\, p(x \mid y)}{\int p(y, x)\, \mathrm{d}y}.
\]

Generative models for classification

We have built classifiers based on models $g_m(x_\star)$ of the conditional probabilities p(y = m | x).

A generative model for classification can be expressed as

\[
p(y = m \mid x) = \frac{p(y = m)\, p(x \mid y = m)}{\sum_{j=1}^{M} p(y = j)\, p(x \mid y = j)}
\]

• p(y = m) denotes the marginal (prior) probability of class m ∈ {1,...,M}.
• p(x | y = m) denotes the conditional probability density of x for an observation from class m.
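In code, the relation above amounts to weighting each class density by its prior and normalizing. A minimal sketch (an assumption, not from the slides), where priors is a length-M array of p(y = m) and class_densities a list of functions evaluating p(x | y = m):

```python
import numpy as np

def class_posteriors(x, priors, class_densities):
    """p(y = m | x) for all m, from priors p(y = m) and class densities p(x | y = m)."""
    joint = np.array([priors[m] * class_densities[m](x) for m in range(len(priors))])
    return joint / joint.sum()   # normalize by the marginal p(x)
```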

Multivariate Gaussian density

The p-dimensional Gaussian (normal) probability density function with mean vector µ and covariance matrix Σ is
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu)\right),
\]

where µ is a p × 1 vector and Σ is a p × p positive definite matrix.

Let $x = (x_1, \dots, x_p)^{\mathsf{T}} \sim \mathcal{N}(\mu, \Sigma)$. Then:

• µj is the mean of xj,

• Σjj is the variance of xj,

• Σij (i ≠ j) is the covariance between xi and xj.
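As a sanity check of the formula (an added sketch, not part of the slides), it can be evaluated directly in NumPy and compared with scipy.stats.multivariate_normal, here using the covariance Σ = (1, 0.4; 0.4, 0.5) from the figure below:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) evaluated from the definition above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                 # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4], [0.4, 0.5]])
x = np.array([1.0, -0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))          # should agree
```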

Multivariate Gaussian density

[Figure: surface plots of two bivariate Gaussian PDFs over the (x1, x2)-plane, with covariances Σ = (1, 0; 0, 1) and Σ = (1, 0.4; 0.4, 0.5).]

Linear discriminant analysis

We need:
1. The prior class probabilities $\pi_m \triangleq p(y = m)$, m ∈ {1,...,M}.
2. The conditional probability densities of the input x, $f_m(x) \triangleq p(x \mid y = m)$, for each class m.

This gives us the model
\[
g_m(x) = \frac{\pi_m f_m(x)}{\sum_{j=1}^{M} \pi_j f_j(x)}.
\]

Linear discriminant analysis

For the first task, a natural estimator is the proportion of training samples in the m-th class,
\[
\hat{\pi}_m = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i = m\} = \frac{n_m}{n},
\]

where n is the size of the training set and n_m the number of training samples of class m.

Linear discriminant analysis

For the second task, a simple model is to assume that f_m(x) is a multivariate normal density with mean vector µ_m and covariance matrix Σ_m,
\[
f_m(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_m|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_m)^{\mathsf{T}} \Sigma_m^{-1} (x - \mu_m)\right).
\]
If we further assume that all classes share the same covariance matrix,

\[
\Sigma \overset{\text{def}}{=} \Sigma_1 = \dots = \Sigma_M,
\]

the remaining parameters of the model are

µ_1, µ_2, ..., µ_M, Σ.

Linear discriminant analysis

These parameters are naturally estimated as the (within class) sample means and sample covariance, respectively:

\[
\hat{\mu}_m = \frac{1}{n_m} \sum_{i : y_i = m} x_i, \quad m = 1, \dots, M,
\]
\[
\hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i : y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^{\mathsf{T}}.
\]

Modeling the class probabilities using the normal assumptions and these parameter estimates is referred to as Linear Discriminant Analysis (LDA).

Linear discriminant analysis, summary

The LDA classifier assigns a test input x to class m for which,

\[
\hat{\delta}_m(x) = x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_m - \frac{1}{2} \hat{\mu}_m^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_m + \log \hat{\pi}_m
\]
is largest, where
\[
\hat{\pi}_m = \frac{n_m}{n}, \quad
\hat{\mu}_m = \frac{1}{n_m} \sum_{i : y_i = m} x_i, \quad m = 1, \dots, M,
\]
\[
\hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i : y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^{\mathsf{T}}.
\]

LDA is a linear classifier (its decision boundaries are linear).
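A minimal NumPy sketch of these estimators and the resulting classifier (an illustration of the formulas above, not a reference implementation; X is an n × p array and y a NumPy array with labels 0, ..., M−1):

```python
import numpy as np

def lda_fit(X, y, M):
    """Estimate priors, class means and the shared covariance matrix."""
    n, p = X.shape
    pis = np.zeros(M)
    mus = np.zeros((M, p))
    Sigma = np.zeros((p, p))
    for m in range(M):
        Xm = X[y == m]
        pis[m] = len(Xm) / n                        # pi_hat_m = n_m / n
        mus[m] = Xm.mean(axis=0)                    # within-class sample mean
        Sigma += (Xm - mus[m]).T @ (Xm - mus[m])    # pooled scatter
    Sigma /= (n - M)                                # shared covariance estimate
    return pis, mus, Sigma

def lda_predict(X, pis, mus, Sigma):
    """Assign each row of X to the class with the largest discriminant delta_m(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_m(x) = x^T Sigma^{-1} mu_m - 0.5 mu_m^T Sigma^{-1} mu_m + log pi_m
    deltas = X @ Sigma_inv @ mus.T - 0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(pis)
    return np.argmax(deltas, axis=1)
```

In practice, scikit-learn's LinearDiscriminantAnalysis offers an off-the-shelf alternative to such a hand-rolled sketch.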

ex) LDA decision boundary

Illustration of LDA decision boundary – the level curves of two Gaussian PDFs with the same covariance matrix intersect along a straight line, $\hat{\delta}_1(x) = \hat{\delta}_2(x)$.

[Figure: level curves of two Gaussian PDFs in the (x1, x2)-plane, separated by a linear decision boundary.]

ex) Simple spam filter

We will use LDA to construct a simple spam filter:
• Output: y ∈ {spam, ham}
• Input: x = 57-dimensional vector of “features” extracted from the e-mail:
  • Frequencies of 48 predefined words (make, address, all, ...)
  • Frequencies of 6 predefined characters (;, (, [, !, $, #)
  • Average length of uninterrupted sequences of capital letters
  • Length of longest uninterrupted sequence of capital letters
  • Total number of capital letters in the e-mail
• Dataset consists of 4,601 e-mails classified as either spam or ham (split into 75% training and 25% testing)

UCI Repository — Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/Spambase)

ex) Simple spam filter

LDA confusion matrix (test data):

Table: Threshold r = 0.5

          ŷ = 0    ŷ = 1
y = 0      667       23
y = 1      102      358

Table: Threshold r = 0.9

          ŷ = 0    ŷ = 1
y = 0      687        3
y = 1      237      223

ex) Simple spam filter

A linear classifier uses a linear decision boundary to separate the two classes as well as possible. In $\mathbb{R}^p$ this is a (p − 1)-dimensional hyperplane. How can we visualize the classifier? One way: project the (labeled) test inputs onto the normal of the decision boundary, as in the sketch and histogram below.
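For two-class LDA, the difference $\hat{\delta}_1(x) - \hat{\delta}_0(x)$ is linear in x with coefficient vector $\hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, so that vector is normal to the decision boundary. A hypothetical sketch of the projection, reusing the lda_fit output from the earlier sketch (not necessarily how the histogram on the slide was produced):

```python
import numpy as np

def project_on_normal(X, mus, Sigma):
    """Project inputs onto the normal of the two-class LDA decision boundary."""
    w = np.linalg.solve(Sigma, mus[1] - mus[0])   # Sigma^{-1} (mu_1 - mu_0)
    w = w / np.linalg.norm(w)                     # unit-length normal vector
    return X @ w                                  # scalar coordinate along the normal
```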

[Figure: histograms of the test inputs projected onto the normal of the decision boundary, shown separately for Y = 0 and Y = 1.]

Gmail’s spam filter

Official Gmail blog July 9, 2015: “. . . the spam filter now uses an artificial neural network to detect and block the especially sneaky spam—the kind that could actually pass for wanted mail.” — Sri Harsha Somanchi, Product Manager

Official Gmail blog February 6, 2019: “With TensorFlow, we are now blocking around 100 million additional spam messages every day” — Neil Kumaran, Product Manager, Gmail Security

Quadratic discriminant analysis

Do we have to assume a common covariance matrix?

No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).

Whether we should choose LDA or QDA has to do with the bias-variance trade-off, i.e. the risk of over- or underfitting.

Compared to LDA, QDA. . .
• has more parameters
• is more flexible (lower bias)
• has higher risk of overfitting (larger variance)
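The slides do not spell out the QDA discriminant, but under the Gaussian model with class-specific covariances the standard quadratic discriminant is $\hat{\delta}_m(x) = -\tfrac{1}{2}\log|\hat{\Sigma}_m| - \tfrac{1}{2}(x - \hat{\mu}_m)^{\mathsf{T}} \hat{\Sigma}_m^{-1}(x - \hat{\mu}_m) + \log\hat{\pi}_m$. A sketch of this (an illustration only, using the per-class sample covariance as a simple estimator):

```python
import numpy as np

def qda_fit(X, y, M):
    """Like lda_fit above, but with one covariance matrix per class."""
    n, p = X.shape
    pis, mus, Sigmas = np.zeros(M), np.zeros((M, p)), np.zeros((M, p, p))
    for m in range(M):
        Xm = X[y == m]
        pis[m] = len(Xm) / n
        mus[m] = Xm.mean(axis=0)
        Sigmas[m] = np.cov(Xm, rowvar=False)   # class-specific sample covariance (n_m - 1 in the denominator)
    return pis, mus, Sigmas

def qda_predict(X, pis, mus, Sigmas):
    """Assign each row of X to the class with the largest quadratic discriminant."""
    deltas = []
    for m in range(len(pis)):
        diff = X - mus[m]
        quad = np.sum(diff @ np.linalg.inv(Sigmas[m]) * diff, axis=1)
        deltas.append(-0.5 * np.log(np.linalg.det(Sigmas[m])) - 0.5 * quad + np.log(pis[m]))
    return np.argmax(np.column_stack(deltas), axis=1)
```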

ex) QDA decision boundary

Illustration of QDA decision boundary – the level curves of two Gaussian PDFs with different covariance matrices intersect along a curved (quadratic) line. Thus, QDA is not a linear classifier.

[Figure: level curves of two Gaussian PDFs with different covariances in the (x1, x2)-plane, separated by a curved decision boundary.]

Parametric and nonparametric models

So far we have looked at a few parametric models,
• linear regression,
• logistic regression,
• LDA and QDA,
all of which are parametrized by a fixed-dimensional parameter.

Non-parametric models instead allow the flexibility of the model to grow with the amount of available data.

+ can be very flexible (= low bias)
− can suffer from overfitting (high variance)
− can be compute- and memory-intensive

As always, the bias-variance trade-off is key also when working with non-parametric models.

k-NN

The k-nearest neighbors (k-NN) classifier is a simple non-parametric method.

Given training data $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$, for a test input $x_\star$:
1. Identify the set

\[
R_\star = \{\, i : x_i \text{ is one of the } k \text{ training data points closest to } x_\star \,\}
\]

2. Classify $x_\star$ according to a majority vote within the neighborhood $R_\star$, i.e. assign $x_\star$ to class m for which
\[
\sum_{i \in R_\star} \mathbb{I}\{y_i = m\}
\]
is largest.
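A direct translation of these two steps into NumPy (a sketch; Euclidean distance is assumed and ties in the vote are broken arbitrarily):

```python
import numpy as np

def knn_classify(x_star, X_train, y_train, k):
    """Majority vote among the k training points closest to x_star (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_star, axis=1)
    neighbors = np.argsort(dists)[:k]                  # indices of the k nearest points
    votes = y_train[neighbors]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```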

ex) k-NN on a toy model

We illustrate the k-NN classifier on a synthetic example where the optimal classifier is known (colored regions).

[Figure: training data from three classes (Y = 1, Y = 2, Y = 3) in the (X1, X2)-plane, overlaid on the optimal decision regions.]

ex) k-NN on a toy model

The decision boundaries for the k-NN classifier are shown as black lines.

[Figure: k-NN decision boundaries for k = 1, k = 10 and k = 100 on the toy data, in the (X1, X2)-plane.]

ex) k-NN on a toy model

[Figure: k-NN decision boundaries for k = 1, k = 10 and k = 100 (same panels as above).]

The choice of the tuning parameter k controls the model flexibility:
• Small k ⇒ small bias, large variance
• Large k ⇒ large bias, small variance
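One common way to navigate this trade-off (not covered on the slide itself) is to evaluate a grid of candidate k values on held-out data and pick the one with the smallest misclassification error. A sketch reusing the knn_classify function above, with a hypothetical validation split X_val, y_val:

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, k_values=(1, 5, 10, 50, 100)):
    """Pick the k with the smallest misclassification error on a validation set."""
    errors = []
    for k in k_values:
        y_hat = np.array([knn_classify(x, X_train, y_train, k) for x in X_val])
        errors.append(np.mean(y_hat != y_val))
    return k_values[int(np.argmin(errors))]
```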

A few concepts to summarize lecture 4

Generative model: A model that describes both the output y and the input x.

Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input x is multivariate Gaussian for each class, with different means but the same covariance. LDA is a linear classifier.

Quadratic discriminant analysis (QDA): Same as LDA, but where a different covariance matrix is used for each class. QDA is not a linear classifier.

Non-parametric model: A model where the flexibility is allowed to grow with the amount of available training data.

k-NN classifier: A simple non-parametric classifier based on classifying a test input x according to the class labels of the k training samples nearest to x.
