
Lecture 4 – Discriminant Analysis, k-Nearest Neighbors

Johan Wågberg
Division of Systems and Control, Department of Information Technology, Uppsala University
Email: [email protected]

The mini project has now started!
• Task: Classify songs according to whether Andreas likes them or not (based on features from the Spotify Web-API).
• Training data: 750 songs labeled as like or dislike.
• Test data: 200 unlabeled songs. Upload your predictions to the leaderboard!
• Examination: A written report (max 6 pages).
• Instructions and data are available at the course homepage http://www.it.uu.se/edu/course/homepage/sml/project/
• You will receive your group password per e-mail tomorrow.
• Deadline for submitting your final solution: February 21.

Summary of Lecture 3 (I/VI)

The classification problem amounts to modeling the relationship between the input x and a categorical output y, i.e., the output belongs to one out of M distinct classes.

A classifier is a prediction model ŷ(x) that maps any input x into a predicted class ŷ ∈ {1, ..., M}.

A common classifier predicts each input as belonging to the most likely class according to the conditional probabilities p(y = m | x) for m ∈ {1, ..., M}.

Summary of Lecture 3 (II/VI)

For binary (two-class) classification, y ∈ {−1, 1}, the logistic regression model is

    p(y = 1 \mid x) \approx g(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}.

The model parameters θ are found by maximum likelihood, i.e., by (numerically) maximizing the log-likelihood function

    \log \ell(\theta) = -\sum_{i=1}^{n} \log\left(1 + e^{-y_i \theta^T x_i}\right).

Summary of Lecture 3 (III/VI)

Using the common classifier, we get the prediction model

    \hat{y}(x) = \begin{cases} 1 & \text{if } g(x) > 0.5 \iff \hat{\theta}^T x > 0, \\ -1 & \text{otherwise.} \end{cases}

This attempts to minimize the total misclassification error. More generally, we can use

    \hat{y}(x) = \begin{cases} 1 & \text{if } g(x) > r, \\ -1 & \text{otherwise,} \end{cases}

where 0 ≤ r ≤ 1 is a user-chosen threshold.

Summary of Lecture 3 (IV/VI)

Confusion matrix:

                           Predicted condition
                        ŷ = 0     ŷ = 1     Total
    True      y = 0      TN        FP         N
    condition y = 1      FN        TP         P
              Total      N*        P*

For the classifier ŷ(x) = 1 if g(x) > r, and −1 otherwise, the numbers in the confusion matrix will depend on the threshold r:
• Decreasing r ⇒ TN, FN decrease and FP, TP increase.
• Increasing r ⇒ TN, FN increase and FP, TP decrease.

Summary of Lecture 3 (V/VI)

Some commonly used performance measures are:
• True positive rate: TPR = TP / P = TP / (FN + TP) ∈ [0, 1]
• False positive rate: FPR = FP / N = FP / (FP + TN) ∈ [0, 1]
• Precision: Prec = TP / P* = TP / (FP + TP) ∈ [0, 1]

Summary of Lecture 3 (VI/VI)

[Figure: ROC curve for logistic regression – true positive rate plotted against false positive rate.]

• ROC curve: a plot of TPR vs. FPR as the threshold r ranges from 0 to 1. (It is possible to draw ROC curves based on other tuning parameters as well.)
• Area Under Curve (AUC): a condensed performance measure for the classifier, taking all possible thresholds into account.

Contents – Lecture 4

1. Summary of lecture 3
2. Generative models
3. Linear Discriminant Analysis (LDA)
4. Quadratic Discriminant Analysis (QDA)
5. A nonparametric classifier – k-Nearest Neighbors (kNN)

Generative models

A discriminative model describes how the output y is generated given the input x, i.e., it models p(y | x). A generative model describes how both the output y and the input x are generated, via the joint distribution p(y, x).

Predictions can be derived using the laws of probability:

    p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(y)\, p(x \mid y)}{\sum_{y'} p(y')\, p(x \mid y')}.
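To make the prediction formula above concrete, here is a minimal numerical sketch (not from the slides) of how a generative model turns priors p(y) and class-conditional likelihoods p(x | y) into posterior class probabilities. The numbers are made up purely for illustration.

    import numpy as np

    # Hypothetical toy numbers: two classes with prior probabilities p(y = m)
    prior = np.array([0.7, 0.3])

    # Class-conditional likelihoods p(x | y = m), evaluated at one observed input x
    likelihood = np.array([0.02, 0.10])

    # Bayes' theorem: p(y = m | x) = p(y = m) p(x | y = m) / sum_j p(y = j) p(x | y = j)
    joint = prior * likelihood
    posterior = joint / joint.sum()

    print(posterior)            # [0.3182 0.6818] (approximately)
    print(posterior.argmax())   # index of the most probable class -> 1

The same computation is what LDA and QDA perform below, with the likelihoods given by fitted Gaussian densities.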
Generative models for classification

We have built classifiers based on models g_m(x⋆) of the conditional probabilities p(y = m | x). A generative model for classification can be expressed as

    p(y = m \mid x) = \frac{p(y = m)\, p(x \mid y = m)}{\sum_{j=1}^{M} p(y = j)\, p(x \mid y = j)},

where
• p(y = m) denotes the marginal (prior) probability of class m ∈ {1, ..., M},
• p(x | y = m) denotes the conditional probability density of x for an observation from class m.

Multivariate Gaussian density

The p-dimensional Gaussian (normal) probability density function with mean vector μ and covariance matrix Σ is

    \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right),

where μ is a p × 1 vector and Σ is a p × p positive definite matrix.

Let x = (x_1, ..., x_p)^T ∼ N(μ, Σ). Then
• μ_j is the mean of x_j,
• Σ_jj is the variance of x_j,
• Σ_ij (i ≠ j) is the covariance between x_i and x_j.

[Figure: surface plots of two bivariate Gaussian PDFs, one with covariance matrix Σ = (1 0; 0 1) and one with Σ = (1 0.4; 0.4 0.5).]

Linear discriminant analysis

We need:
1. The prior class probabilities π_m ≜ p(y = m), m ∈ {1, ..., M}.
2. The conditional probability densities of the input x, f_m(x) ≜ p(x | y = m), for each class m.

This gives us the model

    g_m(x) = \frac{\pi_m f_m(x)}{\sum_{j=1}^{M} \pi_j f_j(x)}.

For the first task, a natural estimator is the proportion of training samples in the m-th class,

    \hat{\pi}_m = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i = m\} = \frac{n_m}{n},

where n is the size of the training set and n_m is the number of training samples of class m.

For the second task, a simple model is to assume that f_m(x) is a multivariate normal density with mean vector μ_m and covariance matrix Σ_m,

    f_m(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_m|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m)\right).

If we further assume that all classes share the same covariance matrix, Σ ≜ Σ_1 = ··· = Σ_M, the remaining parameters of the model are μ_1, μ_2, ..., μ_M and Σ.

These parameters are naturally estimated as the (within-class) sample means and the pooled sample covariance, respectively:

    \hat{\mu}_m = \frac{1}{n_m} \sum_{i: y_i = m} x_i, \quad m = 1, \ldots, M,

    \hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i: y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^T.

Modeling the class probabilities using the normal assumption and these parameter estimates is referred to as Linear Discriminant Analysis (LDA).

Linear discriminant analysis, summary

The LDA classifier assigns a test input x to the class m for which

    \hat{\delta}_m(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_m - \frac{1}{2} \hat{\mu}_m^T \hat{\Sigma}^{-1} \hat{\mu}_m + \log \hat{\pi}_m

is largest, where

    \hat{\pi}_m = \frac{n_m}{n}, \quad m = 1, \ldots, M,

    \hat{\mu}_m = \frac{1}{n_m} \sum_{i: y_i = m} x_i, \quad m = 1, \ldots, M,

    \hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i: y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^T.

LDA is a linear classifier: its decision boundaries are linear.

ex) LDA decision boundary

[Figure: illustration of the LDA decision boundary in the (x_1, x_2) plane – the level curves of two Gaussian PDFs with the same covariance matrix intersect along the straight line where δ̂_1(x) = δ̂_2(x).]
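As a complement to the formulas above, here is a minimal sketch of LDA parameter estimation and classification in Python with NumPy. The function names lda_fit and lda_predict and the variables X, y are my own; a mature implementation would instead use, e.g., sklearn.discriminant_analysis.LinearDiscriminantAnalysis.

    import numpy as np

    def lda_fit(X, y):
        """Estimate the LDA parameters: priors pi_m, class means mu_m and the pooled Sigma."""
        classes = np.unique(y)
        n, p = X.shape
        M = len(classes)
        pi = np.array([np.mean(y == m) for m in classes])         # n_m / n
        mu = np.array([X[y == m].mean(axis=0) for m in classes])  # within-class sample means
        Sigma = np.zeros((p, p))
        for k, m in enumerate(classes):
            R = X[y == m] - mu[k]
            Sigma += R.T @ R                                      # within-class scatter
        Sigma /= (n - M)                                          # pooled covariance estimate
        return classes, pi, mu, Sigma

    def lda_predict(X, classes, pi, mu, Sigma):
        """Assign each row of X to the class with the largest discriminant delta_m(x)."""
        Sigma_inv = np.linalg.inv(Sigma)
        # delta_m(x) = x^T Sigma^-1 mu_m - 0.5 mu_m^T Sigma^-1 mu_m + log pi_m
        delta = X @ Sigma_inv @ mu.T \
                - 0.5 * np.sum(mu @ Sigma_inv * mu, axis=1) \
                + np.log(pi)
        return classes[np.argmax(delta, axis=1)]

    # Usage on hypothetical training/test arrays:
    # params = lda_fit(X_train, y_train)
    # y_hat = lda_predict(X_test, *params)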
ex) Simple spam filter

We will use LDA to construct a simple spam filter:
• Output: y ∈ {spam, ham}
• Input: x = 57-dimensional vector of "features" extracted from the e-mail:
  • Frequencies of 48 predefined words (make, address, all, ...)
  • Frequencies of 6 predefined characters (;, (, [, !, $, #)
  • Average length of uninterrupted sequences of capital letters
  • Length of the longest uninterrupted sequence of capital letters
  • Total number of capital letters in the e-mail
• The dataset consists of 4,601 e-mails classified as either spam or ham (split into 75% training and 25% testing).

UCI Machine Learning Repository – Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/Spambase)

ex) Simple spam filter

LDA confusion matrices (test data):

Threshold r = 0.5:
             ŷ = 0    ŷ = 1
    y = 0      667       23
    y = 1      102      358

Threshold r = 0.9:
             ŷ = 0    ŷ = 1
    y = 0      687        3
    y = 1      237      223

ex) Simple spam filter

A linear classifier uses a linear decision boundary to separate the two classes as well as possible. In R^p this boundary is a (p − 1)-dimensional hyperplane.

How can we visualize the classifier? One way: project the (labeled) test inputs onto the normal of the decision boundary.

[Figure: "Projection on normal" – histograms of the test inputs projected onto the normal of the decision boundary, shown separately for the two classes y = 0 and y = 1.]

Gmail's spam filter

Official Gmail blog, July 9, 2015:
"... the spam filter now uses an artificial neural network to detect and block the especially sneaky spam—the kind that could actually pass for wanted mail."
— Sri Harsha Somanchi, Product Manager

Official Gmail blog, February 6, 2019:
"With TensorFlow, we are now blocking around 100 million additional spam messages every day."
— Neil Kumaran, Product Manager, Gmail Security

Quadratic discriminant analysis

Do we have to assume a common covariance matrix? No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).
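To show how QDA differs from LDA in code, here is a minimal sketch that estimates a separate covariance matrix per class. The discriminant expression used below is the standard textbook form of the Gaussian log-posterior, assumed here rather than quoted from the slides, and the names qda_fit and qda_predict are my own.

    import numpy as np

    def qda_fit(X, y):
        """Estimate per-class priors, means and covariance matrices (one Sigma_m per class)."""
        classes = np.unique(y)
        pi = np.array([np.mean(y == m) for m in classes])
        mu = np.array([X[y == m].mean(axis=0) for m in classes])
        # np.cov normalizes by n_m - 1 within each class
        Sigma = np.array([np.cov(X[y == m], rowvar=False) for m in classes])
        return classes, pi, mu, Sigma

    def qda_predict(X, classes, pi, mu, Sigma):
        """Assign each input to the class maximizing the quadratic discriminant
        delta_m(x) = -0.5 log|Sigma_m| - 0.5 (x - mu_m)^T Sigma_m^{-1} (x - mu_m) + log pi_m."""
        deltas = []
        for k in range(len(classes)):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            quad = np.sum(diff @ np.linalg.inv(Sigma[k]) * diff, axis=1)
            deltas.append(-0.5 * logdet - 0.5 * quad + np.log(pi[k]))
        return classes[np.argmax(np.column_stack(deltas), axis=1)]

Because each class keeps its own Σ_m, the quadratic term in x no longer cancels between classes, so the decision boundaries become quadratic rather than linear, which is what gives QDA its name.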