Lecture 4 – Discriminant Analysis, k-Nearest Neighbors

Johan Wågberg, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: [email protected]

The mini project has now started!

Task: Classify songs according to whether Andreas likes them or not (based on features from the Spotify Web-API).

Training data: 750 songs labeled as like or dislike.

Test data: 200 unlabeled songs. Upload your predictions to the leaderboard!

Examination: A written report (max 6 pages).

Instructions and data are available at the course homepage: http://www.it.uu.se/edu/course/homepage/sml/project/

You will receive your group password by e-mail tomorrow.

Deadline for submitting your final solution: February 21.

Summary of Lecture 3 (I/VI)

The classification problem amounts to modeling the relationship between the input x and a categorical output y, i.e., the output belongs to one out of M distinct classes.

A classifier is a prediction model $\hat{y}(x)$ that maps any input x to a predicted class $\hat{y} \in \{1, \dots, M\}$. A common classifier predicts each input as belonging to the most likely class according to the conditional probabilities

p(y = m | x) for m ∈ {1,...,M}.

Summary of Lecture 3 (II/VI)

For binary (two-class) classification, y ∈ {−1, 1}, the model is

\[
p(y = 1 \mid x) \approx g(x) = \frac{e^{\theta^{\mathsf{T}} x}}{1 + e^{\theta^{\mathsf{T}} x}}.
\]
The model parameters θ are found by maximum likelihood, by (numerically) maximizing the log-likelihood function,

\[
\log \ell(\theta) = -\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \theta^{\mathsf{T}} x_i}\right).
\]
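As a minimal illustration (an added sketch, not part of the lecture material), the negative of this log-likelihood can be minimized numerically, e.g. with scipy.optimize.minimize; X (n × p) and y (labels in {−1, +1}) denote a hypothetical training set:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    # -log l(theta) = sum_i log(1 + exp(-y_i * theta^T x_i)), computed stably
    margins = y * (X @ theta)
    return np.sum(np.logaddexp(0.0, -margins))

def fit_logistic_regression(X, y):
    """Numerically maximize the log-likelihood (X: n-by-p array, y in {-1, +1})."""
    theta0 = np.zeros(X.shape[1])
    result = minimize(neg_log_likelihood, theta0, args=(X, y), method="BFGS")
    return result.x
```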

Summary of Lecture 3 (III/VI)

Using the common classifier, we get the prediction model

 T 1 if g(x) > 0.5 ⇔ θb x > 0 yb(x) = −1 otherwise

This attempts to minimize the total misclassification error.

More generally, we can use

\[
\hat{y}(x) =
\begin{cases}
1 & \text{if } g(x) > r, \\
-1 & \text{otherwise,}
\end{cases}
\]

where 0 ≤ r ≤ 1 is a user-chosen threshold.

Summary of Lecture 3 (IV/VI)

Confusion matrix:

                          Predicted condition
                      ŷ = 0      ŷ = 1      Total
True        y = 0      TN         FP          N
condition   y = 1      FN         TP          P
            Total      N*         P*

For the classifier
\[
\hat{y}(x) =
\begin{cases}
1 & \text{if } g(x) > r, \\
-1 & \text{otherwise,}
\end{cases}
\]
the numbers in the confusion matrix will depend on the threshold r.
• Decreasing r ⇒ TN, FN decrease and FP, TP increase.
• Increasing r ⇒ TN, FN increase and FP, TP decrease.

Summary of Lecture 3 (V/VI)

Some commonly used performance measures are:

• True positive rate: TPR = TP/P = TP/(FN + TP) ∈ [0, 1]

• False positive rate: FPR = FP/N = FP/(FP + TN) ∈ [0, 1]

• Precision: Prec = TP/P* = TP/(FP + TP) ∈ [0, 1]
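For concreteness, a small helper (an added sketch, not from the slides) that computes these three measures from the confusion-matrix counts TN, FP, FN, TP:

```python
def performance_measures(TN, FP, FN, TP):
    """True positive rate, false positive rate and precision from confusion-matrix counts."""
    tpr = TP / (FN + TP)    # TP / P
    fpr = FP / (FP + TN)    # FP / N
    prec = TP / (FP + TP)   # TP / P*
    return tpr, fpr, prec
```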

Summary of Lecture 3 (VI/VI)

[Figure: ROC curve for logistic regression – true positive rate vs. false positive rate.]

• ROC: plot of TPR vs. FPR as r ranges from 0 to 1.¹
• Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account.

¹ Possible to draw ROC curves based on other tuning parameters as well.
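A rough sketch of how such a curve can be traced out (an illustration under assumed inputs, not the lecture's own code): sweep the threshold r over [0, 1], classify according to g(x) > r, record (FPR, TPR), and approximate the AUC with the trapezoidal rule. Here g_test holds hypothetical predicted probabilities and y_test the corresponding {−1, +1} labels.

```python
import numpy as np

def roc_curve_points(g_test, y_test, thresholds=np.linspace(0.0, 1.0, 101)):
    """Trace out (FPR, TPR) pairs for a grid of thresholds r (assumes both classes occur in y_test)."""
    fpr, tpr = [], []
    for r in thresholds:
        y_hat = np.where(g_test > r, 1, -1)
        tp = np.sum((y_hat == 1) & (y_test == 1))
        fp = np.sum((y_hat == 1) & (y_test == -1))
        fn = np.sum((y_hat == -1) & (y_test == 1))
        tn = np.sum((y_hat == -1) & (y_test == -1))
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    # Sort by FPR and approximate the area under the curve with the trapezoidal rule.
    order = np.argsort(fpr)
    auc = np.trapz(np.array(tpr)[order], np.array(fpr)[order])
    return np.array(fpr), np.array(tpr), auc
```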

Contents – Lecture 4

1. Summary of lecture 3
2. Generative models
3. Linear Discriminant Analysis (LDA)
4. Quadratic Discriminant Analysis (QDA)
5. A nonparametric classifier – k-Nearest Neighbors (kNN)

Generative models

A discriminative model describes how the output y is generated for a given input x, i.e. it models p(y | x).

A generative model describes how both the output y and the input x are generated, via the joint distribution p(y, x).

Predictions can be derived using the laws of probability:

\[
p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(y)\, p(x \mid y)}{\int p(y, x)\, \mathrm{d}y}.
\]

Generative models for classification

We have built classifiers based on models $g_m(x_\star)$ of the conditional probabilities p(y = m | x).

A generative model for classification can be expressed as

\[
p(y = m \mid x) = \frac{p(y = m)\, p(x \mid y = m)}{\sum_{j=1}^{M} p(y = j)\, p(x \mid y = j)}
\]

• p(y = m) denotes the marginal (prior) probability of class m ∈ {1,...,M}.
• p(x | y = m) denotes the conditional probability density of x for an observation from class m.
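In code, the relation above amounts to weighting each class density by its prior and normalizing. A minimal sketch (an assumption, not from the slides), where priors is a length-M array of p(y = m) and class_densities a list of functions evaluating p(x | y = m):

```python
import numpy as np

def class_posteriors(x, priors, class_densities):
    """p(y = m | x) for all m, from priors p(y = m) and class densities p(x | y = m)."""
    joint = np.array([priors[m] * class_densities[m](x) for m in range(len(priors))])
    return joint / joint.sum()   # normalize by the marginal p(x)
```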

Multivariate Gaussian density

The p-dimensional Gaussian (normal) probability density function with mean vector µ and covariance matrix Σ is
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu)\right),
\]

where µ is a p × 1 vector and Σ is a p × p positive definite matrix.

Let $x = (x_1, \dots, x_p)^{\mathsf{T}} \sim \mathcal{N}(\mu, \Sigma)$. Then:

• µj is the mean of xj,

• Σjj is the variance of xj,

• Σij (i ≠ j) is the covariance between xi and xj.
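As a sanity check of the formula (an added sketch, not part of the slides), it can be evaluated directly in NumPy and compared with scipy.stats.multivariate_normal, here using the covariance Σ = (1, 0.4; 0.4, 0.5) from the figure below:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) evaluated from the definition above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                 # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4], [0.4, 0.5]])
x = np.array([1.0, -0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))          # should agree
```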

Multivariate Gaussian density

[Figure: surface plots of two bivariate Gaussian PDFs over the (x1, x2)-plane, with covariances Σ = (1, 0; 0, 1) and Σ = (1, 0.4; 0.4, 0.5).]

Linear discriminant analysis

We need:
1. The prior class probabilities $\pi_m \triangleq p(y = m)$, m ∈ {1,...,M}.
2. The conditional probability densities of the input x, $f_m(x) \triangleq p(x \mid y = m)$, for each class m.

This gives us the model
\[
g_m(x) = \frac{\pi_m f_m(x)}{\sum_{j=1}^{M} \pi_j f_j(x)}.
\]

Linear discriminant analysis

For the first task, a natural estimator is the proportion of training samples in the m-th class,
\[
\hat{\pi}_m = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i = m\} = \frac{n_m}{n},
\]

where n is the size of the training set and n_m the number of training samples of class m.

Linear discriminant analysis

For the second task, a simple model is to assume that f_m(x) is a multivariate normal density with mean vector µ_m and covariance matrix Σ_m,
\[
f_m(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_m|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_m)^{\mathsf{T}} \Sigma_m^{-1} (x - \mu_m)\right).
\]
If we further assume that all classes share the same covariance matrix,

\[
\Sigma \overset{\text{def}}{=} \Sigma_1 = \dots = \Sigma_M,
\]

the remaining parameters of the model are

µ_1, µ_2, ..., µ_M, Σ.

Linear discriminant analysis

These parameters are naturally estimated as the (within class) sample means and sample covariance, respectively:

\[
\hat{\mu}_m = \frac{1}{n_m} \sum_{i : y_i = m} x_i, \quad m = 1, \dots, M,
\]
\[
\hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i : y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^{\mathsf{T}}.
\]

Modeling the class probabilities using the normal assumptions and these parameter estimates is referred to as Linear Discriminant Analysis (LDA).

Linear discriminant analysis, summary

The LDA classifier assigns a test input x to class m for which,

\[
\hat{\delta}_m(x) = x^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_m - \frac{1}{2} \hat{\mu}_m^{\mathsf{T}} \hat{\Sigma}^{-1} \hat{\mu}_m + \log \hat{\pi}_m
\]
is largest, where
\[
\hat{\pi}_m = \frac{n_m}{n}, \quad
\hat{\mu}_m = \frac{1}{n_m} \sum_{i : y_i = m} x_i, \quad m = 1, \dots, M,
\]
\[
\hat{\Sigma} = \frac{1}{n - M} \sum_{m=1}^{M} \sum_{i : y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^{\mathsf{T}}.
\]

LDA is a linear classifier (its decision boundaries are linear).
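A minimal NumPy sketch of these estimators and the resulting classifier (an illustration of the formulas above, not a reference implementation; X is an n × p array and y a NumPy array with labels 0, ..., M−1):

```python
import numpy as np

def lda_fit(X, y, M):
    """Estimate priors, class means and the shared covariance matrix."""
    n, p = X.shape
    pis = np.zeros(M)
    mus = np.zeros((M, p))
    Sigma = np.zeros((p, p))
    for m in range(M):
        Xm = X[y == m]
        pis[m] = len(Xm) / n                        # pi_hat_m = n_m / n
        mus[m] = Xm.mean(axis=0)                    # within-class sample mean
        Sigma += (Xm - mus[m]).T @ (Xm - mus[m])    # pooled scatter
    Sigma /= (n - M)                                # shared covariance estimate
    return pis, mus, Sigma

def lda_predict(X, pis, mus, Sigma):
    """Assign each row of X to the class with the largest discriminant delta_m(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_m(x) = x^T Sigma^{-1} mu_m - 0.5 mu_m^T Sigma^{-1} mu_m + log pi_m
    deltas = X @ Sigma_inv @ mus.T - 0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(pis)
    return np.argmax(deltas, axis=1)
```

In practice, scikit-learn's LinearDiscriminantAnalysis offers an off-the-shelf alternative to such a hand-rolled sketch.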

ex) LDA decision boundary

Illustration of LDA decision boundary – the level curves of two Gaussian PDFs with the same covariance matrix intersect along a straight line, $\hat{\delta}_1(x) = \hat{\delta}_2(x)$.

[Figure: level curves of two Gaussian PDFs in the (x1, x2)-plane, separated by a linear decision boundary.]

ex) Simple spam filter

We will use LDA to construct a simple spam filter:
• Output: y ∈ {spam, ham}
• Input: x = 57-dimensional vector of “features” extracted from the e-mail:
  • Frequencies of 48 predefined words (make, address, all, ...)
  • Frequencies of 6 predefined characters (;, (, [, !, $, #)
  • Average length of uninterrupted sequences of capital letters
  • Length of longest uninterrupted sequence of capital letters
  • Total number of capital letters in the e-mail
• Dataset consists of 4,601 e-mails classified as either spam or ham (split into 75% training and 25% testing)

UCI Repository — Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/Spambase)

ex) Simple spam filter

LDA confusion matrix (test data):

Table: Threshold r = 0.5

          ŷ = 0    ŷ = 1
y = 0      667       23
y = 1      102      358

Table: Threshold r = 0.9

          ŷ = 0    ŷ = 1
y = 0      687        3
y = 1      237      223

ex) Simple spam filter

A linear classifier uses a linear decision boundary to separate the two classes as well as possible. In $\mathbb{R}^p$ this is a (p − 1)-dimensional hyperplane. How can we visualize the classifier? One way: project the (labeled) test inputs onto the normal of the decision boundary, as in the sketch and histogram below.
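For two-class LDA, the difference $\hat{\delta}_1(x) - \hat{\delta}_0(x)$ is linear in x with coefficient vector $\hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, so that vector is normal to the decision boundary. A hypothetical sketch of the projection, reusing the lda_fit output from the earlier sketch (not necessarily how the histogram on the slide was produced):

```python
import numpy as np

def project_on_normal(X, mus, Sigma):
    """Project inputs onto the normal of the two-class LDA decision boundary."""
    w = np.linalg.solve(Sigma, mus[1] - mus[0])   # Sigma^{-1} (mu_1 - mu_0)
    w = w / np.linalg.norm(w)                     # unit-length normal vector
    return X @ w                                  # scalar coordinate along the normal
```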

[Figure: histograms of the test inputs projected onto the normal of the decision boundary, shown separately for Y = 0 and Y = 1.]

Gmail’s spam filter

Official Gmail blog July 9, 2015: “. . . the spam filter now uses an artificial neural network to detect and block the especially sneaky spam—the kind that could actually pass for wanted mail.” — Sri Harsha Somanchi, Product Manager

Official Gmail blog February 6, 2019: “With TensorFlow, we are now blocking around 100 million additional spam messages every day” — Neil Kumaran, Product Manager, Gmail Security

Quadratic discriminant analysis

Do we have to assume a common covariance matrix?

No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).

Whether we should choose LDA or QDA has to do with the bias-variance trade-off, i.e. the risk of over- or underfitting.

Compared to LDA, QDA. . .
• has more parameters
• is more flexible (lower bias)
• has higher risk of overfitting (larger variance)
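The slides do not spell out the QDA discriminant, but under the Gaussian model with class-specific covariances the standard quadratic discriminant is $\hat{\delta}_m(x) = -\tfrac{1}{2}\log|\hat{\Sigma}_m| - \tfrac{1}{2}(x - \hat{\mu}_m)^{\mathsf{T}} \hat{\Sigma}_m^{-1}(x - \hat{\mu}_m) + \log\hat{\pi}_m$. A sketch of this (an illustration only, using the per-class sample covariance as a simple estimator):

```python
import numpy as np

def qda_fit(X, y, M):
    """Like lda_fit above, but with one covariance matrix per class."""
    n, p = X.shape
    pis, mus, Sigmas = np.zeros(M), np.zeros((M, p)), np.zeros((M, p, p))
    for m in range(M):
        Xm = X[y == m]
        pis[m] = len(Xm) / n
        mus[m] = Xm.mean(axis=0)
        Sigmas[m] = np.cov(Xm, rowvar=False)   # class-specific sample covariance (n_m - 1 in the denominator)
    return pis, mus, Sigmas

def qda_predict(X, pis, mus, Sigmas):
    """Assign each row of X to the class with the largest quadratic discriminant."""
    deltas = []
    for m in range(len(pis)):
        diff = X - mus[m]
        quad = np.sum(diff @ np.linalg.inv(Sigmas[m]) * diff, axis=1)
        deltas.append(-0.5 * np.log(np.linalg.det(Sigmas[m])) - 0.5 * quad + np.log(pis[m]))
    return np.argmax(np.column_stack(deltas), axis=1)
```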

ex) QDA decision boundary

Illustration of QDA decision boundary – the level curves of two Gaussian PDFs with different covariance matrices intersect along a curved (quadratic) line. Thus, QDA is not a linear classifier.

[Figure: level curves of two Gaussian PDFs with different covariances in the (x1, x2)-plane, separated by a curved decision boundary.]

Parametric and nonparametric models

So far we have looked at a few parametric models,
• linear regression,
• logistic regression,
• LDA and QDA,
all of which are parametrized by a fixed-dimensional parameter.

Non-parametric models instead allow the flexibility of the model to grow with the amount of available data.

+ can be very flexible (= low bias)
− can suffer from overfitting (high variance)
− can be compute- and memory-intensive

As always, the bias-variance trade-off is key also when working with non-parametric models.

k-NN

The k-nearest neighbors (k-NN) classifier is a simple non-parametric method.

Given training data $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$, for a test input $x_\star$:
1. Identify the set

\[
R_\star = \{\, i : x_i \text{ is one of the } k \text{ training data points closest to } x_\star \,\}
\]

2. Classify $x_\star$ according to a majority vote within the neighborhood $R_\star$, i.e. assign $x_\star$ to class m for which
\[
\sum_{i \in R_\star} \mathbb{I}\{y_i = m\}
\]
is largest.
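A direct translation of these two steps into NumPy (a sketch; Euclidean distance is assumed and ties in the vote are broken arbitrarily):

```python
import numpy as np

def knn_classify(x_star, X_train, y_train, k):
    """Majority vote among the k training points closest to x_star (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_star, axis=1)
    neighbors = np.argsort(dists)[:k]                  # indices of the k nearest points
    votes = y_train[neighbors]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```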

ex) k-NN on a toy model

We illustrate the k-NN classifier on a synthetic example where the optimal classifier is known (colored regions).

[Figure: training data from three classes (Y = 1, Y = 2, Y = 3) in the (X1, X2)-plane, overlaid on the optimal decision regions.]

ex) k-NN on a toy model

The decision boundaries for the k-NN classifier are shown as black lines.

[Figure: k-NN decision boundaries for k = 1, k = 10 and k = 100 on the toy data, in the (X1, X2)-plane.]

ex) k-NN on a toy model

[Figure: k-NN decision boundaries for k = 1, k = 10 and k = 100 (same panels as above).]

The choice of the tuning parameter k controls the model flexibility:
• Small k ⇒ small bias, large variance
• Large k ⇒ large bias, small variance
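One common way to navigate this trade-off (not covered on the slide itself) is to evaluate a grid of candidate k values on held-out data and pick the one with the smallest misclassification error. A sketch reusing the knn_classify function above, with a hypothetical validation split X_val, y_val:

```python
import numpy as np

def choose_k(X_train, y_train, X_val, y_val, k_values=(1, 5, 10, 50, 100)):
    """Pick the k with the smallest misclassification error on a validation set."""
    errors = []
    for k in k_values:
        y_hat = np.array([knn_classify(x, X_train, y_train, k) for x in X_val])
        errors.append(np.mean(y_hat != y_val))
    return k_values[int(np.argmin(errors))]
```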

A few concepts to summarize lecture 4

Generative model: A model that describes both the output y and the input x.

Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input x is multivariate Gaussian for each class, with different means but the same covariance. LDA is a linear classifier.

Quadratic discriminant analysis (QDA): Same as LDA, but where a different covariance matrix is used for each class. QDA is not a linear classifier.

Non-parametric model: A model where the flexibility is allowed to grow with the amount of available training data.

k-NN classifier: A simple non-parametric classifier based on classifying a test input x according to the class labels of the k training samples nearest to x.
