CIS 520: Machine Learning Spring 2018: Lecture 2
Generative Probabilistic Models for Classification
Lecturer: Shivani Agarwal
Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).
Outline
• Introduction
• Multivariate normal class-conditional densities: Quadratic/linear discriminant analysis (QDA/LDA)
• Conditionally independent features: Naïve Bayes
• Extensions to multiclass classification
1 Introduction
Recall that if we know the joint probability distribution D from which labeled examples are generated, then we can simply use a Bayes optimal classifier for that distribution. For binary classification (under 0-1 loss), a Bayes optimal classifier is given by h∗(x) = sign(η(x) − 1/2), where η(x) = P(Y = +1|X = x) is the class probability function under D; in other words, given an instance x, a Bayes optimal classifier predicts class +1 if the probability η(x) of a positive label given x is greater than 1/2, and predicts class −1 otherwise. Generative probabilistic models estimate the joint probability distribution D, usually by estimating the overall class probabilities p_y = P(Y = y) and the class-conditional distributions p(x|y) ≡ p(x|Y = y). Here, for each class y, p(x|y) denotes a (conditional) probability mass function over the instance space X if X is discrete, and a (conditional) probability density function over X if X is continuous. Such models are said to be generative because they can be used to generate new examples from the distribution (by first sampling a label y with probability p_y and then sampling an instance x according to p(x|y)). Given such a generative model, the class probability function η(x) can be obtained via Bayes' rule:
\[
\eta(x) = \frac{p_{+1} \cdot p(x|+1)}{p(x)} = \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)} \,.
\]
Once the generative model components p_y and p(x|y) have been estimated from the training data, they are used to construct a 'plug-in' classifier, by using the resulting class probability estimate in place of the true class probability function η(x) in the form of the Bayes optimal classifier above. Since generative models model the full joint distribution, they often make simplifying assumptions on the form of the distribution in order to obtain reliable estimates from a reasonable number of data points. We will see two examples below: one where the features are continuous and one assumes multivariate normal class-conditional densities, and another where one makes conditional independence assumptions among the features given the label (the naïve Bayes assumption).
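To make the plug-in construction concrete, here is a minimal sketch of how a generic plug-in classifier combines estimated class probabilities and class-conditional densities via Bayes' rule. The function and argument names (`eta_hat`, `plugin_classifier`, `p_pos`, `density_pos`, `density_neg`) are illustrative and not from the notes.

```python
def eta_hat(x, p_pos, density_pos, density_neg):
    """Estimate of eta(x) = P(Y = +1 | X = x) via Bayes' rule.

    p_pos       : estimated class probability p_{+1}
    density_pos : callable returning the estimated p(x | +1)
    density_neg : callable returning the estimated p(x | -1)
    """
    num = p_pos * density_pos(x)
    den = num + (1.0 - p_pos) * density_neg(x)
    return num / den


def plugin_classifier(x, p_pos, density_pos, density_neg):
    """Plug-in classifier: predict +1 if the estimated eta(x) exceeds 1/2, else -1."""
    return +1 if eta_hat(x, p_pos, density_pos, density_neg) > 0.5 else -1
```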
2 Multivariate Normal Class-Conditional Densities: Quadratic/Linear Discriminant Analysis (QDA/LDA)
Suppose first that our instances are continuous feature vectors, with instance space X = R^d, and consider a binary classification task with label and prediction spaces Y = Ŷ = {±1}. Assume that for each class y ∈ Y, the class-conditional density of x given y is a multivariate normal density:
\[
p(x|y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) \Big),
\]
where µ_y ∈ R^d and Σ_y ∈ R^{d×d} are the unknown class-conditional mean and covariance matrix, respectively. Also, let p_{+1} = P(Y = +1) and p_{−1} = P(Y = −1) = 1 − p_{+1}. As discussed above, the conditional probability of a positive class label for any x ∈ X can be obtained using Bayes' rule:
\[
\eta(x) = P(Y = +1 \,|\, x) = \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)} \,,
\]
leading to the following Bayes optimal classifier:
\[
\begin{aligned}
h^*(x) &= \begin{cases} +1 & \text{if } \dfrac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)} > \dfrac{1}{2} \\ -1 & \text{otherwise} \end{cases} \\[4pt]
&= \begin{cases} +1 & \text{if } \dfrac{p(x|+1)}{p(x|-1)} > \dfrac{p_{-1}}{p_{+1}} \\ -1 & \text{otherwise} \end{cases} \\[4pt]
&= \begin{cases} +1 & \text{if } \ln\dfrac{p(x|+1)}{p(x|-1)} > \ln\dfrac{p_{-1}}{p_{+1}} \\ -1 & \text{otherwise} \end{cases} \\[4pt]
&= \operatorname{sign}\Big( \ln\frac{p(x|+1)}{p(x|-1)} - \ln\frac{p_{-1}}{p_{+1}} \Big).
\end{aligned}
\]
Now, under the above assumption on the class-conditional densities, we have
\[
\ln\frac{p(x|+1)}{p(x|-1)} = \frac{1}{2}\Big( x^\top \big(\Sigma_{-1}^{-1} - \Sigma_{+1}^{-1}\big) x - 2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big)^\top x + \mu_{-1}^\top \Sigma_{-1}^{-1} \mu_{-1} - \mu_{+1}^\top \Sigma_{+1}^{-1} \mu_{+1} + \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} \Big).
\]
This is a quadratic function of x, and we can therefore write the Bayes optimal classifier in this case as
\[
h^*(x) = \operatorname{sign}\big( x^\top A x + b^\top x + c \big),
\]
where
\[
\begin{aligned}
A &= \Sigma_{-1}^{-1} - \Sigma_{+1}^{-1} \\
b &= -2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big) \\
c &= \mu_{-1}^\top \Sigma_{-1}^{-1} \mu_{-1} - \mu_{+1}^\top \Sigma_{+1}^{-1} \mu_{+1} + \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} - 2\ln\frac{p_{-1}}{p_{+1}} \,.
\end{aligned}
\]
A classifier of this form is often called a degree-2 polynomial threshold classifier, or simply a quadratic classifier. Note that if, in addition, the class-conditional densities have equal covariance matrices, Σ_{+1} = Σ_{−1} = Σ, then A = 0, and the Bayes optimal classifier becomes a linear threshold classifier, or simply a linear classifier.
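As an illustration, the following sketch builds the quadratic classifier sign(x^⊤Ax + b^⊤x + c) directly from the formulas for A, b, c above, treating the class-conditional means, covariances, and the class probability p_{+1} as known (in practice they are estimated, as discussed next). The function name `make_quadratic_classifier` is illustrative.

```python
import numpy as np

def make_quadratic_classifier(mu_pos, mu_neg, Sigma_pos, Sigma_neg, p_pos):
    """Return h(x) = sign(x^T A x + b^T x + c) with A, b, c as defined above.

    mu_pos, mu_neg       : class-conditional means, shape (d,)
    Sigma_pos, Sigma_neg : class-conditional covariance matrices, shape (d, d)
    p_pos                : class probability P(Y = +1)
    """
    Si_pos = np.linalg.inv(Sigma_pos)
    Si_neg = np.linalg.inv(Sigma_neg)
    A = Si_neg - Si_pos
    b = -2.0 * (Si_neg @ mu_neg - Si_pos @ mu_pos)
    c = (mu_neg @ Si_neg @ mu_neg
         - mu_pos @ Si_pos @ mu_pos
         + np.log(np.linalg.det(Sigma_neg) / np.linalg.det(Sigma_pos))
         - 2.0 * np.log((1.0 - p_pos) / p_pos))

    def h(x):
        # Predict +1 when the quadratic discriminant is positive, -1 otherwise.
        return +1 if x @ A @ x + b @ x + c > 0 else -1

    return h
```

Note that when Σ_{+1} = Σ_{−1}, the matrix A computed above vanishes and h reduces to a linear classifier, matching the LDA case just described.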
Of course, in practice, one does not know the parameters of the class-conditional densities, µ_{+1}, µ_{−1}, Σ_{+1}, Σ_{−1}, or the class probability parameter p_{+1} = P(Y = +1). In this case, one estimates these quantities from the
given training sample S = ((x_1, y_1), ..., (x_m, y_m)), and uses the estimated values $\hat{\mu}_{+1}, \hat{\mu}_{-1}, \hat{\Sigma}_{+1}, \hat{\Sigma}_{-1}, \hat{p}_{+1}$ to obtain a class probability estimate $\hat{\eta}_S(x)$, and a corresponding plug-in classifier
\[
h_S(x) = \operatorname{sign}\big( \hat{\eta}_S(x) - \tfrac{1}{2} \big).
\]
For example, a natural approach is to use maximum likelihood estimation, which yields
\[
\hat{\mu}_y = \frac{1}{m_y} \sum_{i: y_i = y} x_i
\]
\[
\hat{\Sigma}_y = \frac{1}{m_y} \sum_{i: y_i = y} (x_i - \hat{\mu}_y)(x_i - \hat{\mu}_y)^\top
\]
\[
\hat{p}_{+1} = \frac{m_{+1}}{m} \,,
\]
where $m_y = |\{i \in [m] : y_i = y\}|$ denotes the number of training examples with class label y. The resulting classifier is called the quadratic discriminant analysis (QDA) classifier; a minimal sketch of this estimation step appears below.
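The following sketch, under the assumption that labels are encoded as ±1 and instances are the rows of a NumPy array, computes these maximum likelihood estimates from a training sample. The function name `fit_qda` is illustrative.

```python
import numpy as np

def fit_qda(X, y):
    """Maximum likelihood estimates for the QDA plug-in classifier.

    X : (m, d) array of instances
    y : (m,) array of labels in {+1, -1}
    Returns (mu_pos, mu_neg, Sigma_pos, Sigma_neg, p_pos).
    """
    est = {}
    for label in (+1, -1):
        X_y = X[y == label]                       # examples with class label y
        mu = X_y.mean(axis=0)                     # hat{mu}_y
        centered = X_y - mu
        Sigma = centered.T @ centered / len(X_y)  # hat{Sigma}_y (divides by m_y)
        est[label] = (mu, Sigma)
    p_pos = float(np.mean(y == +1))               # hat{p}_{+1} = m_{+1} / m
    return est[+1][0], est[-1][0], est[+1][1], est[-1][1], p_pos
```

These estimates can then be passed to `make_quadratic_classifier` from the earlier sketch to obtain the QDA plug-in classifier h_S.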
If one assumes that the class-conditional covariances are equal, Σ_{+1} = Σ_{−1} = Σ, then the maximum likelihood estimate for the common covariance matrix is given by
\[
\hat{\Sigma} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top \,,
\]
and the resulting classifier is given by