
CIS 520: Machine Learning Spring 2018: Lecture 2

Generative Probabilistic Models for Classification

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline

• Introduction

• Multivariate normal class-conditional densities: Quadratic/linear discriminant analysis (QDA/LDA)

• Conditionally independent features: Naïve Bayes

• Extensions to multiclass classification

1 Introduction

Recall that if we know the joint distribution $D$ from which labeled examples are generated, then we can simply use a Bayes optimal classifier for that distribution. For binary classification (under 0-1 loss), a Bayes optimal classifier is given by $h^*(x) = \mathrm{sign}\big(\eta(x) - \tfrac{1}{2}\big)$, where $\eta(x) = P(Y = +1 \mid X = x)$ is the class probability function under $D$; in other words, given an instance $x$, a Bayes optimal classifier predicts class $+1$ if the probability $\eta(x)$ of a positive label given $x$ is greater than $\tfrac{1}{2}$, and predicts class $-1$ otherwise. Generative probabilistic models estimate the joint probability distribution $D$, usually by estimating the overall class probabilities $p_y = P(Y = y)$ and the class-conditional distributions $p(x|y) \equiv p(x \mid Y = y)$. Here for each class $y$, $p(x|y)$ denotes a (conditional) probability mass function over the instance space $\mathcal{X}$ if $\mathcal{X}$ is discrete, and a (conditional) probability density function over $\mathcal{X}$ if $\mathcal{X}$ is continuous. Such models are said to be generative because they can be used to generate new examples from the distribution (by first sampling a label $y$ with probability $p_y$ and then sampling an instance $x$ according to $p(x|y)$). Given such a generative model, the class probability function $\eta(x)$ can be obtained via Bayes' rule:

$$\eta(x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p(x)} \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,.$$

Once the generative model components $p_y$ and $p(x|y)$ have been estimated from the training data, they are used to construct a 'plug-in' classifier by using the resulting class probability estimate in place of the true function $\eta(x)$ in the form of the Bayes optimal classifier above. Since generative models model the full joint distribution, they often make simplifying assumptions on the form of the distribution in order to obtain reliable estimates from a reasonable number of data points. We will see two examples below: one where the features are continuous and one assumes multivariate normal class-conditional densities, and another where one makes conditional independence assumptions among the features given the labels (the Naïve Bayes assumption).
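As a concrete illustration of the plug-in construction, here is a minimal Python sketch (the function and argument names, such as `plug_in_eta` and `density_pos`, are illustrative and not from the notes): the estimated components are combined via Bayes' rule and thresholded at $\tfrac{1}{2}$.

```python
def plug_in_eta(x, p_pos, density_pos, density_neg):
    """Estimated class probability eta_hat(x) = P(Y = +1 | x), obtained by
    plugging estimated model components into Bayes' rule.

    p_pos        -- estimated class probability p_hat_{+1}
    density_pos  -- callable x -> estimated p_hat(x | +1)
    density_neg  -- callable x -> estimated p_hat(x | -1)
    """
    num = p_pos * density_pos(x)
    return num / (num + (1.0 - p_pos) * density_neg(x))


def plug_in_classifier(x, p_pos, density_pos, density_neg):
    # Predict +1 exactly when the estimated class probability exceeds 1/2.
    return +1 if plug_in_eta(x, p_pos, density_pos, density_neg) > 0.5 else -1
```

Both of the models discussed below (QDA/LDA and naïve Bayes) are instances of this template; they differ only in how the class-conditional distributions $p(x|y)$ are modeled and estimated.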


2 Multivariate Normal Class-Conditional Densities: Quadratic/Linear Discriminant Analysis (QDA/LDA)

Suppose first that our instances are continuous feature vectors, with instance space $\mathcal{X} = \mathbb{R}^d$, and consider a binary classification task with label and prediction spaces $\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$. Assume that for each class $y \in \mathcal{Y}$, the class-conditional density of $x$ given $y$ is a multivariate normal density:

$$p(x|y) \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}} \exp\!\Big(-\frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)\Big)\,,$$

where $\mu_y \in \mathbb{R}^d$ and $\Sigma_y \in \mathbb{R}^{d \times d}$ are the unknown class-conditional mean vector and covariance matrix, respectively. Also, let $p_{+1} = P(Y = +1)$ and $p_{-1} = P(Y = -1) = 1 - p_{+1}$. As discussed above, the conditional probability of a positive class label for any $x \in \mathcal{X}$ can be obtained using Bayes' rule:
$$\eta(x) \;=\; P(Y = +1 \mid x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,,$$
leading to the following Bayes optimal classifier:
$$\begin{aligned}
h^*(x) &= \begin{cases} +1 & \text{if } \dfrac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)} > \dfrac{1}{2} \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \begin{cases} +1 & \text{if } \dfrac{p(x|+1)}{p(x|-1)} > \dfrac{p_{-1}}{p_{+1}} \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \begin{cases} +1 & \text{if } \ln\Big(\dfrac{p(x|+1)}{p(x|-1)}\Big) > \ln\Big(\dfrac{p_{-1}}{p_{+1}}\Big) \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \mathrm{sign}\Big(\ln\frac{p(x|+1)}{p(x|-1)} - \ln\frac{p_{-1}}{p_{+1}}\Big)\,.
\end{aligned}$$
Now, under the above assumption on the class-conditional densities, we have

$$\ln\frac{p(x|+1)}{p(x|-1)} \;=\; \frac{1}{2}\Big[\, x^\top\big(\Sigma_{-1}^{-1} - \Sigma_{+1}^{-1}\big)x \;-\; 2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big)^\top x \;+\; \mu_{-1}^\top\Sigma_{-1}^{-1}\mu_{-1} - \mu_{+1}^\top\Sigma_{+1}^{-1}\mu_{+1} \;+\; \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} \,\Big]\,.$$
This is a quadratic function of $x$, and we can therefore write the Bayes optimal classifier in this case as

$$h^*(x) \;=\; \mathrm{sign}\big(x^\top A x + b^\top x + c\big)\,,$$

where

$$A \;=\; \Sigma_{-1}^{-1} - \Sigma_{+1}^{-1}$$
$$b \;=\; -2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big)$$

$$c \;=\; \mu_{-1}^\top\Sigma_{-1}^{-1}\mu_{-1} - \mu_{+1}^\top\Sigma_{+1}^{-1}\mu_{+1} + \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} - 2\ln\frac{p_{-1}}{p_{+1}}\,.$$
A classifier of this form is often called a degree-2 polynomial threshold classifier, or simply a quadratic classifier. Note that if, in addition, the class-conditional densities have equal covariance matrices, $\Sigma_{+1} = \Sigma_{-1} = \Sigma$, then $A = 0$, and the Bayes optimal classifier becomes a linear threshold classifier, or simply a linear classifier.
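As an illustration, here is a minimal NumPy sketch of this quadratic classifier, assuming the true parameters are known (the function name `qda_bayes_classifier` and its argument names are illustrative):

```python
import numpy as np

def qda_bayes_classifier(x, mu_pos, mu_neg, Sigma_pos, Sigma_neg, p_pos):
    """Bayes optimal quadratic classifier sign(x^T A x + b^T x + c) for
    multivariate normal class-conditional densities with known parameters."""
    Sp_inv = np.linalg.inv(Sigma_pos)
    Sn_inv = np.linalg.inv(Sigma_neg)
    A = Sn_inv - Sp_inv
    b = -2.0 * (Sn_inv @ mu_neg - Sp_inv @ mu_pos)
    c = (mu_neg @ Sn_inv @ mu_neg - mu_pos @ Sp_inv @ mu_pos
         + np.log(np.linalg.det(Sigma_neg) / np.linalg.det(Sigma_pos))
         - 2.0 * np.log((1.0 - p_pos) / p_pos))      # p_{-1} = 1 - p_{+1}
    return +1 if x @ A @ x + b @ x + c > 0 else -1
```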

Of course, in practice, one does not know the parameters of the class-conditional densities, $\mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}$, or the class probability parameter $p_{+1} = P(Y = +1)$. In this case, one estimates these quantities from the given training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, and uses the estimated values $\hat{\mu}_{+1}, \hat{\mu}_{-1}, \hat{\Sigma}_{+1}, \hat{\Sigma}_{-1}, \hat{p}_{+1}$ to obtain a class probability estimate $\hat{\eta}_S(x)$, and a corresponding plug-in classifier $h_S(x) = \mathrm{sign}\big(\hat{\eta}_S(x) - \tfrac{1}{2}\big)$. For example, a natural approach is to use maximum likelihood estimation, which yields

$$\hat{\mu}_y \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} x_i$$
$$\hat{\Sigma}_y \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} (x_i - \hat{\mu}_y)(x_i - \hat{\mu}_y)^\top$$
$$\hat{p}_{+1} \;=\; \frac{m_{+1}}{m}\,,$$

where $m_y = |\{i \in [m] : y_i = y\}|$ denotes the number of training examples with class label $y$ (here $[m]$ denotes the set of integers from 1 through $m$, i.e. $[m] = \{1, \ldots, m\}$). The resulting classifier is called the quadratic discriminant analysis (QDA) classifier.
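A minimal NumPy sketch of these maximum likelihood estimates (the array layout and the name `fit_qda_mle` are assumptions made for illustration):

```python
import numpy as np

def fit_qda_mle(X, y):
    """Maximum likelihood estimates for QDA.

    X -- (m, d) array of instances; y -- (m,) array of labels in {+1, -1}.
    Returns {label: (mu_hat, Sigma_hat)} and p_hat_{+1}.
    """
    m = X.shape[0]
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]                        # training examples with this label
        mu_hat = Xc.mean(axis=0)                  # class-conditional mean estimate
        diff = Xc - mu_hat
        Sigma_hat = diff.T @ diff / Xc.shape[0]   # divide by m_y (MLE, not m_y - 1)
        params[label] = (mu_hat, Sigma_hat)
    p_hat_pos = (y == +1).sum() / m               # p_hat_{+1} = m_{+1} / m
    return params, p_hat_pos
```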

If one assumes the class-conditional covariances are equal, then the maximum likelihood estimate for the common covariance matrix is given by
$$\hat{\Sigma} \;=\; \frac{1}{m}\sum_{i=1}^m (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top\,,$$
and the resulting classifier is given by

$$h_S(x) \;=\; \mathrm{sign}\big(\hat{b}^\top x + \hat{c}\big)\,,$$
where

$$\hat{b} \;=\; -2\,\hat{\Sigma}^{-1}(\hat{\mu}_{-1} - \hat{\mu}_{+1})$$
$$\hat{c} \;=\; \hat{\mu}_{-1}^\top\hat{\Sigma}^{-1}\hat{\mu}_{-1} - \hat{\mu}_{+1}^\top\hat{\Sigma}^{-1}\hat{\mu}_{+1} - 2\ln\Big(\frac{1 - \hat{p}_{+1}}{\hat{p}_{+1}}\Big)\,.$$
This classifier is called the linear discriminant analysis (LDA) classifier.
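A corresponding sketch for LDA, computing the pooled covariance estimate and the coefficients $\hat{b}, \hat{c}$ above (names are illustrative; this is a sketch under the stated assumptions, not a reference implementation):

```python
import numpy as np

def fit_lda_mle(X, y):
    """Fit the LDA classifier h_S(x) = sign(b_hat^T x + c_hat) by maximum likelihood."""
    m = X.shape[0]
    mu_hat = {label: X[y == label].mean(axis=0) for label in (+1, -1)}
    # Pooled MLE covariance: each x_i is centered at the mean of its own class.
    centered = X.astype(float)
    for label in (+1, -1):
        centered[y == label] -= mu_hat[label]
    Sigma_hat = centered.T @ centered / m
    p_hat_pos = (y == +1).sum() / m
    S_inv = np.linalg.inv(Sigma_hat)
    b_hat = -2.0 * S_inv @ (mu_hat[-1] - mu_hat[+1])
    c_hat = (mu_hat[-1] @ S_inv @ mu_hat[-1] - mu_hat[+1] @ S_inv @ mu_hat[+1]
             - 2.0 * np.log((1.0 - p_hat_pos) / p_hat_pos))
    return b_hat, c_hat

def lda_predict(x, b_hat, c_hat):
    return +1 if b_hat @ x + c_hat > 0 else -1
```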

3 Conditionally Independent Features: Naïve Bayes

Above, we assumed a certain parametric form (multivariate normal) for the class-conditional distributions $p(x|y)$. We now consider an alternative, widely used assumption on the class-conditional distributions: namely, that the features are conditionally independent given the labels. This is typically called the Naïve Bayes assumption. We will describe it below for the case of discrete features, although the assumption can also be employed with continuous features.²

² When using Naïve Bayes with continuous features, one usually also assumes a parametric form for the distributions of individual features given labels, $p(x_j|y)$ ($j = 1, \ldots, d$).

Suppose for simplicity that our instances are binary feature vectors, with instance space $\mathcal{X} = \{0,1\}^d$, and consider again a binary classification task with label and prediction spaces $\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$. The class-conditional distributions $p(x|+1)$ and $p(x|-1)$ are now discrete. Clearly, in the general case, each of these distributions, defined over the sample space $\mathcal{X} = \{0,1\}^d$ containing $2^d$ elements, is parametrized by $2^d - 1$ numbers, namely the probabilities of seeing the different elements of $\mathcal{X}$. However, these $2^d - 1$ parameters can be estimated reliably only when all instances in $\mathcal{X}$ have been seen several times, which is unrealistic in a typical learning situation. The Naïve Bayes assumption allows these class-conditional distributions to be represented more compactly. In particular, under Naïve Bayes, we assume that given the class label $y$, the individual features in an instance are conditionally independent, i.e. that each class-conditional probability distribution factors as follows:
$$p(x|y) \;=\; \prod_{j=1}^d p(x_j|y)\,.$$
In this case, one needs to estimate only $d$ parameters for each class-conditional distribution (why is this not $d - 1$?). Denote a random ($d$-dimensional) feature vector as $X = (X_1, \ldots, X_d)$, and for each $y \in \mathcal{Y}$ and $j \in \{1, \ldots, d\}$, denote
$$\theta_{y,j} \;=\; P(X_j = 1 \mid Y = y)\,.$$
Then we can write
$$p(x|y) \;=\; \prod_{j=1}^d \theta_{y,j}^{\,x_j}\,(1 - \theta_{y,j})^{1 - x_j}\,.$$
The conditional probability of a positive label for any $x \in \mathcal{X}$ is again obtained via Bayes' rule:
$$\eta(x) \;=\; P(Y = +1 \mid x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,,$$

where again $p_{+1} = P(Y = +1)$, leading again to the following Bayes optimal classifier:
$$h^*(x) \;=\; \mathrm{sign}\Big(\ln\frac{p(x|+1)}{p(x|-1)} - \ln\frac{p_{-1}}{p_{+1}}\Big)\,.$$
In this case, we have

$$\ln\frac{p(x|+1)}{p(x|-1)} \;=\; \sum_{j=1}^d \bigg[\, x_j \ln\Big(\frac{\theta_{+1,j}}{\theta_{-1,j}}\Big) + (1 - x_j)\ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big) \bigg]\,.$$
This is a linear function of $x$, and we can therefore write the Bayes optimal classifier in this case as
$$h^*(x) \;=\; \mathrm{sign}\big(w^\top x + b\big)\,,$$
where

$$w_j \;=\; \ln\Big(\frac{\theta_{+1,j}}{\theta_{-1,j}}\Big) - \ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big)$$
$$b \;=\; \sum_{j=1}^d \ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big) - \ln\Big(\frac{p_{-1}}{p_{+1}}\Big)\,.$$
As can be seen, this again yields a linear classifier.
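As a sketch, the weights above can be computed directly from the model parameters, assumed known here (the function names are illustrative):

```python
import numpy as np

def naive_bayes_linear_weights(theta_pos, theta_neg, p_pos):
    """Compute (w, b) of the linear Bayes optimal classifier under the naive
    Bayes model, where theta_pos[j] = P(X_j = 1 | Y = +1) and
    theta_neg[j] = P(X_j = 1 | Y = -1)."""
    theta_pos = np.asarray(theta_pos, dtype=float)
    theta_neg = np.asarray(theta_neg, dtype=float)
    log_ratio_ones = np.log(theta_pos / theta_neg)                   # ln(theta_{+1,j} / theta_{-1,j})
    log_ratio_zeros = np.log((1.0 - theta_pos) / (1.0 - theta_neg))  # ln((1-theta_{+1,j}) / (1-theta_{-1,j}))
    w = log_ratio_ones - log_ratio_zeros
    b = log_ratio_zeros.sum() - np.log((1.0 - p_pos) / p_pos)
    return w, b

def naive_bayes_predict(x, w, b):
    # x is a {0,1}-valued feature vector of length d.
    return +1 if w @ x + b > 0 else -1
```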

Again, in practice, one estimates the parameters $\theta_{y,j}$ and $p_{+1}$ from the given training data $S = ((x_1, y_1), \ldots, (x_m, y_m))$ using maximum likelihood estimation, which yields
$$\hat{\theta}_{y,j} \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} \mathbf{1}(x_{ij} = 1)$$
$$\hat{p}_{+1} \;=\; \frac{m_{+1}}{m}\,,$$

where $m_y = |\{i \in [m] : y_i = y\}|$ as before. The resulting plug-in classifier, obtained by substituting these parameter estimates into the expression for the Bayes optimal classifier above, is known as the naïve Bayes classifier.

Exercise. How does the above derivation change if you have $q$-ary features, $\mathcal{X} = \{0, \ldots, q-1\}^d$? How many parameters do you now need to estimate for each class? Do you still get a linear classifier?
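For completeness, here is a minimal sketch of the maximum likelihood fit described above for the binary-feature case (plain MLE, exactly as in the formulas above; names are illustrative). Note, as an aside, that the plain MLE can yield $\hat{\theta}_{y,j}$ equal to 0 or 1, which makes some of the logarithms in $w$ and $b$ infinite; smoothed estimates are a common remedy, though they are not covered in these notes.

```python
import numpy as np

def fit_naive_bayes_mle(X, y):
    """Plain maximum likelihood estimates for the naive Bayes model.

    X -- (m, d) array with entries in {0, 1}; y -- (m,) array of labels in {+1, -1}.
    Returns (theta_hat_pos, theta_hat_neg, p_hat_pos).
    """
    theta_hat = {}
    for label in (+1, -1):
        Xc = X[y == label]
        theta_hat[label] = Xc.mean(axis=0)   # fraction of class-y examples with x_j = 1
    p_hat_pos = (y == +1).mean()             # p_hat_{+1} = m_{+1} / m
    return theta_hat[+1], theta_hat[-1], p_hat_pos
```

The returned estimates can be passed to `naive_bayes_linear_weights` above to obtain the plug-in naïve Bayes classifier.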

4 Extensions to Multiclass Classification

Let us see how things change when there are $K > 2$ classes, say $\mathcal{Y} = [K] = \{1, \ldots, K\}$ (such as in the handwritten digit recognition example, where $K = 10$). In this case, we need to consider the conditional probability of different labels given an instance $x \in \mathcal{X}$. For each $y \in \mathcal{Y}$, let $\eta_y(x) = P(Y = y \mid X = x)$ denote the conditional probability of seeing label $y$ given $x$. Clearly, for all $x$, $\sum_{y=1}^K \eta_y(x) = 1$ (in the binary case, we had $\eta_{+1}(x) = \eta(x)$ and $\eta_{-1}(x) = 1 - \eta(x)$).

What does the optimal classifier look like in this case? This depends on how we measure the performance of a classifier $h: \mathcal{X} \to [K]$. Denoting again the joint distribution over $\mathcal{X} \times \mathcal{Y}$ by $D$, say we define again the accuracy of $h$ w.r.t. $D$ as the probability that an example $(x, y)$ drawn randomly from $D$ is classified correctly by $h$:
$$\mathrm{acc}_D[h] \;=\; P_{(X,Y)\sim D}\big(h(X) = Y\big)\,.$$

Equivalently, we use again the 0-1 loss, defined now over labels and predictions in $\mathcal{Y} = [K]$, giving $\ell_{0\text{-}1}: [K] \times [K] \to \mathbb{R}_+$ defined as
$$\ell_{0\text{-}1}(y, \hat{y}) \;=\; \mathbf{1}(\hat{y} \neq y)\,,$$
with the corresponding 0-1 error of $h$ w.r.t. $D$ defined as

$$\mathrm{er}^{0\text{-}1}_D[h] \;=\; P_{(X,Y)\sim D}\big(h(X) \neq Y\big) \;=\; E_{(X,Y)\sim D}\big[\ell_{0\text{-}1}(Y, h(X))\big]\,.$$

We can write this as

$$\begin{aligned}
\mathrm{er}^{0\text{-}1}_D[h] &= E_{(X,Y)\sim D}\big[\mathbf{1}(h(X) \neq Y)\big]\\
&= E_X\Big[E_{Y|X}\big[\mathbf{1}(h(X) \neq Y)\big]\Big]\\
&= E_X\Big[\sum_{y=1}^K \eta_y(X)\cdot\mathbf{1}(h(X) \neq y)\Big]\\
&= E_X\Big[\sum_{y \neq h(X)} \eta_y(X)\Big]\\
&= E_X\big[1 - \eta_{h(X)}(X)\big]\,.
\end{aligned}$$
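As a quick sanity check of this chain of equalities, the following sketch evaluates the first and last expressions on a small hypothetical discrete distribution (all numbers are made up for illustration):

```python
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])           # hypothetical marginal over 3 instances
eta = np.array([[0.7, 0.2, 0.1],          # eta[i, y] = P(Y = y | X = x_i), K = 3;
                [0.1, 0.6, 0.3],          # each row sums to 1
                [0.3, 0.3, 0.4]])
h = eta.argmax(axis=1)                    # a Bayes optimal classifier (argmax rule)

# First line of the derivation: expected 0-1 loss, summing eta_y(x) over the
# labels that h(x) gets wrong.
err_direct = sum(p_x[i] * eta[i, y]
                 for i in range(3) for y in range(3) if y != h[i])
# Last line of the derivation: E_X[ 1 - eta_{h(X)}(X) ].
err_identity = (p_x * (1.0 - eta[np.arange(3), h])).sum()

assert np.isclose(err_direct, err_identity)   # both equal 0.39 here
```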

The minimum achievable 0-1 error w.r.t. D is therefore

$$\mathrm{er}^{0\text{-}1,*}_D \;=\; \inf_{h:\mathcal{X}\to[K]} \mathrm{er}^{0\text{-}1}_D[h] \;=\; 1 - E_X\Big[\max_{y\in[K]} \eta_y(X)\Big]\,,$$
and is clearly achieved by any classifier $h^*: \mathcal{X} \to [K]$ satisfying

$$h^*(x) \;\in\; \arg\max_{y\in[K]} \eta_y(x)\,.$$

In other words, given an instance $x$, a Bayes optimal classifier here predicts a class $y$ with highest conditional probability $\eta_y(x) = P(Y = y \mid X = x)$ given $x$.³

³ Note that if our loss function assigns a different loss/penalty for different types of mistakes (e.g. if misclassifying a digit 8 as 9 incurs a smaller loss than misclassifying it as 0), then the minimum achievable error as well as the optimal classifier achieving this error will be different. This is true also in the case of binary classification, where for example the cost of mis-diagnosing a cancer patient as normal could be higher than mis-diagnosing a normal patient as having cancer (can you see how the optimal binary classifier would change in this case?). Such problems are often referred to as cost-sensitive classification.

Now, suppose our instances are continuous feature vectors, with instance space $\mathcal{X} = \mathbb{R}^d$, and assume again that for each class $y \in \mathcal{Y}$, the class-conditional density $p(x|y)$ is a multivariate normal density with mean vector $\mu_y$ and covariance matrix $\Sigma_y$. Also, for each $y \in \mathcal{Y}$, let $p_y = P(Y = y)$ denote the overall probability of seeing label $y$. Then the conditional probability of seeing label $y$ for any $x \in \mathcal{X}$ can again be obtained by Bayes' rule:
$$\eta_y(x) \;=\; P(Y = y \mid X = x) \;=\; \frac{p_y \cdot p(x|y)}{\sum_{y'=1}^K p_{y'} \cdot p(x|y')}\,,$$
leading to the following optimal classifier (under the 0-1 loss above):

$$\begin{aligned}
h^*(x) \;&\in\; \arg\max_{y\in[K]} \; \frac{p_y \cdot p(x|y)}{\sum_{y'=1}^K p_{y'} \cdot p(x|y')}\\
&= \arg\max_{y\in[K]} \; p_y \cdot p(x|y)\\
&= \arg\max_{y\in[K]} \; \ln\big(p_y \cdot p(x|y)\big)\\
&= \arg\max_{y\in[K]} \; \Big[\ln p_y - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)\Big]\,.
\end{aligned}$$
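A minimal NumPy sketch of this multiclass rule, evaluating the $K$ discriminant scores and taking an argmax (the function name is illustrative; the parameters may be either the true values or the estimates discussed next):

```python
import numpy as np

def multiclass_qda_predict(x, mus, Sigmas, ps):
    """Predict a label in [K] = {1, ..., K} by maximizing
    f_y(x) = ln p_y - (1/2) ln|Sigma_y| - (1/2)(x - mu_y)^T Sigma_y^{-1} (x - mu_y).

    mus, Sigmas, ps -- length-K lists of class means, covariances, and probabilities.
    """
    scores = []
    for mu, Sigma, p in zip(mus, Sigmas, ps):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # stable computation of ln|Sigma_y|
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu_y)^T Sigma_y^{-1} (x - mu_y)
        scores.append(np.log(p) - 0.5 * logdet - 0.5 * quad)
    return int(np.argmax(scores)) + 1                 # classes are 1-indexed
```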

Again, the parameters $\mu_y, \Sigma_y, p_y$ can be estimated from a given training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ (e.g. using maximum likelihood estimation as before), and a plug-in classifier using the estimated values can then be constructed based on the above. Note that the above classifier amounts to estimating parameters that determine $K$ quadratic functions $f_y: \mathcal{X} \to \mathbb{R}$ for $y \in [K]$, and classifying an instance $x$ according to a label $y$ with largest value of $f_y(x)$. Similarly, if the class-conditional covariances are assumed to be equal, the above classifier will amount to learning parameters determining $K$ linear functions, and classifying according to the largest value.

Exercise. Consider a multiclass classification problem with binary features, $\mathcal{X} = \{0,1\}^d$ and $\mathcal{Y} = [K]$, and assume the features are conditionally independent given the class label. Can you derive the Naïve Bayes classifier in this setting?