
CIS 520: Machine Learning Spring 2018: Lecture 2

Generative Probabilistic Models for Classification

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline

• Introduction

• Multivariate normal class-conditional densities: Quadratic/linear discriminant analysis (QDA/LDA)

• Conditionally independent features: Naïve Bayes

• Extensions to multiclass classification

1 Introduction

Recall that if we know the joint distribution $D$ from which labeled examples are generated, then we can simply use a Bayes optimal classifier for that distribution. For binary classification (under 0-1 loss), a Bayes optimal classifier is given by $h^*(x) = \mathrm{sign}\big(\eta(x) - \tfrac{1}{2}\big)$, where $\eta(x) = P(Y = +1 \mid X = x)$ is the class probability function under $D$; in other words, given an instance $x$, a Bayes optimal classifier predicts class $+1$ if the probability $\eta(x)$ of a positive label given $x$ is greater than $\tfrac{1}{2}$, and predicts class $-1$ otherwise. Generative probabilistic models estimate the joint probability distribution $D$, usually by estimating the overall class probabilities $p_y = P(Y = y)$ and the class-conditional distributions $p(x|y) \equiv p(x \mid Y = y)$. Here for each class $y$, $p(x|y)$ denotes a (conditional) probability mass function over the instance space $\mathcal{X}$ if $\mathcal{X}$ is discrete, and a (conditional) probability density function over $\mathcal{X}$ if $\mathcal{X}$ is continuous. Such models are said to be generative because they can be used to generate new examples from the distribution (by first sampling a label $y$ with probability $p_y$ and then sampling an instance $x$ according to $p(x|y)$). Given such a generative model, the class probability function $\eta(x)$ can be obtained via Bayes' rule:

$$\eta(x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p(x)} \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,.$$

Once the generative model components $p_y$ and $p(x|y)$ have been estimated from the training data, they are used to construct a 'plug-in' classifier by using the resulting class probability estimate in place of the true function $\eta(x)$ in the form of the Bayes optimal classifier above. Since generative models model the full joint distribution, they often make simplifying assumptions on the form of the distribution in order to obtain reliable estimates from a reasonable number of data points. We will see two examples below: one where the features are continuous and one assumes multivariate normal class-conditional densities, and another where one makes conditional independence assumptions among the features given the labels (the Naïve Bayes assumption).
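As a concrete illustration of the plug-in construction, here is a minimal Python sketch (the function and argument names, such as `plug_in_eta` and `density_pos`, are illustrative and not from the notes): the estimated components are combined via Bayes' rule and thresholded at $\tfrac{1}{2}$.

```python
def plug_in_eta(x, p_pos, density_pos, density_neg):
    """Estimated class probability eta_hat(x) = P(Y = +1 | x), obtained by
    plugging estimated model components into Bayes' rule.

    p_pos        -- estimated class probability p_hat_{+1}
    density_pos  -- callable x -> estimated p_hat(x | +1)
    density_neg  -- callable x -> estimated p_hat(x | -1)
    """
    num = p_pos * density_pos(x)
    return num / (num + (1.0 - p_pos) * density_neg(x))


def plug_in_classifier(x, p_pos, density_pos, density_neg):
    # Predict +1 exactly when the estimated class probability exceeds 1/2.
    return +1 if plug_in_eta(x, p_pos, density_pos, density_neg) > 0.5 else -1
```

Both of the models discussed below (QDA/LDA and naïve Bayes) are instances of this template; they differ only in how the class-conditional distributions $p(x|y)$ are modeled and estimated.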


2 Multivariate Normal Class-Conditional Densities: Quadratic/Linear Discriminant Analysis (QDA/LDA)

Suppose first that our instances are continuous feature vectors, with instance space $\mathcal{X} = \mathbb{R}^d$, and consider a binary classification task with label and prediction spaces $\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$. Assume that for each class $y \in \mathcal{Y}$, the class-conditional density of $x$ given $y$ is a multivariate normal density:

$$p(x|y) \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}} \exp\!\Big(-\frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)\Big)\,,$$

where $\mu_y \in \mathbb{R}^d$ and $\Sigma_y \in \mathbb{R}^{d \times d}$ are the unknown class-conditional mean vector and covariance matrix, respectively. Also, let $p_{+1} = P(Y = +1)$ and $p_{-1} = P(Y = -1) = 1 - p_{+1}$. As discussed above, the conditional probability of a positive class label for any $x \in \mathcal{X}$ can be obtained using Bayes' rule:
$$\eta(x) \;=\; P(Y = +1 \mid x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,,$$
leading to the following Bayes optimal classifier:
$$\begin{aligned}
h^*(x) &= \begin{cases} +1 & \text{if } \dfrac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)} > \dfrac{1}{2} \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \begin{cases} +1 & \text{if } \dfrac{p(x|+1)}{p(x|-1)} > \dfrac{p_{-1}}{p_{+1}} \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \begin{cases} +1 & \text{if } \ln\Big(\dfrac{p(x|+1)}{p(x|-1)}\Big) > \ln\Big(\dfrac{p_{-1}}{p_{+1}}\Big) \\ -1 & \text{otherwise} \end{cases}\\[4pt]
&= \mathrm{sign}\Big(\ln\frac{p(x|+1)}{p(x|-1)} - \ln\frac{p_{-1}}{p_{+1}}\Big)\,.
\end{aligned}$$
Now, under the above assumption on the class-conditional densities, we have

$$\ln\frac{p(x|+1)}{p(x|-1)} \;=\; \frac{1}{2}\Big[\, x^\top\big(\Sigma_{-1}^{-1} - \Sigma_{+1}^{-1}\big)x \;-\; 2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big)^\top x \;+\; \mu_{-1}^\top\Sigma_{-1}^{-1}\mu_{-1} - \mu_{+1}^\top\Sigma_{+1}^{-1}\mu_{+1} \;+\; \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} \,\Big]\,.$$
This is a quadratic function of $x$, and we can therefore write the Bayes optimal classifier in this case as

$$h^*(x) \;=\; \mathrm{sign}\big(x^\top A x + b^\top x + c\big)\,,$$

where

$$A \;=\; \Sigma_{-1}^{-1} - \Sigma_{+1}^{-1}$$
$$b \;=\; -2\big(\Sigma_{-1}^{-1}\mu_{-1} - \Sigma_{+1}^{-1}\mu_{+1}\big)$$

$$c \;=\; \mu_{-1}^\top\Sigma_{-1}^{-1}\mu_{-1} - \mu_{+1}^\top\Sigma_{+1}^{-1}\mu_{+1} + \ln\frac{|\Sigma_{-1}|}{|\Sigma_{+1}|} - 2\ln\frac{p_{-1}}{p_{+1}}\,.$$
A classifier of this form is often called a degree-2 polynomial threshold classifier, or simply a quadratic classifier. Note that if, in addition, the class-conditional densities have equal covariance matrices, $\Sigma_{+1} = \Sigma_{-1} = \Sigma$, then $A = 0$, and the Bayes optimal classifier becomes a linear threshold classifier, or simply a linear classifier.
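As an illustration, here is a minimal NumPy sketch of this quadratic classifier, assuming the true parameters are known (the function name `qda_bayes_classifier` and its argument names are illustrative):

```python
import numpy as np

def qda_bayes_classifier(x, mu_pos, mu_neg, Sigma_pos, Sigma_neg, p_pos):
    """Bayes optimal quadratic classifier sign(x^T A x + b^T x + c) for
    multivariate normal class-conditional densities with known parameters."""
    Sp_inv = np.linalg.inv(Sigma_pos)
    Sn_inv = np.linalg.inv(Sigma_neg)
    A = Sn_inv - Sp_inv
    b = -2.0 * (Sn_inv @ mu_neg - Sp_inv @ mu_pos)
    c = (mu_neg @ Sn_inv @ mu_neg - mu_pos @ Sp_inv @ mu_pos
         + np.log(np.linalg.det(Sigma_neg) / np.linalg.det(Sigma_pos))
         - 2.0 * np.log((1.0 - p_pos) / p_pos))      # p_{-1} = 1 - p_{+1}
    return +1 if x @ A @ x + b @ x + c > 0 else -1
```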

Of course, in practice, one does not know the parameters of the class-conditional densities, $\mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}$, or the class probability parameter $p_{+1} = P(Y = +1)$. In this case, one estimates these quantities from the given training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, and uses the estimated values $\hat{\mu}_{+1}, \hat{\mu}_{-1}, \hat{\Sigma}_{+1}, \hat{\Sigma}_{-1}, \hat{p}_{+1}$ to obtain a class probability estimate $\hat{\eta}_S(x)$, and a corresponding plug-in classifier $h_S(x) = \mathrm{sign}\big(\hat{\eta}_S(x) - \tfrac{1}{2}\big)$. For example, a natural approach is to use maximum likelihood estimation, which yields

$$\hat{\mu}_y \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} x_i$$
$$\hat{\Sigma}_y \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} (x_i - \hat{\mu}_y)(x_i - \hat{\mu}_y)^\top$$
$$\hat{p}_{+1} \;=\; \frac{m_{+1}}{m}\,,$$

where $m_y = |\{i \in [m] : y_i = y\}|$ denotes the number of training examples with class label $y$ (here $[m]$ denotes the set of integers from 1 through $m$, i.e. $[m] = \{1, \ldots, m\}$). The resulting classifier is called the quadratic discriminant analysis (QDA) classifier.
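A minimal NumPy sketch of these maximum likelihood estimates (the array layout and the name `fit_qda_mle` are assumptions made for illustration):

```python
import numpy as np

def fit_qda_mle(X, y):
    """Maximum likelihood estimates for QDA.

    X -- (m, d) array of instances; y -- (m,) array of labels in {+1, -1}.
    Returns {label: (mu_hat, Sigma_hat)} and p_hat_{+1}.
    """
    m = X.shape[0]
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]                        # training examples with this label
        mu_hat = Xc.mean(axis=0)                  # class-conditional mean estimate
        diff = Xc - mu_hat
        Sigma_hat = diff.T @ diff / Xc.shape[0]   # divide by m_y (MLE, not m_y - 1)
        params[label] = (mu_hat, Sigma_hat)
    p_hat_pos = (y == +1).sum() / m               # p_hat_{+1} = m_{+1} / m
    return params, p_hat_pos
```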

If one assumes the class-conditional covariances are equal, then the maximum likelihood estimate for the common covariance matrix is given by
$$\hat{\Sigma} \;=\; \frac{1}{m}\sum_{i=1}^m (x_i - \hat{\mu}_{y_i})(x_i - \hat{\mu}_{y_i})^\top\,,$$
and the resulting classifier is given by

$$h_S(x) \;=\; \mathrm{sign}\big(\hat{b}^\top x + \hat{c}\big)\,,$$
where

$$\hat{b} \;=\; -2\,\hat{\Sigma}^{-1}(\hat{\mu}_{-1} - \hat{\mu}_{+1})$$
$$\hat{c} \;=\; \hat{\mu}_{-1}^\top\hat{\Sigma}^{-1}\hat{\mu}_{-1} - \hat{\mu}_{+1}^\top\hat{\Sigma}^{-1}\hat{\mu}_{+1} - 2\ln\Big(\frac{1 - \hat{p}_{+1}}{\hat{p}_{+1}}\Big)\,.$$
This classifier is called the linear discriminant analysis (LDA) classifier.
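A corresponding sketch for LDA, computing the pooled covariance estimate and the coefficients $\hat{b}, \hat{c}$ above (names are illustrative; this is a sketch under the stated assumptions, not a reference implementation):

```python
import numpy as np

def fit_lda_mle(X, y):
    """Fit the LDA classifier h_S(x) = sign(b_hat^T x + c_hat) by maximum likelihood."""
    m = X.shape[0]
    mu_hat = {label: X[y == label].mean(axis=0) for label in (+1, -1)}
    # Pooled MLE covariance: each x_i is centered at the mean of its own class.
    centered = X.astype(float)
    for label in (+1, -1):
        centered[y == label] -= mu_hat[label]
    Sigma_hat = centered.T @ centered / m
    p_hat_pos = (y == +1).sum() / m
    S_inv = np.linalg.inv(Sigma_hat)
    b_hat = -2.0 * S_inv @ (mu_hat[-1] - mu_hat[+1])
    c_hat = (mu_hat[-1] @ S_inv @ mu_hat[-1] - mu_hat[+1] @ S_inv @ mu_hat[+1]
             - 2.0 * np.log((1.0 - p_hat_pos) / p_hat_pos))
    return b_hat, c_hat

def lda_predict(x, b_hat, c_hat):
    return +1 if b_hat @ x + c_hat > 0 else -1
```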

3 Conditionally Independent Features: Naïve Bayes

Above, we assumed a certain parametric form (multivariate normal) for the class-conditional distributions $p(x|y)$. We now consider an alternative, widely used assumption on the class-conditional distributions: namely, that the features are conditionally independent given the labels. This is typically called the Naïve Bayes assumption. We will describe it below for the case of discrete features, although the assumption can also be employed with continuous features.²

² When using Naïve Bayes with continuous features, one usually also assumes a parametric form for the distributions of individual features given labels, $p(x_j|y)$ ($j = 1, \ldots, d$).

Suppose for simplicity that our instances are binary feature vectors, with instance space $\mathcal{X} = \{0,1\}^d$, and consider again a binary classification task with label and prediction spaces $\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$. The class-conditional distributions $p(x|+1)$ and $p(x|-1)$ are now discrete. Clearly, in the general case, each of these distributions, defined over the sample space $\mathcal{X} = \{0,1\}^d$ containing $2^d$ elements, is parametrized by $2^d - 1$ numbers, namely the probabilities of seeing the different elements of $\mathcal{X}$. However, these $2^d - 1$ parameters can be estimated reliably only when all instances in $\mathcal{X}$ have been seen several times, which is unrealistic in a typical learning situation. The Naïve Bayes assumption allows these class-conditional distributions to be represented more compactly. In particular, under Naïve Bayes, we assume that given the class label $y$, the individual features in an instance are conditionally independent, i.e. that each class-conditional probability distribution factors as follows:
$$p(x|y) \;=\; \prod_{j=1}^d p(x_j|y)\,.$$
In this case, one needs to estimate only $d$ parameters for each class-conditional distribution (why is this not $d - 1$?). Denote a random ($d$-dimensional) feature vector as $X = (X_1, \ldots, X_d)$, and for each $y \in \mathcal{Y}$ and $j \in \{1, \ldots, d\}$, denote
$$\theta_{y,j} \;=\; P(X_j = 1 \mid Y = y)\,.$$
Then we can write
$$p(x|y) \;=\; \prod_{j=1}^d \theta_{y,j}^{\,x_j}\,(1 - \theta_{y,j})^{1 - x_j}\,.$$
The conditional probability of a positive label for any $x \in \mathcal{X}$ is again obtained via Bayes' rule:
$$\eta(x) \;=\; P(Y = +1 \mid x) \;=\; \frac{p_{+1} \cdot p(x|+1)}{p_{+1} \cdot p(x|+1) + p_{-1} \cdot p(x|-1)}\,,$$

where again $p_{+1} = P(Y = +1)$, leading again to the following Bayes optimal classifier:
$$h^*(x) \;=\; \mathrm{sign}\Big(\ln\frac{p(x|+1)}{p(x|-1)} - \ln\frac{p_{-1}}{p_{+1}}\Big)\,.$$
In this case, we have

$$\ln\frac{p(x|+1)}{p(x|-1)} \;=\; \sum_{j=1}^d \bigg[\, x_j \ln\Big(\frac{\theta_{+1,j}}{\theta_{-1,j}}\Big) + (1 - x_j)\ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big) \bigg]\,.$$
This is a linear function of $x$, and we can therefore write the Bayes optimal classifier in this case as
$$h^*(x) \;=\; \mathrm{sign}\big(w^\top x + b\big)\,,$$
where

$$w_j \;=\; \ln\Big(\frac{\theta_{+1,j}}{\theta_{-1,j}}\Big) - \ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big)$$
$$b \;=\; \sum_{j=1}^d \ln\Big(\frac{1 - \theta_{+1,j}}{1 - \theta_{-1,j}}\Big) - \ln\Big(\frac{p_{-1}}{p_{+1}}\Big)\,.$$
As can be seen, this again yields a linear classifier.
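As a sketch, the weights above can be computed directly from the model parameters, assumed known here (the function names are illustrative):

```python
import numpy as np

def naive_bayes_linear_weights(theta_pos, theta_neg, p_pos):
    """Compute (w, b) of the linear Bayes optimal classifier under the naive
    Bayes model, where theta_pos[j] = P(X_j = 1 | Y = +1) and
    theta_neg[j] = P(X_j = 1 | Y = -1)."""
    theta_pos = np.asarray(theta_pos, dtype=float)
    theta_neg = np.asarray(theta_neg, dtype=float)
    log_ratio_ones = np.log(theta_pos / theta_neg)                   # ln(theta_{+1,j} / theta_{-1,j})
    log_ratio_zeros = np.log((1.0 - theta_pos) / (1.0 - theta_neg))  # ln((1-theta_{+1,j}) / (1-theta_{-1,j}))
    w = log_ratio_ones - log_ratio_zeros
    b = log_ratio_zeros.sum() - np.log((1.0 - p_pos) / p_pos)
    return w, b

def naive_bayes_predict(x, w, b):
    # x is a {0,1}-valued feature vector of length d.
    return +1 if w @ x + b > 0 else -1
```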

Again, in practice, one estimates the parameters $\theta_{y,j}$ and $p_{+1}$ from the given training data $S = ((x_1, y_1), \ldots, (x_m, y_m))$ using maximum likelihood estimation, which yields
$$\hat{\theta}_{y,j} \;=\; \frac{1}{m_y}\sum_{i:\,y_i = y} \mathbf{1}(x_{ij} = 1)$$
$$\hat{p}_{+1} \;=\; \frac{m_{+1}}{m}\,,$$

where $m_y = |\{i \in [m] : y_i = y\}|$ as before. The resulting plug-in classifier, obtained by substituting these parameter estimates into the expression for the Bayes optimal classifier above, is known as the naïve Bayes classifier.

Exercise. How does the above derivation change if you have $q$-ary features, $\mathcal{X} = \{0, \ldots, q-1\}^d$? How many parameters do you now need to estimate for each class? Do you still get a linear classifier?
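For completeness, here is a minimal sketch of the maximum likelihood fit described above for the binary-feature case (plain MLE, exactly as in the formulas above; names are illustrative). Note, as an aside, that the plain MLE can yield $\hat{\theta}_{y,j}$ equal to 0 or 1, which makes some of the logarithms in $w$ and $b$ infinite; smoothed estimates are a common remedy, though they are not covered in these notes.

```python
import numpy as np

def fit_naive_bayes_mle(X, y):
    """Plain maximum likelihood estimates for the naive Bayes model.

    X -- (m, d) array with entries in {0, 1}; y -- (m,) array of labels in {+1, -1}.
    Returns (theta_hat_pos, theta_hat_neg, p_hat_pos).
    """
    theta_hat = {}
    for label in (+1, -1):
        Xc = X[y == label]
        theta_hat[label] = Xc.mean(axis=0)   # fraction of class-y examples with x_j = 1
    p_hat_pos = (y == +1).mean()             # p_hat_{+1} = m_{+1} / m
    return theta_hat[+1], theta_hat[-1], p_hat_pos
```

The returned estimates can be passed to `naive_bayes_linear_weights` above to obtain the plug-in naïve Bayes classifier.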

4 Extensions to Multiclass Classification

Let us see how things change when there are $K > 2$ classes, say $\mathcal{Y} = [K] = \{1, \ldots, K\}$ (such as in the handwritten digit recognition example, where $K = 10$). In this case, we need to consider the conditional probability of different labels given an instance $x \in \mathcal{X}$. For each $y \in \mathcal{Y}$, let $\eta_y(x) = P(Y = y \mid X = x)$ denote the conditional probability of seeing label $y$ given $x$. Clearly, for all $x$, $\sum_{y=1}^K \eta_y(x) = 1$ (in the binary case, we had $\eta_{+1}(x) = \eta(x)$ and $\eta_{-1}(x) = 1 - \eta(x)$).

What does the optimal classifier look like in this case? This depends on how we measure the performance of a classifier $h: \mathcal{X} \to [K]$. Denoting again the joint distribution over $\mathcal{X} \times \mathcal{Y}$ by $D$, say we define again the accuracy of $h$ w.r.t. $D$ as the probability that an example $(x, y)$ drawn randomly from $D$ is classified correctly by $h$:
$$\mathrm{acc}_D[h] \;=\; P_{(X,Y)\sim D}\big(h(X) = Y\big)\,.$$

Equivalently, we use again the 0-1 loss, defined now over labels and predictions in $\mathcal{Y} = [K]$, giving $\ell_{0\text{-}1}: [K] \times [K] \to \mathbb{R}_+$ defined as
$$\ell_{0\text{-}1}(y, \hat{y}) \;=\; \mathbf{1}(\hat{y} \neq y)\,,$$
with the corresponding 0-1 error of $h$ w.r.t. $D$ defined as

$$\mathrm{er}^{0\text{-}1}_D[h] \;=\; P_{(X,Y)\sim D}\big(h(X) \neq Y\big) \;=\; E_{(X,Y)\sim D}\big[\ell_{0\text{-}1}(Y, h(X))\big]\,.$$

We can write this as

$$\begin{aligned}
\mathrm{er}^{0\text{-}1}_D[h] &= E_{(X,Y)\sim D}\big[\mathbf{1}(h(X) \neq Y)\big]\\
&= E_X\Big[E_{Y|X}\big[\mathbf{1}(h(X) \neq Y)\big]\Big]\\
&= E_X\Big[\sum_{y=1}^K \eta_y(X)\cdot\mathbf{1}(h(X) \neq y)\Big]\\
&= E_X\Big[\sum_{y \neq h(X)} \eta_y(X)\Big]\\
&= E_X\big[1 - \eta_{h(X)}(X)\big]\,.
\end{aligned}$$
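As a quick sanity check of this chain of equalities, the following sketch evaluates the first and last expressions on a small hypothetical discrete distribution (all numbers are made up for illustration):

```python
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])           # hypothetical marginal over 3 instances
eta = np.array([[0.7, 0.2, 0.1],          # eta[i, y] = P(Y = y | X = x_i), K = 3;
                [0.1, 0.6, 0.3],          # each row sums to 1
                [0.3, 0.3, 0.4]])
h = eta.argmax(axis=1)                    # a Bayes optimal classifier (argmax rule)

# First line of the derivation: expected 0-1 loss, summing eta_y(x) over the
# labels that h(x) gets wrong.
err_direct = sum(p_x[i] * eta[i, y]
                 for i in range(3) for y in range(3) if y != h[i])
# Last line of the derivation: E_X[ 1 - eta_{h(X)}(X) ].
err_identity = (p_x * (1.0 - eta[np.arange(3), h])).sum()

assert np.isclose(err_direct, err_identity)   # both equal 0.39 here
```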

The minimum achievable 0-1 error w.r.t. D is therefore

$$\mathrm{er}^{0\text{-}1,*}_D \;=\; \inf_{h:\mathcal{X}\to[K]} \mathrm{er}^{0\text{-}1}_D[h] \;=\; 1 - E_X\Big[\max_{y\in[K]} \eta_y(X)\Big]\,,$$
and is clearly achieved by any classifier $h^*: \mathcal{X} \to [K]$ satisfying

$$h^*(x) \;\in\; \arg\max_{y\in[K]} \eta_y(x)\,.$$

In other words, given an instance $x$, a Bayes optimal classifier here predicts a class $y$ with highest conditional probability $\eta_y(x) = P(Y = y \mid X = x)$ given $x$.³

³ Note that if our loss function assigns a different loss/penalty for different types of mistakes (e.g. if misclassifying a digit 8 as 9 incurs a smaller loss than misclassifying it as 0), then the minimum achievable error as well as the optimal classifier achieving this error will be different. This is true also in the case of binary classification, where for example the cost of mis-diagnosing a cancer patient as normal could be higher than mis-diagnosing a normal patient as having cancer (can you see how the optimal binary classifier would change in this case?). Such problems are often referred to as cost-sensitive classification.

Now, suppose our instances are continuous feature vectors, with instance space $\mathcal{X} = \mathbb{R}^d$, and assume again that for each class $y \in \mathcal{Y}$, the class-conditional density $p(x|y)$ is a multivariate normal density with mean vector $\mu_y$ and covariance matrix $\Sigma_y$. Also, for each $y \in \mathcal{Y}$, let $p_y = P(Y = y)$ denote the overall probability of seeing label $y$. Then the conditional probability of seeing label $y$ for any $x \in \mathcal{X}$ can again be obtained by Bayes' rule:
$$\eta_y(x) \;=\; P(Y = y \mid X = x) \;=\; \frac{p_y \cdot p(x|y)}{\sum_{y'=1}^K p_{y'} \cdot p(x|y')}\,,$$
leading to the following optimal classifier (under the 0-1 loss above):

$$\begin{aligned}
h^*(x) \;&\in\; \arg\max_{y\in[K]} \; \frac{p_y \cdot p(x|y)}{\sum_{y'=1}^K p_{y'} \cdot p(x|y')}\\
&= \arg\max_{y\in[K]} \; p_y \cdot p(x|y)\\
&= \arg\max_{y\in[K]} \; \ln\big(p_y \cdot p(x|y)\big)\\
&= \arg\max_{y\in[K]} \; \Big[\ln p_y - \frac{1}{2}\ln|\Sigma_y| - \frac{1}{2}(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)\Big]\,.
\end{aligned}$$
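A minimal NumPy sketch of this multiclass rule, evaluating the $K$ discriminant scores and taking an argmax (the function name is illustrative; the parameters may be either the true values or the estimates discussed next):

```python
import numpy as np

def multiclass_qda_predict(x, mus, Sigmas, ps):
    """Predict a label in [K] = {1, ..., K} by maximizing
    f_y(x) = ln p_y - (1/2) ln|Sigma_y| - (1/2)(x - mu_y)^T Sigma_y^{-1} (x - mu_y).

    mus, Sigmas, ps -- length-K lists of class means, covariances, and probabilities.
    """
    scores = []
    for mu, Sigma, p in zip(mus, Sigmas, ps):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # stable computation of ln|Sigma_y|
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu_y)^T Sigma_y^{-1} (x - mu_y)
        scores.append(np.log(p) - 0.5 * logdet - 0.5 * quad)
    return int(np.argmax(scores)) + 1                 # classes are 1-indexed
```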

Again, the parameters $\mu_y, \Sigma_y, p_y$ can be estimated from a given training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ (e.g. using maximum likelihood estimation as before), and a plug-in classifier using the estimated values can then be constructed based on the above. Note that the above classifier amounts to estimating parameters that determine $K$ quadratic functions $f_y: \mathcal{X} \to \mathbb{R}$ for $y \in [K]$, and classifying an instance $x$ according to a label $y$ with largest value of $f_y(x)$. Similarly, if the class-conditional covariances are assumed to be equal, the above classifier will amount to learning parameters determining $K$ linear functions, and classifying according to the largest value.

Exercise. Consider a multiclass classification problem with binary features, $\mathcal{X} = \{0,1\}^d$ and $\mathcal{Y} = [K]$, and assume the features are conditionally independent given the class label. Can you derive the Naïve Bayes classifier in this setting?