CS480 Introduction to Machine Learning: Linear Models for Classification

Edith Law

Based on slides from Joelle Pineau

Probabilistic Framework for Classification

As we saw in the last lecture, learning can be posed as a problem of statistical inference. The premise is that for any learning problem, if we have access to the underlying distribution D, then we could form a Bayes optimal classifier as:

f^(BO)(x̂) = arg max_{ŷ ∈ Y} D(x̂, ŷ)

But since we don't have access to D, we can instead estimate this distribution from data. The probabilistic framework allows you to choose a model to represent your data, and develop algorithms that have inductive biases that are closer to what you, as a designer, believe.

Classification by Density Estimation

The most direct way to construct such a probability distribution is to select a family of parametric distributions. For example, we can select a Gaussian (or Normal) distribution. Gaussian is parametric: its parameters are its mean and variance. The job of learning is to infer which parameters are “best” in terms of describing the observed training data. In density estimation, we assume that the sample data is drawn independently from the same distribution D. That is, your nth sample (xn, yn) from D does not depend on the previous n-1 samples. This is called the i.i.d. (independently and identically distributed) assumption.

Consider the Problem of Flipping a Biased Coin

You are flipping a coin, and you want to find out if it is biased.

This coin has some unknown probability β of coming up Head (H), and some probability (1 − β) of coming up Tail (T). Suppose you flip the coin and observe HHTH. Assuming that each flip is independent, the probability of the data (i.e., the coin flip sequence) is

p(D|β) = p(HHTH|β) = β · β · (1 − β) · β = β^3 (1 − β) = β^3 − β^4

Statistical Estimation

To find the most likely value of β, you want to select parameters that maximize the probability of the data.

This means taking the derivative of p(D|β) with respect to β, setting it to zero, and solving:

d/dβ [β^3 − β^4] = 3β^2 − 4β^3 = 0  ⟺  4β^3 = 3β^2  ⟺  4β = 3  ⟺  β = 3/4

This gives you the maximum likelihood estimate of the parameters of your model.
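To make this concrete, here is a minimal Python sketch (illustrative only, not part of the original slides) that recovers the same estimate by evaluating the likelihood β^3(1 − β) over a grid of candidate values:

```python
# Minimal sketch: recover the MLE for the biased-coin example numerically.
# The observed sequence HHTH gives likelihood p(D|beta) = beta^3 * (1 - beta).
import numpy as np

betas = np.linspace(0.0, 1.0, 100001)      # candidate parameter values
likelihood = betas**3 * (1.0 - betas)      # p(D|beta) for the sequence HHTH

beta_hat = betas[np.argmax(likelihood)]    # grid maximizer of the likelihood
print(beta_hat)                            # approximately 0.75, matching the closed form 3/4
```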

Statistical Estimation

But notice that the likelihood of the data p(D|β) actually has the form of the binomial distribution: β^H (1 − β)^T

The log likelihood of the binomial distribution can be rewritten as: log p(D|β) = log [β^H (1 − β)^T] = H log β + T log(1 − β)

When we differentiate with respect to β and set to zero:

d/dβ log p(D|β) = H/β − T/(1 − β) = 0  ⇒  β = H / (H + T)

The estimate is exactly the fraction of the observed data that comes up heads! The maximum likelihood estimate is nothing more than the relative frequency of observing a certain event (e.g., observing heads).

From Estimation to Inference

P(H|e) = P(e|H) P(H) / P(e)

where P(e|H) is the likelihood and P(e) is the normalizing constant.

Maximum likelihood: h_ML = arg max_{hi} P(e|hi), with prediction P(X|e) ≈ P(X|h_ML)

Maximum a posteriori: h_MAP = arg max_{hi} P(hi|e), where P(hi|e) = P(e|hi) P(hi) / P(e), with prediction P(X|e) ≈ P(X|h_MAP)

Bayesian estimate: prediction P(X|e) = ∑_i P(X|hi) P(hi|e)

Classification

Given a data set D = {(xi, yi)}, i = 1…n, with discrete yi, find a hypothesis which “best fits” the data.

If yi ∈ {0, 1} this is binary classification.

If yi can take more than two values, the problem is called multi-class classification.

Applications for Classification

• Text classification (spam filtering, news filtering, building web directories, etc.)

• Image classification (face detection, object recognition, etc.)

• Prediction of cancer recurrence.

• Financial forecasting.

• Many, many more!

Simple Example

• Given “nucleus size”, predict cancer recurrence.

• Univariate input: X = nucleus size.

• Binary output: Y = {N = 0; R = 1}

• Try: Minimize the least-square error.

[Figure: nucleus size vs. outcome; N = No Recurrence, R = Recurrence]

Predicting a Class from Linear Regression

Here the red line is: Ŷ = X (X^T X)^{-1} X^T Y

How to get a binary output?

• Threshold the output: y ≤ t for No Recurrence, y > t for Recurrence
• Interpret the output as a probability: y = P(Recurrence)

In the probabilistic framework, the goal is to learn/estimate the posterior distribution P(y|x) and use it for classification.

Models for Binary Classification

Two probabilistic approaches:

1. Generative learning
2. Discriminative learning

Generative Learning

Separately model P(x|y) and P(y), and use these, through Bayes' rule, to estimate P(y|x).

Considering Binary Classification Problems

Suppose you are trying to predict whether a movie review is positive based on the words that appear in the review, i.e., p(y = + | x). Generative learning says that to calculate P(y|x), we estimate the likelihood P(x|y) and the prior P(y) and multiply them together.

P(x|y) = P(x1, x2, …, xm | y)
= P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) ⋯ P(xm|y, x1, x2, …, xm−1)   (from the general rules of probability)

The challenge is that this likelihood distribution involves a very large number of parameters.

Naive Bayes Assumption

Assume the xj are conditionally independent given y, i.e.,

p(xj | y, xk) = p(xj | y), ∀ j ≠ k

Example: Thunder is conditionally independent of Rain given Lightning.

[Graphical model: Lightning → Thunder, Lightning → Rain]

P(Thunder |Rain, Lightning) = P(Thunder |Lightning)

Naive Bayes Assumption

Conditional independence assumption: the probability distribution governing a particular feature is conditionally independent of the probability distribution governing any other feature (or any combination/subset of other features), given that we know the label.

[Graphical model: y → x1, x2, x3, …, xm]

Naive Bayes Assumption

p(xj | y, xk) = p(xj | y), ∀ j ≠ k

This greatly simplifies the computation of the likelihood:

P(x|y) = P(x1, x2, …, xm | y)
= P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) ⋯ P(xm|y, x1, x2, …, xm−1)   (from the general rules of probability)
= P(x1|y) P(x2|y) P(x3|y) ⋯ P(xm|y) = ∏_{j=1}^{m} P(xj|y)   (from the Naïve Bayes assumption)

Naive Bayes Assumption

[Graphical model: y → x1, x2, x3, …, xm]

How many parameters do we need to estimate? Assume m binary features.
• Without the NB assumption: 2(2^m − 1) = O(2^m) numbers to describe the model.
• With the NB assumption: 2m = O(m) numbers to describe the model.
This is useful when the number of features is high.
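For example, with m = 3 binary features, the full model needs 2(2^3 − 1) = 14 numbers (a joint distribution over all 2^3 feature configurations, for each of the two classes), whereas the Naive Bayes model needs only 2 × 3 = 6 numbers (one Bernoulli parameter per feature per class); both also need the class prior P(y).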

Training a Naive Bayes Classifier

Learning = Find parameters that maximize the log-likelihood:

P(y|x) ∝ P(y) P(x|y) = ∏_{i=1}^{n} ( P(yi) ∏_{j=1}^{m} P(xi,j | yi) )

• Samples i are independent, so we take the product over n.
• Input features are independent (conditional on y), so we take the product over m.

Suppose that x, y are binary variables and m = 1. Define:

θ1 = P(y = 1)
θj,1 = P(xj = 1 | y = 1)
θj,0 = P(xj = 1 | y = 0)

Training a Naive Bayes Classifier

Likelihood of the output variable: L(θ1) = θ1^y (1 − θ1)^{1−y}

Log likelihood for all parameters:

log L(θ1, θj,1, θj,0 | D) = ∑_{i=1}^{n} [ log P(yi) + ∑_{j=1}^{m} log P(xi,j | yi) ]
= ∑_{i=1}^{n} [ yi log θ1 + (1 − yi) log(1 − θ1)
  + ∑_{j=1}^{m} yi ( xi,j log θj,1 + (1 − xi,j) log(1 − θj,1) )
  + ∑_{j=1}^{m} (1 − yi) ( xi,j log θj,0 + (1 − xi,j) log(1 − θj,0) ) ]

The log likelihood will have another form if the parameters P(x|y) have another form. In this case, we chose Bernoulli. If the data is continuous, we can choose Gaussian. You can choose the distribution based on your knowledge of the problem. The choice of distribution is a form of inductive bias.

Training a Naive Bayes Classifier

To estimate θ1, take the derivative of the log likelihood with respect to θ1, set it to 0, and solve:

∂L/∂θ1 = ∑_{i=1}^{n} ( yi/θ1 − (1 − yi)/(1 − θ1) ) = 0

We get:

θ1 = (1/n) ∑_i yi = (number of examples where y = 1) / (number of examples)

Similarly, we get:

θj,1 = (number of examples where xj = 1 and y = 1) / (number of examples where y = 1)
θj,0 = (number of examples where xj = 1 and y = 0) / (number of examples where y = 0)

The maximum likelihood estimates are just frequencies of events!
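As an illustration, here is a minimal sketch (toy data, not the course's reference code) that computes these frequency estimates directly from a binary data matrix:

```python
# Minimal sketch: maximum likelihood estimates for a Bernoulli Naive Bayes model.
# X is an (n, m) binary feature matrix, y is a length-n binary label vector.
import numpy as np

X = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])   # toy data (hypothetical)
y = np.array([1, 1, 0, 0])

theta_1  = y.mean()                  # P(y = 1): fraction of examples with y = 1
theta_j1 = X[y == 1].mean(axis=0)    # P(xj = 1 | y = 1), one entry per feature
theta_j0 = X[y == 0].mean(axis=0)    # P(xj = 1 | y = 0)

print(theta_1, theta_j1, theta_j0)   # 0.5 [1.  0.5] [0.5 0.5]
```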

Log Likelihood Ratio

Suppose we have 2 classes: y ∊ {0, 1}

What is the probability of a given input x having class y = 1? Consider Bayes' rule:

P(y = 1|x) = P(x, y = 1) / P(x)
= P(x|y = 1) P(y = 1) / [ P(x|y = 1) P(y = 1) + P(x|y = 0) P(y = 0) ]
= 1 / [ 1 + P(x|y = 0) P(y = 0) / ( P(x|y = 1) P(y = 1) ) ]
= 1 / [ 1 + exp( ln( P(x|y = 0) P(y = 0) / ( P(x|y = 1) P(y = 1) ) ) ) ]
= 1 / (1 + exp(−a))

where

a = ln [ P(x|y = 1) P(y = 1) / ( P(x|y = 0) P(y = 0) ) ] = ln [ P(y = 1|x) / P(y = 0|x) ]   (by Bayes' rule; P(x) on top and bottom cancels out)

a is the log-odds ratio of the data being class 1 vs. class 0. It tells us which class (y = 1 vs. y = 0) has higher probability, and we can classify x accordingly.

Naive Bayes Decision Boundary

What does the decision boundary look like for Naive Bayes?

Let’s start from the log-odds ratio:

log [ P(y = 1|x) / P(y = 0|x) ] = log [ P(x|y = 1) P(y = 1) / ( P(x|y = 0) P(y = 0) ) ]
= log [ P(y = 1) / P(y = 0) ] + log [ ∏_{j=1}^{m} P(xj|y = 1) / ∏_{j=1}^{m} P(xj|y = 0) ]
= log [ P(y = 1) / P(y = 0) ] + ∑_{j=1}^{m} log [ P(xj|y = 1) / P(xj|y = 0) ]

Naive Bayes Decision Boundary

Consider the case where features are binary: xj ∈ {0, 1}

Define:
wj,0 = log [ P(xj = 0|y = 1) / P(xj = 0|y = 0) ]
wj,1 = log [ P(xj = 1|y = 1) / P(xj = 1|y = 0) ]

Now we have:

log [ P(y = 1|x) / P(y = 0|x) ] = log [ P(y = 1) / P(y = 0) ] + ∑_{j=1}^{m} log [ P(xj|y = 1) / P(xj|y = 0) ]
= log [ P(y = 1) / P(y = 0) ] + ∑_{j=1}^{m} ( wj,0 (1 − xj) + wj,1 xj )
= log [ P(y = 1) / P(y = 0) ] + ∑_{j=1}^{m} wj,0 + ∑_{j=1}^{m} (wj,1 − wj,0) xj

This is a linear decision boundary! The first two terms form the bias w0, and the last term is w^T x.

Naive Bayes Decision Boundary

This result shows that Naive Bayes with binary features has precisely the form of a linear model. The decision boundary is the set of inputs for which the probability of y = 1 is exactly 50%, i.e., P(y = 1|x) = P(y = 0|x).

Instead of computing the likelihood ratio, it is easier to compute the log-likelihood ratio (LLR).

ln [ P(y = 1|x) / P(y = 0|x) ] = 0
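To see the linearity concretely, the following sketch (with made-up Bernoulli Naive Bayes parameters) builds w0 and w as above and checks that the linear score reproduces the Naive Bayes log-odds:

```python
# Minimal sketch: the Bernoulli Naive Bayes log-odds as a linear function of x.
import numpy as np

p_y1    = 0.4                          # P(y = 1)            (hypothetical values)
p_x1_y1 = np.array([0.8, 0.3, 0.6])    # P(xj = 1 | y = 1)
p_x1_y0 = np.array([0.2, 0.5, 0.7])    # P(xj = 1 | y = 0)

w_j1 = np.log(p_x1_y1 / p_x1_y0)               # weights used when xj = 1
w_j0 = np.log((1 - p_x1_y1) / (1 - p_x1_y0))   # weights used when xj = 0
w0 = np.log(p_y1 / (1 - p_y1)) + w_j0.sum()    # bias collects all the constant terms
w  = w_j1 - w_j0

x = np.array([1, 0, 1])                        # an example input

# Log-odds computed directly from the Naive Bayes factorization.
direct = (np.log(p_y1 / (1 - p_y1))
          + np.sum(x * np.log(p_x1_y1 / p_x1_y0)
                   + (1 - x) * np.log((1 - p_x1_y1) / (1 - p_x1_y0))))

print(np.isclose(w0 + w @ x, direct))          # True: the boundary is linear in x
```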

Text Classification Example

Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection.
• P(y = c) is the probability of class c

• P(xj|y=c) is the probability of seeing word j in documents of class c

The set of classes depends on the application, e.g.
• Topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}.

What happens when a word is not observed in the training data?

Laplace Smoothing

Replace the maximum likelihood estimator:

P(xj = 1 | y = 1) = (number of instances with xj = 1 and y = 1) / (number of examples with y = 1)

with the following:

P(xj = 1 | y = 1) = [ (number of instances with xj = 1 and y = 1) + 1 ] / [ (number of examples with y = 1) + 2 ]

where 2 is the number of distinct values of xj.

To avoid estimates of zero, we add in a number of hallucinated examples spread evenly over the possible values of xj. This acts like a Bayesian prior.
• If there are no examples from that class, the estimate reduces to a prior probability of P = 1/2.
• If all examples have xj = 1, then P(xj = 0|y) = 1 / (#examples + 2).
• If a word appears frequently, the new estimate is only slightly biased.
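A minimal sketch of the smoothed estimate for a binary feature (illustrative only; the function name is made up):

```python
# Minimal sketch: Laplace (add-one) smoothing for a binary feature xj.
def smoothed_estimate(count_xj1_and_y1, count_y1, num_values=2):
    """Estimate P(xj = 1 | y = 1) with one hallucinated example per possible value of xj."""
    return (count_xj1_and_y1 + 1) / (count_y1 + num_values)

print(smoothed_estimate(0, 0))     # 0.5: reduces to a uniform prior when the class is unseen
print(smoothed_estimate(10, 10))   # ~0.92 instead of an overconfident 1.0
```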

Example: 20 Newsgroups

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods.)

Multiclass Classification

Generally two options:
• Learn a single classifier that can produce 20 distinct output values.
• Learn 20 different 1-vs-all binary classifiers.

Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with highest probability.

Option 2 applies to all binary classifiers, so it is more flexible. But it is often slower (we need to learn many classifiers), and it creates a class imbalance problem (the target class has relatively few data points compared to the aggregation of the other classes).

Best Features Extracted

From the MSc thesis by Jason Rennie: http://qwone.com/~jason/papers/sm-thesis.pdf

Gaussian Naive Bayes

Extending Naïve Bayes to continuous inputs:
• P(y) is still assumed to be a binomial distribution.
• P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∈ ℝ^n and covariance matrix Σ ∈ ℝ^{n×n}
  – If we assume the same Σ for all classes: linear discriminant analysis.
  – If Σ is distinct between classes: quadratic discriminant analysis.
  – If Σ is diagonal (i.e., features are independent): Gaussian Naïve Bayes.

• How do we estimate parameters? Derive the maximum likelihood estimators for μ and Σ.

More Generally: Bayesian Networks

If not all conditional independence relations are true, we can introduce new arcs to show which dependencies exist. At each node, we have the conditional probability distribution of the corresponding variable, given its parents. This more general type of graph, annotated with conditional distributions, is called a Bayesian network.

Generative Story

A useful way to develop probabilistic models is to tell a generative story. A generative story is a fictional story that explains how you believe your training data came into existence. You can write down the data likelihood based on this generative story.

Generative Story

For example, consider a multi-class classification problem, with continuous features modelled by independent Gaussians.

For each example n = 1…N

(a) Choose a label yn ∼ Disc(θ)
(b) For each feature d = 1…D, choose a feature value xn,d ∼ N(μ_{yn,d}, σ²_{yn,d})

The generative story can be directly translated into a data likelihood:

p(D) = ∏_n [ θ_{yn} ∏_d ( 1/√(2π σ²_{yn,d}) ) exp( −(xn,d − μ_{yn,d})² / (2σ²_{yn,d}) ) ]
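The generative story doubles as a sampling procedure. Here is a minimal sketch (with made-up values for θ, μ, and σ) that draws a dataset exactly as the story describes:

```python
# Minimal sketch: sampling data according to the generative story above.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])     # class prior for Disc(theta)   (hypothetical)
mu    = np.array([[0.0,  1.0],        # mu[k, d]: mean of feature d for class k
                  [2.0, -1.0],
                  [-2.0, 0.5]])
sigma = np.ones_like(mu)              # sigma[k, d]: std of feature d for class k

N, D = 100, 2
y = rng.choice(len(theta), size=N, p=theta)    # (a) choose a label yn ~ Disc(theta)
x = rng.normal(mu[y], sigma[y])                # (b) choose xn,d ~ N(mu_{yn,d}, sigma^2_{yn,d})
print(x.shape, np.bincount(y))
```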

Models for Binary Classification

Two probabilistic approaches:

1. Generative learning
2. Discriminative learning

Discriminative Learning

Model p(y|x) directly!

Linear Regression: the Generative Story

Suppose you “believe” the relationship between x and y is linear. Of course, not all the data will obey this. So, we can model the noise using something simple like zero-mean Gaussian noise. This leads to the following generative story:

For each example n = 1…N
(a) compute tn = w^T xn + b
(b) choose noise en ∼ N(0, σ²)
(c) return yn = tn + en

Linear Regression: the Generative Story

Based on the additivity of the Gaussian distribution, we can rewrite the generative story in a shorter form:

yn ∼ N(w^T xn + b, σ²)

From this generative story, we can write down the log likelihood of the dataset:

log p(D) = ∑_n [ −(1/2) log(σ²) − (1/(2σ²)) (w^T xn + b − yn)² ]
= −(1/(2σ²)) ∑_n (w^T xn + b − yn)² + const

This is precisely the linear regression model!

Logistic Regression: the Generative Story

In the case of binary classification, using a Gaussian noise model does not make sense, so we can switch to a Bernoulli model. However, the parameter of a Bernoulli is a value between zero and one, so our model must produce such values. A classic approach is to produce a real-valued target, then transform this target into a value between zero and one, so that −∞ maps to 0 and +∞ maps to 1. A function that does this is the logistic function:

σ(z) = 1 / (1 + exp(−z)) = exp(z) / (1 + exp(z))

This is also called the “sigmoid” function due to its S shape.
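A direct translation into code (a small sketch; the clipping is a common numerical safeguard, not something the slides specify):

```python
# Minimal sketch: the logistic (sigmoid) function.
import numpy as np

def sigmoid(z):
    """Map any real value into (0, 1); -inf -> 0, 0 -> 0.5, +inf -> 1."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))   # clip to avoid overflow in exp

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[4.5e-05, 0.5, 1 - 4.5e-05]
```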

Logistic Regression: the Generative Story

Using the logistic function, we can write down a generative story for binary classification:

For each example n = 1…N

(a) compute tn = σ(w ⋅ xn + b)

(b) sample zn ∼ Ber(tn)
(c) return yn = 2zn − 1   (to span −1 to +1)

Logistic Regression: the Generative Story

The log-likelihood of this model is:

log p(D) = ∑_n [ yn log(σ(w^T xn)) + (1 − yn) log(1 − σ(w^T xn)) ]   (for labels yn ∈ {0, 1})
= ∑_n log σ(yn (w^T xn + b))   (for labels yn ∈ {−1, +1})
= −∑_n log[ 1 + exp( −yn (w^T xn + b) ) ]
= −∑_n ℓ^(log)(yn, w^T xn + b)

Logistic loss function:  ℓ^(log)(y, ŷ) = (1/log 2) log(1 + exp(−y ŷ))

The log-likelihood is precisely the negative of (a scaled version of) the logistic loss. This model is called the logistic regression model.

Logistic Regression

Different from the generative approach, logistic regression assumes an exponential parametric form for the distribution P(Y|X):

σ(a) = 1 / (1 + e^{−a})

It then directly estimates the parameters from the training data:

P(Y = 1|X) = 1 / ( 1 + exp(w0 + ∑_{i=1}^{n} wi Xi) )

P(Y = 0|X) = exp(w0 + ∑_{i=1}^{n} wi Xi) / ( 1 + exp(w0 + ∑_{i=1}^{n} wi Xi) )

Logistic Regression

To do classification, we use the log ratio:

1 < P(Y = 0|X) / P(Y = 1|X)

1 < exp(w0 + ∑_{i=1}^{n} wi Xi) / ( 1 + exp(w0 + ∑_{i=1}^{n} wi Xi) ) · ( 1 + exp(w0 + ∑_{i=1}^{n} wi Xi) )

1 < exp(w0 + ∑_{i=1}^{n} wi Xi)

Taking the natural log on both sides, we assign Y = 0 if X satisfies

0 < w0 + ∑_{i=1}^{n} wi Xi   (a linear classification rule!)

and assign Y = 1 otherwise.

Logistic Regression: Decision Boundary

The decision boundary is the set of points for which the log ratio is 0:

ln [ P(Y = 0|X) / P(Y = 1|X) ] = 0

or equivalently:

w0 + ∑_{i=1}^{n} wi Xi = 0

How do we find the weights? We need an objective function to optimize.

Fitting the Weights

Recall:
σ(w^T xi) = probability that yi = 1 (given xi)
1 − σ(w^T xi) = probability that yi = 0 (given xi)

For y∊{0, 1}, the likelihood function is:

P(x1, y1, …, xn, yn | w) = ∏_{i=1}^{n} σ(w^T xi)^{yi} (1 − σ(w^T xi))^{1−yi}   (samples are i.i.d.)

Goal: minimize the negative log-likelihood:

−∑_{i=1}^{n} [ yi log(σ(w^T xi)) + (1 − yi) log(1 − σ(w^T xi)) ]

Gradient Descent for Logistic Regression

Err(w) = −∑_{i=1}^{n} [ yi log(σ(w^T xi)) + (1 − yi) log(1 − σ(w^T xi)) ]

Take the derivative, using the facts that d log(σ)/dσ = 1/σ, dσ(z)/dz = σ(z)(1 − σ(z)), d(1 − σ(z))/dz = −σ(z)(1 − σ(z)), and ∂(w^T x)/∂w = x:

∂Err(w)/∂w = −∑_{i=1}^{n} [ yi (1/σ(w^T xi)) σ(w^T xi)(1 − σ(w^T xi)) xi
  + (1 − yi) (1/(1 − σ(w^T xi))) σ(w^T xi)(1 − σ(w^T xi))(−1) xi ]

= −∑_{i=1}^{n} [ yi (1 − σ(w^T xi)) − (1 − yi) σ(w^T xi) ] xi

= −∑_{i=1}^{n} xi ( yi − σ(w^T xi) )

Apply iteratively to update the weight vector:

w_{k+1} = w_k + α_k ∑_{i=1}^{n} xi ( yi − σ(w_k^T xi) )
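Putting the gradient and the update rule together, a minimal batch gradient descent sketch (hypothetical learning rate and toy data, not the course's reference implementation) might look like this:

```python
# Minimal sketch: batch gradient descent for logistic regression with labels in {0, 1}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """X: (n, d) inputs (include a column of ones for the bias), y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -X.T @ (y - sigmoid(X @ w))   # dErr/dw = -sum_i xi (yi - sigma(w^T xi))
        w = w - lr * grad                    # step in the direction that decreases Err(w)
    return w

# Toy usage: 1-D inputs with a bias column.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic_regression(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))    # [0 0 1 1]
```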

Logistic Regression: Regularization

log p(w | x, y) = −∑_n ℓ^(log)(yn, w^T xn + b) − (1/(2σ²)) ||w||² + const

Maximizing this log posterior corresponds to calculating the MAP estimate for w, under the assumption that the prior distribution P(w) is a normal distribution with mean zero and variance σ² = 1/λ.
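Under this interpretation, the only change to the gradient is an extra λw term contributed by the Gaussian prior. A minimal sketch (λ is a hyperparameter you would have to choose):

```python
# Minimal sketch: gradient of the L2-regularized (MAP) objective for logistic regression.
import numpy as np

def regularized_gradient(X, y, w, lam):
    """Gradient of Err(w) + (lam / 2) * ||w||^2; the Gaussian prior adds the lam * w term."""
    sigma = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -X.T @ (y - sigma) + lam * w
```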

Logistic Regression <=> GNB

There is an interesting relation to the Gaussian Naive Bayes classifier.

Reminder about the GNB assumptions:
• Y is boolean, governed by a Bernoulli distribution, with π = P(Y = 1)
• X = ⟨X1, …, Xn⟩, where each Xi is a continuous random variable
• For each Xi, P(Xi | Y = yk) is a Gaussian distribution N(μik, σi)
• For all i and j ≠ i, Xi and Xj are conditionally independent given Y

And so, the parameters we need to estimate include:

πk = P(Y = yk)
μik = E[Xi | Y = yk]
σ²ik = E[(Xi − μik)² | Y = yk]

Logistic Regression <=> GNB

Using Bayes Rule, we can write

P(Y = 1|X) = P(Y = 1) P(X|Y = 1) / [ P(Y = 1) P(X|Y = 1) + P(Y = 0) P(X|Y = 0) ]

Logistic Regression <=> GNB

P(Y = 1|X) = 1 / [ 1 + P(Y = 0) P(X|Y = 0) / ( P(Y = 1) P(X|Y = 1) ) ]
= 1 / [ 1 + exp( ln( P(Y = 0) P(X|Y = 0) / ( P(Y = 1) P(X|Y = 1) ) ) ) ]
= 1 / [ 1 + exp( ln( P(Y = 0)/P(Y = 1) ) + ∑_i ln( P(Xi|Y = 0)/P(Xi|Y = 1) ) ) ]
= 1 / [ 1 + exp( ln( (1 − π)/π ) + ∑_i ln( P(Xi|Y = 0)/P(Xi|Y = 1) ) ) ]   (*)

Logistic Regression <=> GNB

∑_i ln [ P(Xi|Y = 0) / P(Xi|Y = 1) ]
= ∑_i ln [ (1/√(2πσi²)) exp( −(Xi − μi0)² / (2σi²) ) / ( (1/√(2πσi²)) exp( −(Xi − μi1)² / (2σi²) ) ) ]
= ∑_i ln exp( ( (Xi − μi1)² − (Xi − μi0)² ) / (2σi²) )
= ∑_i ( (Xi − μi1)² − (Xi − μi0)² ) / (2σi²)
= ∑_i ( (Xi² − 2Xi μi1 + μi1²) − (Xi² − 2Xi μi0 + μi0²) ) / (2σi²)
= ∑_i ( 2Xi(μi0 − μi1) + μi1² − μi0² ) / (2σi²)
= ∑_i ( ((μi0 − μi1)/σi²) Xi + (μi1² − μi0²)/(2σi²) )

Substitute back into (*).

Logistic Regression <=> GNB

P(Y = 1|X) = 1 / [ 1 + exp( ln( (1 − π)/π ) + ∑_i ln( P(Xi|Y = 0)/P(Xi|Y = 1) ) ) ]
= 1 / [ 1 + exp( ln( (1 − π)/π ) + ∑_i ( ((μi0 − μi1)/σi²) Xi + (μi1² − μi0²)/(2σi²) ) ) ]
= 1 / [ 1 + exp( w0 + ∑_{i=1}^{n} wi Xi ) ]

where

w0 = ln( (1 − π)/π ) + ∑_i (μi1² − μi0²)/(2σi²)
wi = (μi0 − μi1)/σi²
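As an illustration of this correspondence, here is a small sketch (with assumed GNB parameter estimates) that plugs them into these formulas and checks the result against a direct Bayes-rule computation:

```python
# Minimal sketch: turning Gaussian Naive Bayes parameters into logistic-regression weights.
import numpy as np

pi     = 0.3                    # P(Y = 1)                         (hypothetical estimates)
mu0    = np.array([0.0, 1.0])   # mu_i0: per-feature means for class 0
mu1    = np.array([1.5, -0.5])  # mu_i1: per-feature means for class 1
sigma2 = np.array([1.0, 2.0])   # sigma_i^2: shared per-feature variances

w  = (mu0 - mu1) / sigma2
w0 = np.log((1 - pi) / pi) + np.sum((mu1**2 - mu0**2) / (2 * sigma2))

X = np.array([1.0, 0.0])
p_y1_logistic = 1.0 / (1.0 + np.exp(w0 + w @ X))   # P(Y = 1 | X) in the logistic form above

# Direct check via Bayes rule with Gaussian class-conditional densities.
def gauss(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

num = pi * np.prod(gauss(X, mu1, sigma2))
den = num + (1 - pi) * np.prod(gauss(X, mu0, sigma2))
print(np.isclose(p_y1_logistic, num / den))        # True
```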

Logistic Regression <=> GNB

The parametric form of P(Y|X) used by logistic regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier! In other words, logistic regression is the discriminative counterpart of Gaussian Naive Bayes. Therefore, we can view logistic regression as a closely related alternative to GNB, though the two can produce different results in many cases.

Logistic Regression <=> GNB

Logistic Regression (LR) is called a discriminative classifier; it directly estimates the parameters of P(y|x). Naive Bayes (NB) is called a generative classifier; it estimates the parameters of P(y) and P(x|y), and uses these quantities to compute P(y|x). We showed that one variant of NB, Gaussian Naive Bayes (GNB), asymptotically converges to an identical classifier as logistic regression.
• When the GNB assumptions do not hold, LR and GNB lead to different classifiers, with LR often better than GNB.
• GNB and LR converge toward their asymptotic accuracies at different rates. For example, Ng & Jordan showed that LR outperforms GNB on many datasets when many training examples are available, but GNB outperforms LR when training data is sparse.

https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

Models for Binary Classification

Two probabilistic approaches:

1. Generative learning: separately model P(x|y) and P(y), then use Bayes' rule to estimate P(y|x).

P(y = 1|x) = P(x|y = 1) P(y = 1) / P(x)

2. Discriminative learning: directly estimate P(y|x)

Logistic regression: P(y|x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

What you should know

• basic definition of the classification problem
• the Naive Bayes assumption and how to estimate parameters for NB
• the log-ratio decision boundary
• Laplace smoothing
• derivation of logistic regression
• relationship between Naïve Bayes, Gaussian Naïve Bayes, and logistic regression
