CS480 Introduction to Machine Learning Linear Models for Classification
Edith Law
Based on slides from Joelle Pineau Probabilistic Framework for Classification
As we saw in the last lecture, learning can be posed a problem of statistical inference. The premise is that for any learning problem, if we have access to the underlying probability distribution D, then we could form a Bayes optimal classifier as:
f (BO)(x)̂ = arg max D(x,̂ y)̂ y∈̂ Y
But since we don't have access to D, we can instead estimate this distribution from data. The probabilistic framework allows you to choose a model to represent your data, and develop algorithms that have inductive biases that are closer to what you, as a designer, believe.
2 Classification by Density Estimation
The most direct way to construct such a probability distribution is to select a family of parametric distributions. For example, we can select a Gaussian (or Normal) distribution. Gaussian is parametric: its parameters are its mean and variance. The job of learning is to infer which parameters are “best” in terms of describing the observed training data. In density estimation, we assume that the sample data is drawn independently from the same distribution D. That is, your nth sample (xn, yn) from D does not depend on the previous n-1 samples. This is called the i.i.d. (independently and identically distributed) assumption.
3 Consider the Problem of Flipping a Biased Coin
You are flipping a coin, and you want to find out if it is biased.
This coin has some unknown probability of β of coming up Head (H), and some probability ( 1 − β ) of coming up Tail (T). Suppose you flip the coin and observed HHTH. Assuming that each flip is independent, the probability of the data (i.e., the coin flip sequence) is
p(D|β) = p(HHTH|β) = β ⋅ β ⋅ (1 − β) ⋅ β = β3(1 − β) = β3 − β4
4 Statistical Estimation
To find the most likely value of β , you want to select parameters to maximize the the probability of the data.
This means taking the derivative of p ( D | β ) with respect to β , set it to zero and solve for δ [β3 − β4] = 3β2 − 4β2 δβ 4β3 = 3β2 ⟺ 4β = 3 3 ⟺ β = 4 This gives you the maximum likelihood estimate of the parameters of your model.
5 Statistical Estimation
But notice that the likelihood of the data p ( D | β ) actually has the form of the binomial distribution: βH(1 − β)T
The log likelihood of the binomial distribution can be rewritten log p(D|β) = log βH(1 − β)T = H log β + T log(1 − β)
When we differentiate with respect to beta, and set to zero. δ H T p(D|β) = − δβ β 1 − β H ⇒ β = The estimate is equivalent to the fraction of H + T the observed data that comes up heads! Maximum Likelihood Estimate is nothing more than the relative frequency of observing a certain event (e.g., observing heads). 6 From Estimation to Inference
likelihood prior probability P(e|H)P(H) P(H|e) = P(e) posterior probability normalizing constant
Maximum likelihood Maximum a Posteriori Bayesian Estimate
P(e|hi)P(hi) hML = arg max P(e|hi) hMAP = arg max P(hi |e) P(hi |e) = hi hi P(e)
P(X|e) ≈ P(X|hML) P(X|e) ≈ P(X|hMAP) P(X|e) = ∑ P(X|hi) P(hi |e) i 7 Classification
Given data set D=
If yi ∈ {0, 1} this is binary classification.
If yi can take more than two values, the problem is called multi-class classification.
8 Applications for Classification
• Text classification (spam filtering, news filtering, building web dir, etc.)
• Image classification (face detection, object recognition, etc.)
• Prediction of cancer recurrence.
• Financial forecasting.
• Many, many more!
9 Simple Example
• Given “nucleus size”, predict cancer recurrence.
• Univariate input: X = nucleus size.
• Binary output: Y = {N = 0; R = 1}
• Try: Minimize the least-square error.
N=NoRecurrence R=Recurrence
10 Predicting a Class from Linear Regression
Here red line is: Y’ = X (XTX)-1 XTY
How to get a binary output?
•Threshold the output: { y <= t for NoRecurrence, y > t for Recurrence} •Interpret output as probability: y = P(Recurrence)
In the probabilistic framework, the goal is to learn/estimate the posterior distribution P(y|x) and using it for classification.
11 Models for Binary Classification
Two probabilistic approaches:
1. Generative learning 2. Discriminative learning
12 Models for Binary Classification
Two probabilistic approaches:
1. Generative learning 2. Discriminative learning
13 Generative Learning
Separately models P(x|y) and P(y), use these, through Bayes rule to estimate P(y|x).
14 Considering Binary Classification Problems
Suppose you are trying to predict whether a movie review is positive based on the words that appear in the review. p(y = + |x) Generative learning says that to calculate P(y|x), we estimate the likelihood P(x|y) and the prior P(y) and multiply them together.
P(x|y) = P(x1, x2, …, xm |y)
= P(x1 |y) P(x2 |y, x1) P(x3 |y, x1, x2), …, P(xm |y, x1, x2, …, xm−1) (from general rules of probabilities)
The challenge is that this likelihood distribution is over a lot of parameters.
15 Naive Bayes Assumption
Assume the xj are conditionally independent given y, i.e.,
p(xj |y, xk) = p(xj |y), ∀j, k
Example: Thunder is conditionally independent of Rain given Lightning.
Lightning
Thunder Rain
P(Thunder |Rain, Lightning) = P(Thunder |Lightning)
16 Naive Bayes Assumption
Conditional Independence Assumption: the probability distribution governing a particular feature is conditionally independent to the probability distribution governing another feature, as well as any combinations/subsets of other features, given that we know the label.
y
x1 x2 x3 … xm
17 Naive Bayes Assumption
p(xj |y, xk) = p(xj |y), ∀j, k
This greatly simplifies the computation of the likelihood:
P(x|y) = P(x1, x2, …, xm |y)
= P(x1 |y) P(x2 |y, x1) P(x3 |y, x1, x2), …, P(xm |y, x1, x2, …, xm−1) (from general rules of probabilities) M = P(x1 |y) P(x2 |y) P(x3 |y)…P(xm |y) = ∏P(xi |y) (from the Naïve Bayes assumption) i=1
18 Naive Bayes Assumption
y
x1 x2 x3 … xm
How many parameters to estimate? Assume m binary features. •without NB assumption: 2(2m-1)=O(2m) numbers to describe model. •with NB assumption: 2m=O(m) numbers to describe model. Useful when the number of features is high.
19 Training a Naive Bayes Classifier
Learning = Find parameters that maximize the log-likelihood:
n m P(y|x) ∝ P(y)P(x|y) = ∏(P(yi)∏P(xi,j |yi)) i=1 j=1 •Samples i are independent, so we take product over n. •Input features are independent (cond. on y) so we take product over m.
Suppose that x, y are binary variables, m=1. y Define:
θ1 = P(y = 1) θj,1 = P(xj = 1|y = 1)
θj,0 = P(xj = 1|y = 0) x
20 Training a Naive Bayes Classifier
Likelihood of the output variable: y 1−y L(θ1) = θ1(1 − θ1) Likelihood for all parameters: n m log L(θ1, θi,1, θi,0 |D) = ∑ [ log P(yi) + ∑ log P(xi,j |yi) ] i j=1 n = ∑ [ yi log θ1 + (1 − yi) log(1 − θ1) ] i m +∑ yi(xi,j log θi,1 + (1 − xi,j)log(1 − θi,1)) j=1 m +∑ (1 − yi)(xi,j log θi,0 + (1 − xi,j)log(1 − θi,0) j=1 The log likelihood will have other form if params P(x|y) have other form. In this case, we chose Bernouilli. If the data is continuous, we can choose Gaussian. You can choose the distribution based on your knowledge of the problem. The choice of distribution is a form of inductive bias. 21 Training a Naive Bayes Classifier
To estimate θ 1 , take derivative of the log likelihood θ 1 , set to 0 and solve: n δL yi 1 − yi = ∑ ( − ) = 0 δθ1 i=1 θ1 1 − θ1 1 n We get: number of examples where y=1 θ1 = ∑ yi = n i number of examples Similarly, we get:
number of examples where xj=1 and y=1 θj,1 = frequencies of number of examples where y=1 events! number of examples where x =1 and y=0 θ = j j,0 number of examples where y=0
22 Log Likelihood Ratio
Suppose we have 2 classes: y ∊ {0, 1}
What is the probability of a given input x having class y = 1? Consider Bayes rule: P(x, y = 1) P(x|y = 1)P(y = 1) P(y = 1|x) = = P(x) P(x|y = 1)P(y = 1) + P(x|y = 0)p(y = 0) 1 1 1 = = P(x|y = 0)P(y = 0) P(x|y = 0)P(y = 0) = 1 + exp ln 1 + exp(−a) 1 + P(x|y = 1)P(y = 1) ( P(x|y = 1)P(y = 1) ) where P(x|y = 1)P(y = 1) P(y = 1|x) (By Bayes rule; P(x) on top and a = ln = ln P(x|y = 0)P(y = 0) P(y = 0|x) bottom cancels out.)
� is the log-odds ratio of data being class 1 vs. class 0. It tells us which class (y=1 vs y=0) has higher probability and we can classify x accordingly. 23 Naive Bayes Decision Boundary
What does the decision boundary look like for Naive Bayes?
Let’s start from the log-odds ratio:
P(y = 1|x) P(x|y = 1)P(y = 1) log = log P(y = 0|x) P(x|y = 0)P(y = 0) m P(y = 1) ∏j=1 P(xj |y = 1) = log + log m P(y = 0) ∏j=1 P(xj |y = 0)
m P(y = 1) P(xj |y = 1) = log + ∑ log P(y = 0) j=1 P(xj |y = 0)
24 Naive Bayes Decision Boundary
Consider the case where features are binary: xj = {0, 1}
Define: P(xj = 0|y = 1) P(xj = 1|y = 1) wj,0 = log wj,1 = log P(xj = 0|y = 0) P(xj = 1|y = 0)
Now we have: m P(y = 1|x) P(y = 1) P(xj |y = 1) log = log + ∑ log P(y = 0|x) P(y = 0) j=1 P(xj |y = 0) P(y = 1) m = log + ∑ (wj,0(1 − xj) + wj,1xj) P(y = 0) j=1 P(y = 1) m m = log + ∑ wj,0 + ∑ (wj,1 − wj,0)xj P(y = 0) j=1 j=1 This is a linear decision boundary! w0 wTx 25 Naive Bayes Decision Boundary
Results show that Naive Bayes with binary features has precisely the form of a linear model. The decision boundary is the set of inputs for which the likelihood of y=1 is precisely 50%, or P(y = 1|x) = 1 P(y = 0|x)
Instead of computing the likelihood ratio, it is easier to compute the log-likelihood ratio (LLR).
P(y = 1|x) ln = 0 P(y = 0|x)
26 Text Classification Example
Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. • P(y=c) is the probability of class c
• P(xj|y=c) is the probability of seeing word j in documents of class c
Set of classes depends on the application, e.g. •Topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}.
What happens when a word is not observed in the training data?
27 Laplace Smoothing
Replace the maximum likelihood estimator:
number of instance with xj=1 and y=1 P(x |y = 1) = j number of examples with y=1 With the following:
(number of instance with xj=1 and y=1) + 1 P(xj |y = 1) = (number of examples with y=1) + 2 # distinct values of Xj
To avoid estimates of zero, we add in a number of hallucinated examples spread even over the possible values of xj. It acts like a Bayesian prior. •If no example from that class, it reduces to a prior probability or P=1/2.
•If all examples have xj=1, then P(xj=0|y) = 1 / (#examples + 1). •If a word appears frequently, the new estimate is only slightly biased.
28 Example: 20 Newsgroups
Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:
Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods.)
29 Multiclass Classification
Generally two options: •Learn a single classifier that can produce 20 distinct output values. •Learn 20 different 1-vs-all binary classifiers.
Option 1 assumes you have a multi-class version of the classifier. For Naïve Bayes, compute P(y|x) for each class, and select the class with highest probability.
Option 2 applies to all binary classifiers, so more flexible. But often slower (need to learn many classifiers), and creates a class imbalance problem (the target class has relatively fewer data points, compared to the aggregation of the other classes.)
30 Best Features Extracted
From MSc thesis by Jason Rennie http://qwone.com/~jason/papers/sm-thesis.pdf 31 Gaussian Naive Bayes
Extending Naïve Bayes to continuous inputs: • P(y) is still assumed to be a binomial distribution. • P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ∊ℜn and covariance matrix Σ ∊ ℜnxℜn – If we assume the same Σ for all classes: Linear discriminant analysis. – If Σ is distinct between classes: Quadratic discriminant analysis. – If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes.
• How do we estimate parameters? Derive the maximum likelihood estimators for μ and Σ.
32 More Generally: Bayesian Networks
If not all conditional independence relations are true, we can introduce new arcs to show what dependencies are there. At each node, we have the conditional probability distribution of the corresponding variable, given its parents. This more general type of graph, annotated with conditional distributions, is called a Bayesian network.
33 Generative Story
A useful way to develop probabilistic models is to tell a generative story. Generative story is a fictional story that explains how you believe your training data came into existence. You can write down the data likelihood based on this generative story.
34 Generative Story
For example, consider a multi-class classification problem, with continuous features modelled by independent Gaussians.
For each example n = 1…N
(a) Choose a label yn ∼ Disc(θ) (b) For each feature d = 1 … D , Choose feature value 2 xn,d ∼ N(μyn,d, σyn,d) The generative story can be directly translated into a likelihood function:
1 1 p(D) = θ exp[− (x − μ )2] yn 2 n,d yn,d ∏ ∏ 2 2π n d yn,d 2πσyn,d
35 Models for Binary Classification
Two probabilistic approaches:
1. Generative learning 2. Discriminative learning
36 Discriminative Learning
Model p(y|x) directly!
37 Linear Regression: the Generative Story
Suppose you “believe” the relationship between x and y is linear. Of course, not all the data will obey this. So, we can model the noise using something simple like zero-mean Gaussian noise. This leads to the following generative story: For each example n = 1…N T (a) compute tn = w xn + b 2 (b) choose noise en ∼ N(0,σ )
(c) return yn = tn + en
38 Linear Regression: the Generative Story
Based on the additivity of Gaussian distribution, we can rewrite the generative story in a shorter format:
T 2 yn ∼ N(w xn + b, σ ) From this generative story, we can write down the log likelihood of the dataset: 1 1 log p(D) = − log(σ2) − (wT x + b − y )2 ∑ [ 2 n n ] n 2 2σ 1 2 = wT x + b − y + const 2 ∑ ( n n) 2σ n
This is precisely the linear regression model!
39 Logistic Regression: the Generative Story
In the case of binary classification, using Gaussian noise model does not make sense, so we can switch to a Bernoulli model. However, the parameter of a Bernoulli is a value between zero and one, so our model must produce such values. A classic approach is to produce a real-valued target, then transform this target into a value between zero and one, so − ∞ maps to 0, and + ∞ maps to 1. A function that does this is the logistic function:
1 exp z σ(z) = = 1 + exp[−z] 1 + exp z also called “sigmoid” function due to S shape
40 Logistic Regression: the Generative Story
Using the logistic function, we can write down a generative story for binary classification: For each example n = 1…N
(a) compute tn = σ(w ⋅ xn + b)
(b) compute zn = Ber(tn) (to span -1 to +1) (c) return yn = 2zn − 1
41 Logistic Regression: the Generative Story
The log-likelihood of this model is:
T T log p(D) = ∑ yn log(σ(w xn)) + (1 − yn) log(1 − σ(w xn)) n T = ∑ log σ(yn(w xn + b)) n T = − ∑ log[1 + exp( − yn(w xn + b))] n (log) T = − ∑ l (yn, w xn + b) Logistic Loss Function n 1 llog(y, y)̂ = log(1 + exp[−yy]̂ ) log 2
The log-likelihood is precisely the negative of (a scaled version) of the logistic loss. This model is called the logistic regression model.
42 Logistic Regression
Different from the generative approach, Logistic Regression assumes an exponential parametric form for the distribution P(Y|X),.
1 σ(a) = 1 + e−a then directly estimates its parameters from the training data 1 P(Y = 1|X) = n 1 + exp(w0 + ∑i=1 wiXi)
n exp(w0 + ∑i=1 wiXi) P(Y = 0|X) = n 1 + exp(w0 + ∑i=1 wiXi)
43 Logistic Regression
To do classification, we use the log ratio:
P(Y = 0|X) 1 < P(Y = 1|X)
n n exp(w0 + ∑i=1 wiXi) 1 < n ⋅ 1 + exp(w0 + ∑ wiXi) 1 + exp(w0 + ∑i=1 wiXi) i=1 n 1 < exp(w0 + ∑ wiXi) i=1 Taking the natural log on both sides, we can assign Y=0 if X satisfies: n linear classification rule! 0 < w0 + ∑ wiXi i=1 and assigns Y=1 otherwise. 44 Logistic Regression: Decision Boundary
The decision boundary is the set of points for which the log ratio is 0:
P(Y = 0|X) ln = 0 P(Y = 1|X) or equivalently: n w0 + ∑ wiXi = 0 i=1
How do we find the weights? Need an optimization function. 45 Fitting the Weights
Recall: T σ(w xi) = probability that yi=1 (given xi) T = probability that y =0 (given x ) 1 − σ(w xi) i i
For y∊{0, 1}, the likelihood function is:
n T yi T (1−yi) (samples are i.i.d.) P(x1, y1, …, xn, yn |w) = ∏σ(w xi) (1 − σ(w xi)) i=1 Goal: Minimize the log-likelihood:
n T T −∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) i=1
46 Gradient Descent for Logistic Regression
n T T Err(w) = − ∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) [ i=1 ] δlog(σ)/δw=1/σ Take the derivative:
δErr(w) n 1 = − y 1 − σ(wT x ) σ(wT x )x δw ∑ i( T )( i ) i i [ i=1 σ(w xi)
47 Gradient Descent for Logistic Regression
n T T Err(w) = − ∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) [ i=1 ]
Take the derivative: δσ/δw=σ(1-σ)
δErr(w) n 1 = − y 1 − σ(wT x ) σ(wT x )x δw ∑ i( T )( i ) i i [ i=1 σ(w xi)
48 Gradient Descent for Logistic Regression
n T T Err(w) = − ∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) [ i=1 ]
Take the derivative: δwTx/δw=x
δErr(w) n 1 = − y 1 − σ(wT x ) σ(wT x )x δw ∑ i( T )( i ) i i [ i=1 σ(w xi)
49 Gradient Descent for Logistic Regression
n T T Err(w) = − ∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) [ i=1 ]
δ(1-σ)/δw= Take the derivative: (1-σ)σ(-1) δErr(w) n 1 = − y 1 − σ(wT x ) σ(wT x )x δw ∑ i( T )( i ) i i [ i=1 σ(w xi) 1 +(1 − y ) 1 − σ(wT x ) σ(wT x )(−1)x i ( T )( i ) i i 1 − σ(w xi) ]
50 Gradient Descent for Logistic Regression
n T T Err(w) = − ∑ yi log(σ(w xi)) + (1 − yi) log(1 − σ(w xi)) [ i=1 ]
Take the derivative: n δErr(w) T = −∑ xi(yi − σ(w xi)) δw i=1
Apply iteratively to update the weight vector:
T wi+1 = wi + αkxi(yi − σ(w xi))
51 Logistic Regression: Regularization
1 log p(w|x, y) = l(log)(y , wT x + b) − ||w||2 ∑ n n 2 n 2σ
Maximizing this log posterior corresponds to calculating the MAP estimate for w, under the assumption that the prior distribution P(W) is a normal distribution with mean zero and variance = 1/ λ .
52 Logistic Regression
53 Logistic Regression <=> GNB
There is an interesting relation to Gaussian Naive Bayes classifier.
Reminder about GNB assumptions: •Y is boolean, governed by Bernoulli distribution, with π = P(Y = 1) • where each Xi is a continuous random variable X = ⟨X1, …, Xn⟩ •For each Xi, P ( X i | Y = y k ) is a Gaussian distribution N(μik, σi)
•For all i and j != i, Xi and Xj are conditionally independent given Y
And so, the parameters we need to estimate include:
μik = �[Xi |Y = yk] πk = P(Y = yk) 2 2 σik = �[(Xi − μik) |Y = yk]
54 Logistic Regression <=> GNB
Using Bayes Rule, we can write
P(Y = 1)P(X|Y = 1) P(Y = 1|X) = P(Y = 1)P(X|Y = 1) + P(Y = 0)P(X|Y = 0)
55 Logistic Regression <=> GNB
1 P(Y = 1|X) = P(Y = 0)P(X|Y = 0) 1 + P(Y = 1)P(X|Y = 1) 1 = P(Y = 0)P(X|Y = 0) 1 + exp ln ( P(Y = 1)P(X|Y = 1) ) 1 = P(Y = 0) P(X |Y = 0) 1 + exp ln + ∑ ln i ( P(Y = 1) i P(Xi |Y = 1) ) 1 = 1 − π P(Xi |Y = 0) (*) 1 + exp ln + ∑ ln ( π i P(Xi |Y = 1) )
56 Logistic Regression <=> GNB
2 1 −(Xi − μi0) exp 2 2 ( 2σi ) P(X |Y = 0) 2πσi ln i = ln ∑ ∑ −(X − μ )2 P(Xi |Y = 1) 1 i i1 i i exp 2 2 ( 2σi ) 2πσi (X − μ )2 − (X − μ )2 = ln exp i i1 i i0 ∑ ( 2 ) i 2σi (X − μ )2 − (X − μ )2 = i i1 i i0 ∑ ( 2 ) i 2σi (X2 − 2X μ + μ2 ) − (X2 − 2X μ + μ2 ) = i i i1 i1 i i i0 i0 ∑ 2 i 2σi 2X (μ − μ ) + μ2 − μ2 = i i0 i1 i1 i0 ∑ 2 i 2σi μ − μ μ2 − μ2 = i0 i1 X + i1 i0 ∑ ( 2 i 2 ) Substitute back into (*) i σi 2σi
57 Logistic Regression <=> GNB
1 P(Y = 1|X) = P(X |Y = 0) 1 + exp ln 1 − π + ∑ ln i ( π i P(Xi |Y = 1) ) 1 = 2 2 1 − π μi0 − μi1 μi1 − μi0 1 + exp ln π + ∑i ( 2 Xi + 2 ) ( σi 2σi ) 1 = n 1 + exp w0 + ∑ wiXi ( i=1 ) where 2 2 1 − π μ − μ μi0 − μi1 w = ln + i1 i0 w = 0 ∑ 2 i 2 π i 2σi σi
58 Logistic Regression <=> GNB
The parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier! In other words, Logistic Regression is the discriminant equivalent of Gaussian Naive Bayes. Therefore, we can view Logistic Regression as a closely related alternative to GNB, though the two can produce different results in many cases.
59 Logistic Regression <=> GNB
Logistic Regression (LR) is called a discriminative classifier; it directly estimates the parameters of P(y|x). Naive Bayes (NB) is called a generative classifier; it estimates the parameters for P(y) and P(x|y), and use these quantities to compute P(y|x). We showed that one variant of NB, Gaussian Naive Bayes (GNB), asymptotically converges to an identical classifier as logistic regression. •When the GNB assumptions do not hold, LR and GNB leads to different classifiers, with LR often better than GNB. •GNB and LR converge toward their asymptotic accuracies at different rates. e.g., Ng & Jordan showed that LR outperforms GNB in many datasets when many training examples are available, but GNB outperforms LR when training data is sparse.
https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf 60 Models for Binary Classification
Two probabilistic approaches:
1. Generative learning: separately models P(x|y) and P(y), use these, through Bayes rule to estimate P(y|x).
P(x|y = 1) P(y = 1) P(y = 1|x) = P(x)
2. Discriminative learning: directly estimate P(y|x)
Logistic Regression 1 P(y|x) = σ(wx) = 1 + e−wT x
61 What you should know
• basic definition of classification problem • naive bayes assumption and how to estimate parameters for NB • log ratio decision boundary • Laplace smoothing • derivation of logistic regression • relationship between Naïve Bayes, Gaussian Naïve Bayes, logistic regression.
62