
Binary Classification

Nicholas Ruozzi, University of Texas at Dallas

Slides adapted from David Sontag and Vibhav Gogate

• Input: $(x_1, y_1), \ldots, (x_n, y_n)$

– $x_i$ is the $i$-th data item and $y_i$ is the $i$-th label

• Goal: find an $f$ such that $f(x_i)$ is a “good approximation” to $y_i$

– Can use it to predict $y$ values for previously unseen $x$ values

Examples of Supervised Learning

• Spam email detection

• Handwritten digit recognition

• Stock market prediction

• More?

Supervised Learning

• Hypothesis space: set of allowable functions $f: X \to Y$

• Goal: find the “best” element of the hypothesis space

– How do we measure the quality of $f$?

Regression

Hypothesis class: linear functions $f(x) = ax + b$

Squared loss is used to measure the error of the approximation

• In typical regression applications, measure the fit using a squared loss function

$$L(f, y_i) = \big(f(x_i) - y_i\big)^2$$

• Want to minimize the average loss on the training data

• For linear regression, the learning problem is then

$$\min_{a,b} \; \frac{1}{n} \sum_i \big(a x_i + b - y_i\big)^2$$

• For an unseen data point $x$, the learned model predicts $f(x)$
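The following is a minimal sketch (not from the slides; the toy data is made up) of solving this least-squares problem with NumPy's least-squares solver:

```python
import numpy as np

# Toy 1-D data (made up for illustration): y is roughly 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Minimize (1/n) * sum_i (a*x_i + b - y_i)^2: stack [x, 1] so the model is A @ [a, b] ~ y.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"a = {a:.3f}, b = {b:.3f}")        # fitted slope and intercept
print("prediction at x = 5:", a * 5 + b)  # f(x) for an unseen x
```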

Binary Classification

• Input $(x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$

• We can think of the observations as points in $\mathbb{R}^m$ with an associated sign (either +/−, corresponding to 0/1)

• An example with $m = 2$

[Figure: scatter plot of + and − labeled points in the plane]



• What is a good hypothesis space for this problem?






[Figure: the + and − points arranged so that a line can separate them]

• In this case, we say that the observations are linearly separable

Linear Separators

• In $n$ dimensions, a hyperplane is the set of solutions to the equation $w^T x + b = 0$ with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$

• Hyperplanes divide $\mathbb{R}^n$ into two distinct sets of points (called halfspaces): $w^T x + b > 0$ and $w^T x + b < 0$
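A minimal sketch (the hyperplane values are arbitrary, chosen only for illustration) of testing which halfspace a point lies in:

```python
import numpy as np

# An arbitrary hyperplane w^T x + b = 0 in R^2 (values made up for illustration).
w = np.array([1.0, -2.0])
b = 0.5

def halfspace(x):
    """Return +1 if w^T x + b > 0 and -1 if it is < 0 (0 means x lies on the hyperplane)."""
    return int(np.sign(w @ x + b))

print(halfspace(np.array([3.0, 0.0])))   # +1: the positive halfspace
print(halfspace(np.array([0.0, 3.0])))   # -1: the negative halfspace
```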


The Linearly Separable Case

• Input $(x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$

• Hypothesis space: separating hyperplanes $f_{w,b}(x) = w^T x + b$

• How should we choose the loss function?



– Count the number of misclassifications

$$\text{loss} = \sum_i \left| y_i - \operatorname{sign}\!\big(f_{w,b}(x^{(i)})\big) \right|$$

• Tough to optimize: the count is piecewise constant, so it contains no gradient information
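A small sketch (toy numbers, not from the slides) of this count; note that $|y_i - \operatorname{sign}(f_{w,b}(x^{(i)}))|$ contributes 2 for each misclassified point and 0 otherwise:

```python
import numpy as np

# Toy labels and predicted scores f_{w,b}(x^(i)) (made up for illustration).
y = np.array([+1, -1, +1, +1, -1])
scores = np.array([0.7, 0.3, -1.2, 2.0, -0.1])

# |y_i - sign(f(x_i))| is 0 for a correct prediction and 2 for a mistake,
# so this loss is twice the number of misclassifications.
loss = np.abs(y - np.sign(scores)).sum()
print(loss)                     # 4.0  (two points are misclassified)
print(np.sum(y * scores <= 0))  # 2    (the same count, written directly)
```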



– Penalize each misclassification by the size of the violation

$$\text{perceptron loss} = \sum_i \max\!\left(0,\; -y_i f_{w,b}(x^{(i)})\right)$$

• Modified hinge loss (this loss is convex, but not differentiable)
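A small sketch (toy data, made up for illustration) of evaluating this loss:

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """Sum of max(0, -y_i * (w^T x_i + b)) over all training points."""
    margins = y * (X @ w + b)             # y_i * f_{w,b}(x^(i)) for every i
    return np.maximum(0.0, -margins).sum()

# Toy data in R^2.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
print(perceptron_loss(np.array([1.0, 0.0]), 0.0, X, y))  # 0.0: this w, b separates the data
```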

The Perceptron Algorithm

• Try to minimize the perceptron loss using (sub)gradient descent

$$\nabla_w(\text{perceptron loss}) = \sum_{i:\, -y_i f_{w,b}(x^{(i)}) \geq 0} -y_i x^{(i)}$$

$$\nabla_b(\text{perceptron loss}) = \sum_{i:\, -y_i f_{w,b}(x^{(i)}) \geq 0} -y_i$$
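A vectorized sketch (toy data, made up for illustration) of computing the two sums above:

```python
import numpy as np

def perceptron_subgradient(w, b, X, y):
    """Subgradients of the perceptron loss with respect to w and b."""
    margins = y * (X @ w + b)
    mistakes = margins <= 0                # points with -y_i * f_{w,b}(x^(i)) >= 0
    grad_w = -(y[mistakes, None] * X[mistakes]).sum(axis=0)
    grad_b = -y[mistakes].sum()
    return grad_w, grad_b

# Toy data in R^2.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
print(perceptron_subgradient(np.zeros(2), 0.0, X, y))  # grad_w = [-4., -2.], grad_b = -1.0
```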

Subgradients

• For a convex function $g(x)$, a subgradient at a point $x_0$ is any tangent line/plane through the point $x_0$ that underestimates the function everywhere

[Figure: a convex function $g(x)$ with a subgradient line drawn at $x_0$]


• If 0 is a subgradient at $x_0$, then $x_0$ is a global minimum
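A quick worked example (not from the slides): for $g(x) = |x|$, the subgradients at $x_0 = 0$ are exactly the slopes $c$ whose line through the origin underestimates $g$ everywhere,

$$\{\, c \in \mathbb{R} : |x| \geq c\,x \text{ for all } x \,\} = [-1, 1],$$

and since $0 \in [-1, 1]$, the point $x_0 = 0$ is indeed the global minimum of $|x|$.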

The Perceptron Algorithm

• Try to minimize the perceptron loss using (sub)gradient descent

$$w^{(t+1)} = w^{(t)} + \gamma_t \cdot \sum_{i:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \geq 0} y_i x^{(i)}$$

$$b^{(t+1)} = b^{(t)} + \gamma_t \cdot \sum_{i:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \geq 0} y_i$$

• With step size $\gamma_t$ (sometimes called the learning rate)
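A compact sketch (toy separable data and a constant step size, both illustrative choices) of this batch update rule:

```python
import numpy as np

# Toy linearly separable data in R^2 (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w, b = np.zeros(2), 0.0
gamma = 0.1                              # step size gamma_t, held constant here

for t in range(100):
    margins = y * (X @ w + b)
    mistakes = margins <= 0              # indices with -y_i * f(x^(i)) >= 0
    if not mistakes.any():               # every point classified correctly: done
        break
    # Batch (sub)gradient step from the update equations above.
    w = w + gamma * (y[mistakes, None] * X[mistakes]).sum(axis=0)
    b = b + gamma * y[mistakes].sum()

print(w, b, "separates the data:", bool(np.all(y * (X @ w + b) > 0)))
```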

Stochastic Gradient Descent

• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent

• Approximate the gradient of a sum by sampling a few indices (as few as one) uniformly at random and averaging

$$\nabla_x \sum_{i=1}^{n} g_i(x) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_x g_{i_k}(x)$$

here, each $i_k$ is sampled uniformly at random from $\{1, \ldots, n\}$

• Stochastic gradient descent converges under certain assumptions on the step size

• Setting $K = 1$, we can simply pick a random observation $i$ and perform the following update if the $i$-th data point is misclassified

$$w^{(t+1)} = w^{(t)} + \gamma_t y_i x^{(i)} \qquad b^{(t+1)} = b^{(t)} + \gamma_t y_i$$

and

$$w^{(t+1)} = w^{(t)} \qquad b^{(t+1)} = b^{(t)}$$

otherwise

• Sometimes, you will see the perceptron algorithm specified with $\gamma_t = 1$ for all $t$
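A minimal sketch of the resulting single-example perceptron with $\gamma_t = 1$ (toy data and a fixed iteration budget, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data in R^2 (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w, b = np.zeros(2), 0.0
for t in range(1000):                  # fixed budget of random single-example steps
    i = rng.integers(len(X))           # pick a random observation i
    if y[i] * (X[i] @ w + b) <= 0:     # the i-th point is misclassified (or on the boundary)
        w = w + y[i] * X[i]            # gamma_t = 1 for all t
        b = b + y[i]
    # otherwise: leave w and b unchanged

print(w, b, "training error:", np.mean(y * (X @ w + b) <= 0))
```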

Application of Perceptron

• Spam email classification

– Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.)

– Apply the perceptron algorithm to the resulting vectors

– To predict the label of an unseen email:

• Construct its vector representation, $x'$

• Check whether $w^T x' + b$ is positive or negative
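A toy sketch of this pipeline; the vocabulary, the helper name, and the trained $w$, $b$ are all hypothetical:

```python
import numpy as np

# Hypothetical vocabulary of counted words.
VOCAB = ["sir", "madam", "nigerian", "prince", "money"]

def email_to_counts(text):
    """Represent an email as a vector of counts of the vocabulary words."""
    words = text.lower().split()
    return np.array([words.count(v) for v in VOCAB], dtype=float)

# Pretend these were produced by running the perceptron on labeled emails
# (values made up for illustration; +1 = spam, -1 = not spam).
w = np.array([0.5, 0.5, 2.0, 2.0, 1.0])
b = -1.5

x_new = email_to_counts("Dear sir , I am a Nigerian prince with money for you")
print("spam" if w @ x_new + b > 0 else "not spam")
```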

Perceptron Learning

• Drawbacks:

– No convergence guarantees if the observations are not linearly separable

– Can overfit

• There are a number of perfect classifiers, but the perceptron algorithm doesn’t have any mechanism for choosing between them

[Figure: linearly separable + and − points admitting many different perfect separators]

What If the Data Isn't Separable?


• An example with $m = 2$

[Figure: + and − points interleaved so that no line separates them]

• What is a good hypothesis space for this problem?

Adding Features

• Perceptron algorithm only works for linearly separable data

[Figure: the non-separable + and − points again]

Can add features to make the data linearly separable over a larger space!

• The idea:

– Given the observations $x^{(1)}, \ldots, x^{(n)}$, construct a feature vector $\phi(x)$

– Use $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$ instead of $x^{(1)}, \ldots, x^{(n)}$ in the learning algorithm

– Goal is to choose $\phi$ so that $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$ are linearly separable

– Learn linear separators of the form $w^T \phi(x)$ (instead of $w^T x$)

• Warning: more expressive features can lead to overfitting

• Examples

– $\phi(x_1, x_2) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

• This is just the input data, without modification

– $\phi(x_1, x_2) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

• This corresponds to a second-degree polynomial separator, or equivalently, elliptical separators in the original space
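A small sketch (the helper name is hypothetical) of this quadratic feature map, checked against the circle shown in the figure below:

```python
import numpy as np

def phi(x1, x2):
    """Quadratic feature map: linear separators over phi are elliptical in the original space."""
    return np.array([1.0, x1, x2, x1**2, x2**2])

# The circle (x1 - 1)^2 + (x2 - 1)^2 - 1 <= 0 expands to
# 1 - 2*x1 - 2*x2 + x1^2 + x2^2 <= 0, i.e. w^T phi(x) <= 0 with:
w = np.array([1.0, -2.0, -2.0, 1.0, 1.0])

print(w @ phi(1.0, 1.0))   # -1.0: the center of the circle is inside the separator
print(w @ phi(3.0, 3.0))   #  7.0: this point is outside
```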

[Figure: an elliptical separator in the $(x_1, x_2)$ plane given by $(x_1 - 1)^2 + (x_2 - 1)^2 - 1 \leq 0$]

Support Vector Machines

• How can we decide between two perfect classifiers?

[Figure: linearly separable + and − points with two different separating hyperplanes drawn]

• What is the practical difference between these two solutions?