Binary Classification / Perceptron
Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Supervised Learning
• Input: (x_1, y_1), …, (x_n, y_n)
  – x_i is the i-th data item and y_i is the i-th label
• Goal: find a function f such that f(x_i) is a “good approximation” to y_i
  – Can use it to predict y values for previously unseen x values

Examples of Supervised Learning
• Spam email detection
• Handwritten digit recognition
• Stock market prediction
• More?

Supervised Learning
• Hypothesis space: set of allowable functions f : X → Y
• Goal: find the “best” element of the hypothesis space
– How do we measure the quality of f?

Regression
[Figure: scatter plot of (x, y) data points with a fitted line]

Hypothesis class: linear functions f(x) = ax + b
Squared loss function used to measure the error of the approximation

Linear Regression
• In typical regression applications, measure the fit using a squared loss function
      L(f, y_i) = (f(x_i) − y_i)²
• Want to minimize the average loss on the training data
• For linear regression, the learning problem is then
      min_{a,b} (1/n) Σ_i (a x_i + b − y_i)²
• For an unseen data point x, the learning algorithm predicts f(x)
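The minimization above has a well-known closed-form solution in one dimension. A minimal sketch in Python; the function name `fit_line` and the sample data are illustrative, not from the slides:

```python
# A minimal sketch of solving min_{a,b} (1/n) * sum_i (a*x_i + b - y_i)^2
# in closed form for 1-D inputs. Names and data are illustrative.

def fit_line(xs, ys):
    """Return (a, b) minimizing the average squared loss."""
    n = len(xs)
    mx = sum(xs) / n                           # mean of the inputs
    my = sum(ys) / n                           # mean of the labels
    # Standard least-squares solution: a = cov(x, y) / var(x), b = my - a*mx
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Data generated exactly by y = 2x + 1 is recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```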
Binary Classification

• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m with an associated sign (+/− corresponding to y_i = +1/−1)
• An example with m = 2

[Figure: + and − points in the plane; a single line separates the two classes]

What is a good hypothesis space for this problem?

In this case, we say that the observations are linearly separable

Linear Separators
• In n dimensions, a hyperplane is the solution set of the equation w^T x + b = 0, with w ∈ ℝ^n and b ∈ ℝ
• A hyperplane divides ℝ^n into two distinct sets of points (called halfspaces): w^T x + b > 0 and w^T x + b < 0
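A hyperplane classifier simply checks which halfspace a point falls in, i.e. the sign of w^T x + b. A small sketch; all names here are illustrative:

```python
# Sketch: a hyperplane w^T x + b = 0 splits R^n into two halfspaces;
# classification checks which side a point lies on. Names are illustrative.

def side(w, b, x):
    """+1 if w^T x + b > 0, -1 if < 0, 0 if x is exactly on the hyperplane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (s > 0) - (s < 0)

# The line x1 + x2 - 1 = 0 in the plane: w = [1, 1], b = -1.
above = side([1.0, 1.0], -1.0, [2.0, 2.0])   # 2 + 2 - 1 = 3 > 0
below = side([1.0, 1.0], -1.0, [0.0, 0.0])   # 0 + 0 - 1 = -1 < 0
```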
The Linearly Separable Case
• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• Hypothesis space: separating hyperplanes
      f_{w,b}(x) = w^T x + b
• How should we choose the loss function?
– Count the number of misclassifications
      loss = Σ_i |y_i − sign(f_{w,b}(x^(i)))|
• Tough to optimize; the gradient contains no information

The Linearly Separable Case
– Penalize each misclassification by the size of the violation
      perceptron loss = Σ_i max(0, −y_i f_{w,b}(x^(i)))
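The perceptron loss can be computed directly from (w, b) and the data. A sketch; the function name is illustrative:

```python
# Sketch of the perceptron loss sum_i max(0, -y_i * (w^T x^(i) + b)).
# The function name is illustrative.

def perceptron_loss(w, b, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, -y * score)      # zero for correctly classified points
    return total

# With w = [1, 0], b = 0: the point (2, 0) labeled +1 and the point (-1, 0)
# labeled -1 are both classified correctly, so the loss is 0; flipping both
# labels gives violations of size 2 and 1.
zero = perceptron_loss([1.0, 0.0], 0.0, [[2.0, 0.0], [-1.0, 0.0]], [1, -1])
three = perceptron_loss([1.0, 0.0], 0.0, [[2.0, 0.0], [-1.0, 0.0]], [-1, 1])
```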
• A modified hinge loss (this loss is convex, but not differentiable)

The Perceptron Algorithm
• Try to minimize the perceptron loss using (sub)gradient descent
      ∇_w(perceptron loss) = Σ_{i : −y_i f_{w,b}(x^(i)) ≥ 0} −y_i x^(i)
      ∇_b(perceptron loss) = Σ_{i : −y_i f_{w,b}(x^(i)) ≥ 0} −y_i

Subgradients
• For a convex function g(x), a subgradient at a point x_0 is any line/plane through the point (x_0, g(x_0)) that underestimates the function everywhere

[Figure: convex function g(x) with a subgradient line touching it at x_0]
• If 0 is a subgradient at x_0, then x_0 is a global minimum

The Perceptron Algorithm
• Try to minimize the perceptron loss using (sub)gradient descent
      w^(t+1) = w^(t) + γ_t Σ_{i : −y_i f_{w^(t),b^(t)}(x^(i)) ≥ 0} y_i x^(i)
      b^(t+1) = b^(t) + γ_t Σ_{i : −y_i f_{w^(t),b^(t)}(x^(i)) ≥ 0} y_i
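One full-batch update as written above, sketched in Python with step size `gamma`; all names are illustrative:

```python
# One full-batch (sub)gradient step for the perceptron loss: sum y_i * x^(i)
# (and y_i for b) over every point with -y_i * f(x^(i)) >= 0, then move by
# step size gamma. Names are illustrative.

def batch_step(w, b, xs, ys, gamma):
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if -y * score >= 0:                # misclassified or on the boundary
            for j in range(len(w)):
                dw[j] += y * x[j]
            db += y
    w = [wi + gamma * dwi for wi, dwi in zip(w, dw)]
    return w, b + gamma * db

# Starting from w = 0, b = 0, every point has score 0 and triggers an update:
w, b = batch_step([0.0, 0.0], 0.0, [[1.0, 0.0], [-1.0, 0.0]], [1, -1], 1.0)
```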
• With step size γ_t (sometimes called the learning rate)

Stochastic Gradient Descent
• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent
• Approximate the gradient of a sum by sampling a few indices (as few as one) uniformly at random and averaging
      ∇_x Σ_{i=1}^n g_i(x) ≈ (1/K) Σ_{k=1}^K ∇_x g_{i_k}(x)
  where each i_k is sampled uniformly at random from {1, …, n}
• Stochastic gradient descent converges under certain assumptions on the step size
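A sketch of the sampling idea, using the toy functions g_i(x) = (x − c_i)² so the true gradients are known; everything here (names, data, seed) is illustrative:

```python
import random

# Sketch of the stochastic approximation: estimate the average gradient of
# sum_i g_i(x) from K uniformly sampled indices. Toy setup: g_i(x) =
# (x - c_i)^2, so grad g_i(x) = 2*(x - c_i). All names illustrative.

def stochastic_grad(x, cs, K, rng):
    total = 0.0
    for _ in range(K):
        i = rng.randrange(len(cs))         # i_k uniform from {0, ..., n-1}
        total += 2.0 * (x - cs[i])         # gradient of g_i at x
    return total / K                       # average over the K samples

rng = random.Random(0)                     # fixed seed for reproducibility
estimate = stochastic_grad(0.0, [1.0, 2.0, 3.0], 1000, rng)
# The exact average gradient at x = 0 is (-2 - 4 - 6) / 3 = -4.
```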
• Setting K = 1, we can simply pick a random observation i and perform the following update if the i-th data point is misclassified:
      w^(t+1) = w^(t) + γ_t y_i x^(i)
      b^(t+1) = b^(t) + γ_t y_i
  and w^(t+1) = w^(t), b^(t+1) = b^(t) otherwise
• Sometimes, you will see the perceptron algorithm specified with γ_t = 1 for all t
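Putting the pieces together with γ_t = 1: the classic perceptron training loop, as a sketch. The sweep order, epoch cap, and names are illustrative choices, not from the slides:

```python
# The perceptron algorithm with gamma_t = 1, as a sketch: sweep through the
# data, update on each misclassified point, stop after a clean sweep.
# Converges when the data are linearly separable. Names are illustrative.

def perceptron_train(xs, ys, epochs=100):
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:             # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                  # every point classified correctly
            break
    return w, b

# A linearly separable toy set: positives on the right, negatives on the left.
xs = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
ys = [1, 1, -1, -1]
w, b = perceptron_train(xs, ys)
```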
Application of Perceptron

• Spam email classification
– Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.)
– Apply the perceptron algorithm to the resulting vectors
– To predict the label of an unseen email:
• Construct its vector representation, x′
• Check whether w^T x′ + b is positive or negative
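A sketch of this pipeline; the vocabulary and the weights below are made up for illustration, not learned from real data:

```python
# Sketch of the spam pipeline: count-vector features over a fixed vocabulary,
# then a sign check against learned weights. The vocabulary and weights are
# hypothetical, chosen only for illustration.

VOCAB = ["sir", "madam", "nigerian", "prince", "money"]

def email_to_vector(text):
    """Count how often each vocabulary word occurs in the email."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def predict(w, b, x):
    """+1 (spam) if w^T x + b > 0, else -1 (not spam)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

x = email_to_vector("Dear sir I am a Nigerian prince with money")
label = predict([1.0, 1.0, 2.0, 2.0, 1.0], -2.0, x)   # hypothetical weights
```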
Perceptron Learning

• Drawbacks:
– No convergence guarantees if the observations are not linearly separable
– Can overfit
• There are a number of perfect classifiers, but the perceptron algorithm doesn’t have any mechanism for choosing between them
[Figure: linearly separable + and − points with one of many possible separating lines]

What If the Data Isn’t Separable?
• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m with an associated sign (+/− corresponding to y_i = +1/−1)
• An example with m = 2

[Figure: + and − points in the plane that no single line can separate]

What is a good hypothesis space for this problem?

Adding Features
• The perceptron algorithm only works for linearly separable data

[Figure: the same + and − points, which no line can separate]

Can add features to make the data linearly separable over a larger space!

Adding Features
• The idea:
– Given the observations x^(1), …, x^(n), construct a feature vector φ(x)
– Use φ(x^(1)), …, φ(x^(n)) instead of x^(1), …, x^(n) in the learning algorithm
– Goal is to choose φ so that φ(x^(1)), …, φ(x^(n)) are linearly separable
– Learn linear separators of the form w^T φ(x) (instead of w^T x)
• Warning: more expressive features can lead to overfitting

Adding Features
• Examples
  – φ(x_1, x_2) = (x_1, x_2)
    • This is just the input data, without modification
  – φ(x_1, x_2) = (1, x_1, x_2, x_1², x_2²)
    • This corresponds to a second-degree polynomial separator, or equivalently, elliptical separators in the original space
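A sketch of the second feature map. The weight vector below is hand-picked for illustration: it encodes one particular circle in the original space:

```python
# Sketch of the quadratic feature map phi(x1, x2) = (1, x1, x2, x1^2, x2^2).
# The weight vector below is a hand-picked illustration: it encodes the
# circle (x1 - 1)^2 + (x2 - 1)^2 - 1, since expanding that expression gives
# 1 - 2*x1 - 2*x2 + x1^2 + x2^2.

def phi(x1, x2):
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def separator(w, x1, x2):
    """Linear in feature space: w^T phi(x); nonlinear in the original space."""
    return sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))

w = [1.0, -2.0, -2.0, 1.0, 1.0]
inside = separator(w, 1.0, 1.0)    # the circle's center: negative value
outside = separator(w, 3.0, 1.0)   # a point outside the circle: positive value
```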
Adding Features

[Figure: elliptical separator in the (x_1, x_2) plane given by (x_1 − 1)² + (x_2 − 1)² − 1 ≤ 0]

Support Vector Machines
• How can we decide between two perfect classifiers?
[Figure: linearly separable + and − points with two different separating lines]
• What is the practical difference between these two solutions?