Binary Classification / Perceptron

Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Supervised Learning

• Input: $(x_1, y_1), \dots, (x_n, y_n)$
  – $x_i$ is the $i$th data item and $y_i$ is the $i$th label
• Goal: find a function $f$ such that $f(x)$ is a "good approximation" to $y$
  – Can use it to predict $y$ values for previously unseen $x$ values

Examples of Supervised Learning

• Spam email detection
• Handwritten digit recognition
• Stock market prediction
• More?

Supervised Learning

• Hypothesis space: set of allowable functions $f : X \to Y$
• Goal: find the "best" element of the hypothesis space
  – How do we measure the quality of $f$?

Regression

(Figure: data points in the $(x, y)$ plane with a fitted line)

• Hypothesis class: linear functions $f(x) = ax + b$
• Squared loss function used to measure the error of the approximation

Linear Regression

• In typical regression applications, we measure the fit using a squared loss function
  $L(f, y) = (f(x) - y)^2$
• Want to minimize the average loss on the training data
• For linear regression, the learning problem is then
  $\min_{a,b} \; \frac{1}{n} \sum_{i=1}^{n} (a x_i + b - y_i)^2$
• For an unseen data point $x$, the learning algorithm predicts $f(x)$

Binary Classification

• Input: $(x^{(1)}, y_1), \dots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$
• We can think of the observations as points in $\mathbb{R}^m$ with an associated sign (either +/−, corresponding to 0/1)
• An example with $m = 2$ (Figure: positively and negatively labeled points in the plane, separable by a line)
  – What is a good hypothesis space for this problem?
  – In this case, we say that the observations are linearly separable

Linear Separators

• In $n$ dimensions, a hyperplane is a solution to the equation $w^T x + b = 0$ with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$
• Hyperplanes divide $\mathbb{R}^n$ into two distinct sets of points (called halfspaces): $w^T x + b > 0$ and $w^T x + b < 0$

The Linearly Separable Case

• Input: $(x^{(1)}, y_1), \dots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$
• Hypothesis space: separating hyperplanes $f_{w,b}(x) = w^T x + b$
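The slides contain no code, but a small sketch may help make this hypothesis space concrete. The following Python snippet (function names and toy data are illustrative assumptions, not from the slides) implements the decision rule $\mathrm{sign}(w^T x + b)$ and checks whether a given hyperplane separates a labeled dataset:

```python
import numpy as np

def predict(w, b, x):
    """Predict a +/-1 label from the halfspace that x falls in."""
    return 1 if w @ x + b > 0 else -1

def separates(w, b, X, y):
    """Check whether the hyperplane w^T x + b = 0 puts every labeled point
    on its correct side, i.e. y_i * (w^T x_i + b) > 0 for all i."""
    return all(yi * (w @ xi + b) > 0 for xi, yi in zip(X, y))

# Toy data in R^2: points above the line x1 + x2 = 1 are labeled +1.
X = np.array([[2.0, 1.0], [1.5, 1.0], [0.0, 0.2], [0.3, 0.1]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0
print([predict(w, b, xi) for xi in X])   # [1, 1, -1, -1]
print(separates(w, b, X, y))             # True
```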
• How should we choose the loss function?
  – Count the number of misclassifications:
    $\ell_{0/1} = \sum_i \left| y_i - \mathrm{sign}\left(f_{w,b}(x^{(i)})\right) \right|$
    • Tough to optimize; the gradient contains no information
  – Penalize each misclassification by the size of the violation:
    $\ell_{\text{perceptron}} = \sum_i \max\left(0, \, -y_i f_{w,b}(x^{(i)})\right)$
    • A modified hinge loss (this loss is convex, but not differentiable)

The Perceptron Algorithm

• Try to minimize the perceptron loss using (sub)gradient descent
  $\nabla_w(\text{perceptron loss}) = -\sum_{i \,:\, -y_i f_{w,b}(x^{(i)}) \ge 0} y_i x^{(i)}$
  $\nabla_b(\text{perceptron loss}) = -\sum_{i \,:\, -y_i f_{w,b}(x^{(i)}) \ge 0} y_i$

Subgradients

• For a convex function $g(x)$, a subgradient at a point $x_0$ is any tangent line/plane through the point $x_0$ that underestimates the function everywhere
  (Figure: a convex function $g(x)$ with several subgradient lines drawn at $x_0$)
• If 0 is a subgradient at $x_0$, then $x_0$ is a global minimum

The Perceptron Algorithm

• Try to minimize the perceptron loss using (sub)gradient descent
  $w^{(t+1)} = w^{(t)} + \gamma_t \sum_{i \,:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \ge 0} y_i x^{(i)}$
  $b^{(t+1)} = b^{(t)} + \gamma_t \sum_{i \,:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \ge 0} y_i$
• With step size $\gamma_t$ (sometimes called the learning rate)

Stochastic Gradient Descent

• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent
• Approximate the gradient of a sum by sampling a few indices (as few as one) uniformly at random and averaging
  $\nabla_x \frac{1}{n} \sum_{i=1}^{n} g_i(x) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_x g_{i_k}(x)$
  here, each $i_k$ is sampled uniformly at random from $\{1, \dots, n\}$
• Stochastic gradient descent converges under certain assumptions on the step size

Stochastic Gradient Descent

• Setting $K = 1$, we can simply pick a random observation $i$ and perform the following update if the $i$th data point is misclassified
  $w^{(t+1)} = w^{(t)} + \gamma_t y_i x^{(i)}$
  $b^{(t+1)} = b^{(t)} + \gamma_t y_i$
  and $w^{(t+1)} = w^{(t)}$, $b^{(t+1)} = b^{(t)}$ otherwise
• Sometimes, you will see the perceptron algorithm specified with $\gamma_t = 1$ for all $t$

Application of Perceptron

• Spam email classification
  – Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.)
  – Apply the perceptron algorithm to the resulting vectors
  – To predict the label of an unseen email:
    • Construct its vector representation, $x'$
    • Check whether $w^T x' + b$ is positive or negative

Perceptron Learning

• Drawbacks:
  – No convergence guarantees if the observations are not linearly separable
  – Can overfit
    • There are a number of perfect classifiers, but the perceptron algorithm doesn't have any mechanism for choosing between them
      (Figure: linearly separable data admitting several different separating lines)
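Putting the pieces above together, here is a minimal Python sketch of the $K = 1$ stochastic perceptron update with $\gamma_t$ fixed. It is one possible implementation; the function names and toy data are illustrative assumptions rather than taken from the slides.

```python
import numpy as np

def perceptron(X, y, epochs=100, rate=1.0, seed=0):
    """Stochastic (K = 1) perceptron: pick a random example and, if it is
    misclassified (y_i * f_{w,b}(x_i) <= 0), step in the direction y_i * x_i."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs * n):
        i = rng.integers(n)                 # sample i uniformly from {0, ..., n-1}
        if y[i] * (w @ X[i] + b) <= 0:      # misclassified (or on the boundary)
            w += rate * y[i] * X[i]         # w <- w + gamma * y_i * x_i
            b += rate * y[i]                # b <- b + gamma * y_i
    return w, b

# Linearly separable toy data.
X = np.array([[2.0, 2.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # expected to match y on separable data
```

Note that, exactly as on the stochastic gradient descent slide, the parameters only change when the sampled example is misclassified.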
What If the Data Isn't Separable?

• Input: $(x^{(1)}, y_1), \dots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$
• We can think of the observations as points in $\mathbb{R}^m$ with an associated sign (either +/−, corresponding to 0/1)
• An example with $m = 2$ (Figure: labeled points in the plane that no single line separates)
  – What is a good hypothesis space for this problem?

Adding Features

• The perceptron algorithm only works for linearly separable data
• Can add features to make the data linearly separable over a larger space!

Adding Features

• The idea:
  – Given the observations $x^{(1)}, \dots, x^{(n)}$, construct a feature vector $\phi(x)$
  – Use $\phi(x^{(1)}), \dots, \phi(x^{(n)})$ instead of $x^{(1)}, \dots, x^{(n)}$ in the learning algorithm
  – Goal is to choose $\phi$ so that $\phi(x^{(1)}), \dots, \phi(x^{(n)})$ are linearly separable
  – Learn linear separators of the form $w^T \phi(x)$ (instead of $w^T x$)
• Warning: more expressive features can lead to overfitting

Adding Features

• Examples
  – $\phi(x_1, x_2) = (x_1, x_2)^T$
    • This is just the input data, without modification
  – $\phi(x_1, x_2) = (1, x_1, x_2, x_1^2, x_2^2)^T$
    • This corresponds to a second-degree polynomial separator or, equivalently, elliptical separators in the original space

Adding Features

(Figure: the circular separator $(x_1 - 1)^2 + (x_2 - 1)^2 - 1 \le 0$ in the $(x_1, x_2)$ plane)

Support Vector Machines

• How can we decide between two perfect classifiers?
  (Figure: the same linearly separable data with two different separating lines)
• What is the practical difference between these two solutions?
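Before turning to support vector machines, a short Python sketch of the "Adding Features" idea above. The quadratic feature map matches the example on the slides, while the toy labels and the weight vector are illustrative assumptions of my own: a circular boundary in the original space becomes an ordinary linear separator in feature space.

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the 'Adding Features' slides:
    (x1, x2) -> (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2])

# Points inside the circle (x1-1)^2 + (x2-1)^2 <= 1 are labeled -1, outside +1.
X = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.0], [-1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# No line separates a disc from its complement, but in feature space the circle
# is the hyperplane w^T phi(x) = 0, since
# (x1-1)^2 + (x2-1)^2 - 1 = 1 - 2*x1 - 2*x2 + x1^2 + x2^2.
w = np.array([1.0, -2.0, -2.0, 1.0, 1.0])
print([int(np.sign(w @ phi(x))) for x in X])   # [-1, -1, 1, 1], matching y
```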
