Binary Classification / Perceptron
Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Supervised Learning
• Input: (x_1, y_1), …, (x_n, y_n)
  – x_i is the i-th data item and y_i is the i-th label
• Goal: find a function f such that f(x_i) is a “good approximation” to y_i
  – Can use it to predict y values for previously unseen x values

Examples of Supervised Learning
• Spam email detection
• Handwritten digit recognition
• Stock market prediction
• More?

Supervised Learning
• Hypothesis space: set of allowable functions f : X → Y
• Goal: find the “best” element of the hypothesis space
– How do we measure the quality of f?

Regression
[Figure: scatter plot of (x, y) data points with a fitted line]

Hypothesis class: linear functions f(x) = ax + b
Squared loss function used to measure the error of the approximation

Linear Regression
• In typical regression applications, measure the fit using a squared loss function
      L(f, y_i) = (f(x_i) − y_i)²
• Want to minimize the average loss on the training data
• For linear regression, the learning problem is then
      min_{a,b} (1/n) Σ_i (a x_i + b − y_i)²
• For an unseen data point x, the learning algorithm predicts f(x)
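The minimization above has a well-known closed-form solution in one dimension. A minimal sketch in Python; the function name `fit_line` and the sample data are illustrative, not from the slides:

```python
# A minimal sketch of solving min_{a,b} (1/n) * sum_i (a*x_i + b - y_i)^2
# in closed form for 1-D inputs. Names and data are illustrative.

def fit_line(xs, ys):
    """Return (a, b) minimizing the average squared loss."""
    n = len(xs)
    mx = sum(xs) / n                           # mean of the inputs
    my = sum(ys) / n                           # mean of the labels
    # Standard least-squares solution: a = cov(x, y) / var(x), b = my - a*mx
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Data generated exactly by y = 2x + 1 is recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```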
Binary Classification

• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m with an associated sign (+/− corresponding to y_i = +1/−1)
• An example with m = 2

[Figure: + and − points in the plane; a single line separates the two classes]

What is a good hypothesis space for this problem?

In this case, we say that the observations are linearly separable

Linear Separators
• In n dimensions, a hyperplane is the solution set of the equation w^T x + b = 0, with w ∈ ℝ^n and b ∈ ℝ
• A hyperplane divides ℝ^n into two distinct sets of points (called halfspaces): w^T x + b > 0 and w^T x + b < 0
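A hyperplane classifier simply checks which halfspace a point falls in, i.e. the sign of w^T x + b. A small sketch; all names here are illustrative:

```python
# Sketch: a hyperplane w^T x + b = 0 splits R^n into two halfspaces;
# classification checks which side a point lies on. Names are illustrative.

def side(w, b, x):
    """+1 if w^T x + b > 0, -1 if < 0, 0 if x is exactly on the hyperplane."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (s > 0) - (s < 0)

# The line x1 + x2 - 1 = 0 in the plane: w = [1, 1], b = -1.
above = side([1.0, 1.0], -1.0, [2.0, 2.0])   # 2 + 2 - 1 = 3 > 0
below = side([1.0, 1.0], -1.0, [0.0, 0.0])   # 0 + 0 - 1 = -1 < 0
```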
The Linearly Separable Case
• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• Hypothesis space: separating hyperplanes
      f_{w,b}(x) = w^T x + b
• How should we choose the loss function?
– Count the number of misclassifications
      loss = Σ_i |y_i − sign(f_{w,b}(x^(i)))|
• Tough to optimize; the gradient contains no information

The Linearly Separable Case
– Penalize each misclassification by the size of the violation
      perceptron loss = Σ_i max(0, −y_i f_{w,b}(x^(i)))
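The perceptron loss can be computed directly from (w, b) and the data. A sketch; the function name is illustrative:

```python
# Sketch of the perceptron loss sum_i max(0, -y_i * (w^T x^(i) + b)).
# The function name is illustrative.

def perceptron_loss(w, b, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, -y * score)      # zero for correctly classified points
    return total

# With w = [1, 0], b = 0: the point (2, 0) labeled +1 and the point (-1, 0)
# labeled -1 are both classified correctly, so the loss is 0; flipping both
# labels gives violations of size 2 and 1.
zero = perceptron_loss([1.0, 0.0], 0.0, [[2.0, 0.0], [-1.0, 0.0]], [1, -1])
three = perceptron_loss([1.0, 0.0], 0.0, [[2.0, 0.0], [-1.0, 0.0]], [-1, 1])
```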
• A modified hinge loss (this loss is convex, but not differentiable)

The Perceptron Algorithm
• Try to minimize the perceptron loss using (sub)gradient descent
      ∇_w(perceptron loss) = Σ_{i : −y_i f_{w,b}(x^(i)) ≥ 0} −y_i x^(i)
      ∇_b(perceptron loss) = Σ_{i : −y_i f_{w,b}(x^(i)) ≥ 0} −y_i

Subgradients
• For a convex function g(x), a subgradient at a point x_0 is any line/plane through the point (x_0, g(x_0)) that underestimates the function everywhere

[Figure: convex function g(x) with a subgradient line touching it at x_0]
• If 0 is a subgradient at x_0, then x_0 is a global minimum

The Perceptron Algorithm
• Try to minimize the perceptron loss using (sub)gradient descent
      w^(t+1) = w^(t) + γ_t Σ_{i : −y_i f_{w^(t),b^(t)}(x^(i)) ≥ 0} y_i x^(i)
      b^(t+1) = b^(t) + γ_t Σ_{i : −y_i f_{w^(t),b^(t)}(x^(i)) ≥ 0} y_i
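One full-batch update as written above, sketched in Python with step size `gamma`; all names are illustrative:

```python
# One full-batch (sub)gradient step for the perceptron loss: sum y_i * x^(i)
# (and y_i for b) over every point with -y_i * f(x^(i)) >= 0, then move by
# step size gamma. Names are illustrative.

def batch_step(w, b, xs, ys, gamma):
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if -y * score >= 0:                # misclassified or on the boundary
            for j in range(len(w)):
                dw[j] += y * x[j]
            db += y
    w = [wi + gamma * dwi for wi, dwi in zip(w, dw)]
    return w, b + gamma * db

# Starting from w = 0, b = 0, every point has score 0 and triggers an update:
w, b = batch_step([0.0, 0.0], 0.0, [[1.0, 0.0], [-1.0, 0.0]], [1, -1], 1.0)
```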
• With step size γ_t (sometimes called the learning rate)

Stochastic Gradient Descent
• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent
• Approximate the gradient of a sum by sampling a few indices (as few as one) uniformly at random and averaging
      ∇_x Σ_{i=1}^n g_i(x) ≈ (1/K) Σ_{k=1}^K ∇_x g_{i_k}(x)
  where each i_k is sampled uniformly at random from {1, …, n}
• Stochastic gradient descent converges under certain assumptions on the step size
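A sketch of the sampling idea, using the toy functions g_i(x) = (x − c_i)² so the true gradients are known; everything here (names, data, seed) is illustrative:

```python
import random

# Sketch of the stochastic approximation: estimate the average gradient of
# sum_i g_i(x) from K uniformly sampled indices. Toy setup: g_i(x) =
# (x - c_i)^2, so grad g_i(x) = 2*(x - c_i). All names illustrative.

def stochastic_grad(x, cs, K, rng):
    total = 0.0
    for _ in range(K):
        i = rng.randrange(len(cs))         # i_k uniform from {0, ..., n-1}
        total += 2.0 * (x - cs[i])         # gradient of g_i at x
    return total / K                       # average over the K samples

rng = random.Random(0)                     # fixed seed for reproducibility
estimate = stochastic_grad(0.0, [1.0, 2.0, 3.0], 1000, rng)
# The exact average gradient at x = 0 is (-2 - 4 - 6) / 3 = -4.
```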
• Setting K = 1, we can simply pick a random observation i and perform the following update if the i-th data point is misclassified:
      w^(t+1) = w^(t) + γ_t y_i x^(i)
      b^(t+1) = b^(t) + γ_t y_i
  and w^(t+1) = w^(t), b^(t+1) = b^(t) otherwise
• Sometimes, you will see the perceptron algorithm specified with γ_t = 1 for all t
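Putting the pieces together with γ_t = 1: the classic perceptron training loop, as a sketch. The sweep order, epoch cap, and names are illustrative choices, not from the slides:

```python
# The perceptron algorithm with gamma_t = 1, as a sketch: sweep through the
# data, update on each misclassified point, stop after a clean sweep.
# Converges when the data are linearly separable. Names are illustrative.

def perceptron_train(xs, ys, epochs=100):
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:             # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                  # every point classified correctly
            break
    return w, b

# A linearly separable toy set: positives on the right, negatives on the left.
xs = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
ys = [1, 1, -1, -1]
w, b = perceptron_train(xs, ys)
```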
Application of Perceptron

• Spam email classification
– Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.)
– Apply the perceptron algorithm to the resulting vectors
– To predict the label of an unseen email:
• Construct its vector representation, x′
• Check whether w^T x′ + b is positive or negative
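A sketch of this pipeline; the vocabulary and the weights below are made up for illustration, not learned from real data:

```python
# Sketch of the spam pipeline: count-vector features over a fixed vocabulary,
# then a sign check against learned weights. The vocabulary and weights are
# hypothetical, chosen only for illustration.

VOCAB = ["sir", "madam", "nigerian", "prince", "money"]

def email_to_vector(text):
    """Count how often each vocabulary word occurs in the email."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def predict(w, b, x):
    """+1 (spam) if w^T x + b > 0, else -1 (not spam)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

x = email_to_vector("Dear sir I am a Nigerian prince with money")
label = predict([1.0, 1.0, 2.0, 2.0, 1.0], -2.0, x)   # hypothetical weights
```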
Perceptron Learning

• Drawbacks:
– No convergence guarantees if the observations are not linearly separable
– Can overfit
• There are a number of perfect classifiers, but the perceptron algorithm doesn’t have any mechanism for choosing between them
[Figure: linearly separable + and − points with one of many possible separating lines]

What If the Data Isn’t Separable?
• Input: (x^(1), y_1), …, (x^(n), y_n) with x^(i) ∈ ℝ^m and y_i ∈ {−1, +1}
• We can think of the observations as points in ℝ^m with an associated sign (+/− corresponding to y_i = +1/−1)
• An example with m = 2

[Figure: + and − points in the plane that no single line can separate]

What is a good hypothesis space for this problem?

Adding Features
• The perceptron algorithm only works for linearly separable data

[Figure: the same + and − points, which no line can separate]

Can add features to make the data linearly separable over a larger space!

Adding Features
• The idea:
– Given the observations x^(1), …, x^(n), construct a feature vector φ(x)
– Use φ(x^(1)), …, φ(x^(n)) instead of x^(1), …, x^(n) in the learning algorithm
– Goal is to choose φ so that φ(x^(1)), …, φ(x^(n)) are linearly separable
– Learn linear separators of the form w^T φ(x) (instead of w^T x)
• Warning: more expressive features can lead to overfitting

Adding Features
• Examples
  – φ(x_1, x_2) = (x_1, x_2)
    • This is just the input data, without modification
  – φ(x_1, x_2) = (1, x_1, x_2, x_1², x_2²)
    • This corresponds to a second-degree polynomial separator, or equivalently, elliptical separators in the original space
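A sketch of the second feature map. The weight vector below is hand-picked for illustration: it encodes one particular circle in the original space:

```python
# Sketch of the quadratic feature map phi(x1, x2) = (1, x1, x2, x1^2, x2^2).
# The weight vector below is a hand-picked illustration: it encodes the
# circle (x1 - 1)^2 + (x2 - 1)^2 - 1, since expanding that expression gives
# 1 - 2*x1 - 2*x2 + x1^2 + x2^2.

def phi(x1, x2):
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def separator(w, x1, x2):
    """Linear in feature space: w^T phi(x); nonlinear in the original space."""
    return sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))

w = [1.0, -2.0, -2.0, 1.0, 1.0]
inside = separator(w, 1.0, 1.0)    # the circle's center: negative value
outside = separator(w, 3.0, 1.0)   # a point outside the circle: positive value
```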
Adding Features

[Figure: elliptical separator in the (x_1, x_2) plane given by (x_1 − 1)² + (x_2 − 1)² − 1 ≤ 0]

Support Vector Machines
• How can we decide between two perfect classifiers?
[Figure: linearly separable + and − points with two different separating lines]
• What is the practical difference between these two solutions?