CS446 Introduction to Machine Learning (Spring 2015) University of Illinois at Urbana-Champaign http://courses.engr.illinois.edu/cs446

LECTURE 4: LINEAR CLASSIFIERS

Prof. Julia Hockenmaier [email protected]

Announcements

Homework 1 will be out after class: http://courses.engr.illinois.edu/cs446/Homework/HW1.pdf
You have two weeks to complete the assignment.
Late policy:
– You have up to two days of late credit for the whole semester, but we don't give partial late credit: being up to 24 hours late on one assignment uses one of your two late credit days.
– We don't accept assignments that are more than 48 hours late.

Is everybody on Compass??? https://compass2g.illinois.edu/ Let us know if you can’t see our class.

Last lecture's key concepts

Decision trees for (binary) classification: non-linear classifiers

Learning decision trees (ID3 algorithm)
– Greedy heuristic (based on information gain)
– Originally developed for discrete features

Overfitting: What is it? How do we deal with it?

Today's key concepts

Learning linear classifiers

Batch algorithms:
– Gradient descent for least mean squares (LMS)
Online algorithms:
– Stochastic gradient descent

Linear classifiers

Linear classifiers: f(x) = w0 + w·x
[Figure: a separating line f(x) = 0 in the (x1, x2) plane, with the regions f(x) > 0 and f(x) < 0 on either side]

Linear classifiers are defined over vector spaces. Every hypothesis f(x) is a hyperplane:

f(x) = w0 + w·x
The set of points where f(x) = 0 is also called the decision boundary.
– Assign ŷ = +1 to all x where f(x) > 0
– Assign ŷ = −1 to all x where f(x) < 0
That is, ŷ = sgn(f(x)).
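To make the decision rule concrete, here is a minimal Python/NumPy sketch (not from the slides); the weights w0, w and the example x are made-up values:

import numpy as np

def predict(x, w0, w):
    """Return the label sgn(f(x)) for f(x) = w0 + w.x."""
    f = w0 + np.dot(w, x)
    return 1 if f > 0 else -1

# Hypothetical example: w0 = -1, w = (1, 2)
print(predict(np.array([0.5, 1.0]), -1.0, np.array([1.0, 2.0])))  # f = 1.5 > 0 -> +1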

Hypothesis space for linear classifiers

[Figure: truth tables for the 16 Boolean functions of two variables x1, x2 ∈ {0, 1}, i.e. the candidate labelings of the four points (0,0), (0,1), (1,0), (1,1)]

Canonical representation

With w = (w1, …, wN)^T and x = (x1, …, xN)^T:

f(x) = w0 + w·x = w0 + ∑_{i=1…N} wi·xi

w0 is called the bias term. The canonical representation redefines w and x as w = (w0, w1, …, wN)^T and x = (1, x1, …, xN)^T, so that f(x) = w·x.
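A small sketch of this trick (illustrative only): prepend a constant 1 to x and the bias w0 to w, and the classifier reduces to a single dot product:

import numpy as np

def augment(x):
    """Prepend the constant feature 1, so that f(x) = w.x with w = (w0, w1, ..., wN)."""
    return np.concatenate(([1.0], x))

w = np.array([-1.0, 1.0, 2.0])      # (w0, w1, w2)
x = augment(np.array([0.5, 1.0]))   # (1, x1, x2)
f = np.dot(w, x)                    # same value as w0 + w1*x1 + w2*x2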

Learning a linear classifier

[Figure: labeled training data in the sample space X = R², with points for yi = +1 and yi = −1 on either side of the line f(x) = 0]

Input: Labeled training data D = {(x1, y1), …, (xD, yD)}, plotted in the sample space X = R²
Output: A decision boundary f(x) = 0 that separates the training data: yi·f(xi) > 0 for all i

Which model should we pick?

We need a metric (aka an objective function). We would like to minimize the probability of misclassifying unseen examples, but we can't measure that probability. Instead: minimize the number of misclassified training examples.

Which model should we pick?

We need a more specific metric: There may be many models that are consistent with the training data.

Loss functions provide such metrics.

Loss functions for classification

y·f(x) > 0: Correct classification
[Figure: the regions f(x) > 0 and f(x) < 0 on either side of the decision boundary f(x) = 0 in the (x1, x2) plane]

An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
– Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
– Case 2 (y = −1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
– Case 3 (y = +1 ≠ ŷ = −1): f(x) < 0 ⇒ y·f(x) < 0
– Case 4 (y = −1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
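A minimal sketch of this check; the helper name is_correct is hypothetical, and f(x) is assumed to be a real-valued score:

def is_correct(y, fx):
    """An example (x, y) is correctly classified iff y * f(x) > 0."""
    return y * fx > 0

# Case 2 from the slide: y = -1 and f(x) < 0  ->  correctly classified
print(is_correct(-1, -0.7))   # True
# Case 4: y = -1 but f(x) > 0  ->  misclassified
print(is_correct(-1, 0.3))    # False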

Loss functions for classification

Loss = What penalty do we incur if we misclassify x?
L(y, f(x)) is the loss (aka cost) of classifier f on example x when the true label of x is y. We assign label ŷ = sgn(f(x)) to x.
Plots of L(y, f(x)): the x-axis is typically y·f(x).
Today: 0-1 loss and square loss (more loss functions later).

0-1 Loss
[Plot: 0-1 loss as a function of y·f(x)]

0-1 Loss
[Plot: 0-1 loss as a function of y·f(x)]
L(y, f(x)) = 0 iff y = ŷ, and = 1 iff y ≠ ŷ
Equivalently, as a function of y·f(x): L(y·f(x)) = 0 iff y·f(x) > 0 (correctly classified), and = 1 iff y·f(x) < 0 (misclassified)
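A direct Python translation of this definition (a sketch; the function name is ours):

def zero_one_loss(y, fx):
    """0-1 loss: 0 if y*f(x) > 0 (correct), 1 if y*f(x) < 0 (misclassified)."""
    return 0.0 if y * fx > 0 else 1.0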

Square Loss: (y − f(x))²

[Plots: square loss as a function of f(x) (one curve for y = +1, one for y = −1) and as a function of y·f(x)]
L(y, f(x)) = (y − f(x))²

Note: L(−1, f(x)) = (−1 − f(x))² = (1 + f(x))² = L(1, −f(x)) (the loss for y = −1 is the mirror image of the loss for y = +1)
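The square loss in the same style, with a quick numerical check of the mirror property noted above (illustrative sketch):

def square_loss(y, fx):
    """Square loss: (y - f(x))^2."""
    return (y - fx) ** 2

fx = 0.4
print(square_loss(-1, fx), square_loss(1, -fx))   # equal values: L(-1, f(x)) = L(1, -f(x))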

The square loss is a convex upper bound on 0-1 Loss

[Plot: 0-1 loss and square loss as functions of y·f(x)]

Loss surface

Linear classification: the hypothesis space is parameterized by w.
In plain English: each w yields a different classifier.

Error/Loss/Risk are all functions of w

The loss surface
Learning = finding the global minimum of the loss surface.
[Figure: (empirical) error plotted over the hypothesis space, with the global minimum marked]

The loss surface
Finding the global minimum is in general hard.
[Figure: a non-convex loss surface over the hypothesis space, with a plateau, a local minimum, and the global minimum marked; y-axis: (empirical) error]

Convex loss surfaces
Convex functions have no local minima.
[Figure: a convex loss surface over the hypothesis space with a single global minimum; y-axis: (empirical) error]

The risk of a classifier R(f)

The risk (aka generalization error) of a classifier f(x) = w·x is its expected loss (i.e. the loss averaged over examples drawn from P(x, y)):

R(f) = ∫ L(y, f(x)) P(x, y) dx dy

Ideal learning objective: Find an f that minimizes risk

Aside: The i.i.d. assumption

We always assume that training and test items are independently and identically distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)} is generated. (Sometimes it's useful to rewrite P(X, Y) as P(X)P(Y|X).) Usually P(X, Y) is unknown to us; we just know it exists.
– Training and test data are samples drawn from the same P(X, Y): they are identically distributed.
– Each (x, y) is drawn independently from P(X, Y).

The empirical risk of f(x)

The empirical risk of a classifier f(x) = w·x on a data set D = {(x1, y1), …, (xD, yD)} is its average loss on the items in D:

R_D(f) = (1/D) ∑_{i=1…D} L(yi, f(xi))

Realistic learning objective: Find an f that minimizes empirical risk. (Note that the learner can ignore the constant 1/D.)
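A sketch of the empirical risk in NumPy, assuming the data set is given as a matrix X (one augmented example per row), a label vector y, and a per-example loss function:

import numpy as np

def empirical_risk(w, X, y, loss):
    """Average loss of f(x) = w.x over the items in D = {(x_i, y_i)}."""
    scores = X @ w                       # f(x_i) for every row x_i of X
    return np.mean([loss(y_i, f_i) for y_i, f_i in zip(y, scores)])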

Empirical risk minimization

Learning: Given training data D = {(x1, y1), …, (xD, yD)}, return the classifier f(x) that minimizes the empirical risk R_D(f)

Batch learning: Gradient Descent for Least Mean Squares (LMS)

Gradient Descent

Iterative batch learning algorithm: – Learner updates the hypothesis based on the entire training data – Learner has to go multiple times over the training data

Goal: Minimize training error/loss – At each step: move w in the direction of steepest descent along the error/loss surface

Gradient Descent

Error(w): error of w on the training data; w^i: weight vector at iteration i
[Figure: Error(w) plotted against w, with successive iterates w^1, w^2, w^3, w^4 moving downhill toward the minimum]

Least Mean Square Error

Err(w) = (1/2) ∑_{d∈D} (yd − ŷd)²

LMS Error: sum of the square loss over all training items (multiplied by 0.5 for convenience). D is fixed, so there is no need to divide by its size.

Goal of learning: Find w* = argmin(Err(w))
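The same objective in NumPy (a sketch; as above, X holds one augmented example per row and ŷd = w·xd):

import numpy as np

def lms_error(w, X, y):
    """Err(w) = 1/2 * sum_d (y_d - w.x_d)^2 over the training set."""
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2)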

Iterative batch learning

Initialization: initialize w^0 (the initial weight vector)

For each iteration i = 0…T:
– Determine by how much to change w based on the entire data set D: Δw = computeDelta(D, w^i)
– Update w: w^(i+1) = update(w^i, Δw)

Gradient Descent: Update

1. Compute ∇Err(w^i), the gradient of the training error at w^i. This requires going over the entire training data.

∇Err(w) = (∂Err(w)/∂w0, ∂Err(w)/∂w1, …, ∂Err(w)/∂wN)^T

2. Update w: w^(i+1) = w^i − α∇Err(w^i), where α > 0 is the learning rate.

What's a gradient?

∇Err(w) = (∂Err(w)/∂w0, ∂Err(w)/∂w1, …, ∂Err(w)/∂wN)^T

The gradient is a vector of partial derivatives. It indicates the direction of steepest increase in Err(w); hence the minus sign in the update rule: w^(i+1) = w^i − α∇Err(w^i).

CS446 Machine Learning 33 Computing ∇Err(wi) ∂Err(w) ∂ 1 = ( y − f(x ))2 ∑ d d (j) 1 2 ∂wi ∂wi 2 d∈D Err(w )= ∑ (yd − f(x)d ) 2 d∈D 1 ∂ 2 = ∑ ( yd − f(xd )) 2 ∂wi d∈D 1 ∂ = ∑ 2(yd − f(xd )) (yd − w ⋅ xd ) 2 d∈D ∂wi

= − ∑ (yd − f (xd ))xdi d∈D

Gradient descent (batch)

Initialize w^0 randomly
for i = 0…T:
    Δw = (0, …, 0)
    for every training item d = 1…D:
        f(xd) = w^i · xd
        for every component j = 0…N:
            Δwj += α (yd − f(xd)) · xdj
    w^(i+1) = w^i + Δw
return w^(i+1) when it has converged
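A NumPy version of this loop (a sketch under the slide's setup: augmented inputs X, labels y in {−1, +1}, and a hand-picked learning rate alpha and iteration count T; a fixed iteration count stands in for a real convergence test):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, T=100):
    """LMS batch gradient descent: w <- w + alpha * sum_d (y_d - w.x_d) x_d."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # initialize w randomly
    for _ in range(T):
        residuals = y - X @ w                # y_d - f(x_d) for all d
        delta_w = alpha * (X.T @ residuals)  # accumulate the change over all items
        w = w + delta_w
    return w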

The batch update rule for each component of w

Δwi = α ∑_{d=1…D} (yd − w^i · xd) xdi

Implementing gradient descent: as you go through the training data, you can just accumulate the change in each component wi of w.

Learning rate and convergence

The learning rate is also called the step size. More sophisticated algorithms (e.g. conjugate gradient) choose the step size automatically and converge faster.

– When the learning rate is too small, convergence is very slow.
– When the learning rate is too large, we may oscillate (overshoot the global minimum).
– You have to experiment to find the right learning rate for your task.
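In practice, that experimentation can be as simple as a sweep over a few candidate step sizes, keeping the weights with the lowest training error. The sketch below reuses the hypothetical batch_gradient_descent and lms_error helpers from the earlier sketches and assumes training data X, y as before:

best_w, best_err = None, float("inf")
for alpha in (0.3, 0.1, 0.03, 0.01, 0.003):
    w = batch_gradient_descent(X, y, alpha=alpha, T=100)   # train with this step size
    err = lms_error(w, X, y)                               # training error of the result
    if err < best_err:
        best_w, best_err = w, err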

Online learning with Stochastic Gradient Descent

Stochastic Gradient Descent

Online learning algorithm: – Learner updates the hypothesis with each training example – No assumption that we will see the same training examples again – Like batch gradient descent, except we update after seeing each example

CS446 Machine Learning 39 Why online learning?

Too much training data: – Can’t afford to iterate over everything

Streaming scenario: – New data will keep coming – You can’t assume you have seen everything – Useful also for adaptation (e.g. user-specific spam detectors)

Stochastic Gradient Descent (online)

Initialize w^0 randomly
for m = 0…M:
    f(xm) = w^m · xm
    for every component j = 0…N:
        Δwj = α (ym − f(xm)) · xmj
    w^(m+1) = w^m + Δw
return the final w when it has converged
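A NumPy sketch of the online version under the same assumptions (one update per example, a single pass over the data; the fixed pass stands in for a convergence test):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01):
    """LMS online update: after each example, w <- w + alpha * (y_m - w.x_m) x_m."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])           # initialize w randomly
    for x_m, y_m in zip(X, y):                # one update per training example
        residual = y_m - np.dot(w, x_m)       # y_m - f(x_m)
        w = w + alpha * residual * x_m
    return w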

Today's key concepts

Linear classifiers
Loss functions
Gradient descent
Stochastic gradient descent
