
Linear Discriminant Functions: Descent and Convergence

• The Two-Category Linearly Separable Case (5.4)

• Minimizing the Perceptron Criterion (5.5)

Role of Linear Discriminant Functions

• A discriminative approach, as opposed to the generative approach of parameter estimation
• Leads to Artificial Neural Networks
• Leads to Support Vector Machines

Two-category Linearly Separable Case

• Set of n samples y1, y2, ..., yn, some labelled ω1 and some labelled ω2
• Use the samples to determine a weight vector a such that (see the sketch after this list)
  aᵗ yi > 0 implies class ω1
  aᵗ yi < 0 implies class ω2
• If such a weight vector exists, the samples are said to be linearly separable
• Without loss of generality, the decision hyperplane passes through the origin (by mapping from x-space to augmented y-space)
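A minimal sketch of this decision rule (assuming NumPy and augmented feature vectors; the weight values and sample below are illustrative only):

```python
import numpy as np

def classify(a, y):
    """Linear decision rule: class ω1 if aᵗy > 0, class ω2 if aᵗy < 0."""
    return 1 if a @ y > 0 else 2

# Augmented sample y = [1, x1, x2]; weight vector a chosen for illustration
a = np.array([-0.5, 1.0, 1.0])
y = np.array([1.0, 0.2, 0.7])
print(classify(a, y))   # aᵗy = 0.4 > 0, so class ω1
```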

Normalization

[Figure: raw data vs. normalized data]
Raw data: the solution vector separates the red points from the black points.
Normalized data: the red (ω2) points are changed in sign, so the solution vector places all vectors on the same side of the plane.

The solution region is the intersection of n half-spaces.
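A minimal sketch of this sign-change normalization (assuming NumPy; the label values 1 and 2 are illustrative):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the samples labelled ω2 so that a single condition,
    aᵗ yi > 0 for every i, characterizes a solution vector."""
    Y = np.asarray(Y, dtype=float).copy()
    Y[np.asarray(labels) == 2] *= -1.0   # sign change for the ω2 samples
    return Y

# a is then a solution vector iff (Y_norm @ a > 0).all()
```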

Margin

• The solution vector is not unique
• Additional requirements can constrain the solution vector. Two possible approaches:
  1. Seek a unit-length weight vector that maximizes the minimum distance from the samples to the separating hyperplane
  2. Seek the minimum-length weight vector satisfying aᵗ yi ≥ b > 0, where b is called the margin
• The new solution region lies inside the old one, insulated from the old boundaries by b / ||yi||

[Figure: with no margin (b = 0), the solution region is as before; a margin b > 0 shrinks the solution region by margins b / ||yi||]

Gradient Descent Formulation

• To find a solution to the set of inequalities aᵗ yi > 0

• Define a criterion function J(a) that is minimized if a is a solution vector

• We will define such a function J later

Start with an arbitrarily chosen weight vector a(1) and compute the gradient vector ∇J(a(1)). The next value a(2) is obtained by moving some distance from a(1) in the direction of steepest descent, i.e., along the negative of the gradient. In general,

a(k+1) = a(k) - η(k) ∇J(a(k))

where η(k) is the learning rate that sets the step size.

Basic Gradient Descent Procedure
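The procedure box from the original slide is not reproduced here; the following is a minimal sketch of the basic procedure under a simple stopping rule (the threshold θ and the criterion's gradient are placeholders):

```python
import numpy as np

def basic_gradient_descent(grad_J, a_init, eta, theta=1e-6, max_iter=1000):
    """Repeat a <- a - eta(k) * grad J(a) until the step size falls below
    the threshold theta. `eta` may be a constant or a function of k."""
    a = np.asarray(a_init, dtype=float)
    for k in range(1, max_iter + 1):
        eta_k = eta(k) if callable(eta) else eta
        step = eta_k * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < theta:
            break
    return a
```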

Criterion for step size of Basic Gradient Descent

The criterion function is approximated by a second-order expansion around a(k):

J(a) ≈ J(a(k)) + ∇Jᵗ (a - a(k)) + ½ (a - a(k))ᵗ H (a - a(k))

where H is the Hessian matrix of second-order partial derivatives ∂²J/∂ai∂aj evaluated at a(k). Substituting a(k+1) = a(k) - η(k) ∇J gives

J(a(k+1)) ≈ J(a(k)) - η(k) ||∇J||² + ½ η²(k) ∇Jᵗ H ∇J

which can be minimized by the choice

η(k) = ||∇J||² / (∇Jᵗ H ∇J)

An alternative approach is Newton's algorithm, which chooses a(k+1) to minimize the second-order expansion directly:

a(k+1) = a(k) - H⁻¹ ∇J

Comparison of simple gradient descent and Newton's method

[Figure: simple gradient descent vs. Newton's second-order method]

Newton's second-order method gives greater improvement per step, even when optimal learning rates are used for both methods. However, it carries the added computational burden of inverting the Hessian matrix.
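A minimal sketch contrasting the two update rules (assuming NumPy and that the gradient and Hessian of J are available at a(k); on a quadratic criterion the Newton step lands exactly on the minimum):

```python
import numpy as np

def optimal_step_gd(a, grad, H):
    """One gradient-descent step with the optimal rate
    eta = ||grad||^2 / (gradᵗ H grad)."""
    eta = (grad @ grad) / (grad @ H @ grad)
    return a - eta * grad

def newton_step(a, grad, H):
    """One Newton step: a <- a - H^{-1} grad (solve rather than invert)."""
    return a - np.linalg.solve(H, grad)

# Example on a quadratic J(a) = 0.5 aᵗQa - bᵗa, so grad = Qa - b and H = Q
Q = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
a = np.zeros(2)
print(newton_step(a, Q @ a - b, Q))   # -> the exact minimizer Q^{-1} b
```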

Perceptron Criterion Function

An obvious criterion function is the number of samples misclassified. This is a poor choice since it is piecewise constant. An alternative is the Perceptron criterion function

Jp(a) = Σ_{y∈Y} (-aᵗy)

where Y(a) is the set of samples misclassified by a. If no samples are misclassified, the criterion function evaluates to zero. The gradient evaluates to

∇Jp = Σ_{y∈Y} (-y)

and the update rule becomes

a(k+1) = a(k) + η(k) Σ_{y∈Yk} y
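A minimal sketch of this batch update (assuming NumPy, normalized samples as the rows of Y, and a constant learning rate; all names are illustrative):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch Perceptron: add eta times the sum of the currently
    misclassified (normalized) samples until none remain."""
    Y = np.asarray(Y, dtype=float)
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]       # samples with aᵗy <= 0
        if len(misclassified) == 0:         # Jp(a) = 0: solution found
            break
        a = a + eta * misclassified.sum(axis=0)
    return a
```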

Comparison of Four Criterion Functions

• Number of misclassified samples: piecewise constant, unacceptable for gradient descent
• Perceptron criterion: piecewise linear, acceptable for gradient descent
• Squared error: useful when patterns are not linearly separable
• Squared error with margin

Perceptron Criterion as a Function of Weights

[Figure: the Perceptron criterion function plotted as a function of the weights a1 and a2. The weight vector starts at the origin; the sequence of misclassified samples is y2, y3, y1, y3. The second update by y3 takes the solution further than the first update by y3.]

Convergence of Single-Sample Correction (simpler than batch)
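The algorithm box from the original slide is not reproduced here; the following is a minimal sketch of fixed-increment single-sample correction (assuming NumPy and normalized samples as the rows of Y):

```python
import numpy as np

def fixed_increment_single_sample(Y, max_epochs=1000):
    """Cycle through the (normalized) samples; whenever a sample is
    misclassified (aᵗy <= 0), add it to the weight vector."""
    Y = np.asarray(Y, dtype=float)
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        corrected = False
        for y in Y:
            if a @ y <= 0:
                a = a + y          # fixed increment (eta = 1)
                corrected = True
        if not corrected:          # a full pass with no corrections
            return a
    return a
```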

Perceptron Convergence Theorem

Theorem (Perceptron Convergence): If the training samples are linearly separable, then the sequence of weight vectors given by the Fixed-Increment Error-Correction Algorithm (Algorithm 4) will terminate at a solution vector.

Proof for Single-Sample Correction
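The proof steps on the original slide are not reproduced here; the following is a standard sketch of the fixed-increment argument, with â any solution vector, yᵏ the k-th misclassified sample, and α > 0 a scale factor to be chosen:

```latex
% Update on a misclassified sample: a(k+1) = a(k) + y^k with a(k)^t y^k <= 0.
\[
\|a(k+1)-\alpha\hat{a}\|^2
  = \|a(k)-\alpha\hat{a}\|^2 + 2\,(a(k)-\alpha\hat{a})^t y^k + \|y^k\|^2
  \le \|a(k)-\alpha\hat{a}\|^2 - 2\alpha\,\hat{a}^t y^k + \|y^k\|^2 .
\]
% With \beta^2 = \max_i \|y_i\|^2 and \gamma = \min_i \hat{a}^t y_i > 0,
% the choice \alpha = \beta^2/\gamma gives
\[
\|a(k+1)-\alpha\hat{a}\|^2 \le \|a(k)-\alpha\hat{a}\|^2 - \beta^2 ,
\]
% so each correction shrinks the squared distance to \alpha\hat{a} by at least
% \beta^2, and only finitely many corrections are possible.
```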


Bound on the number of corrections
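A sketch of the resulting bound, continuing the argument above (the quantities α, β, γ are as defined there):

```latex
% After k corrections the squared distance has dropped by at least k*beta^2:
\[
0 \le \|a(k+1)-\alpha\hat{a}\|^2 \le \|a(1)-\alpha\hat{a}\|^2 - k\,\beta^2
\quad\Longrightarrow\quad
k \le k_0 = \frac{\|a(1)-\alpha\hat{a}\|^2}{\beta^2} .
\]
% Starting from a(1) = 0 this becomes
% k_0 = \alpha^2\|\hat{a}\|^2/\beta^2 = \beta^2\|\hat{a}\|^2/\gamma^2 .
```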

Some Direct Generalizations

Variable increment with margin: a correction is made whenever aᵗ(k) yᵏ fails to exceed a margin b
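A minimal sketch of this variable-increment rule with margin (assuming NumPy, normalized samples, and η(k) = 1/k as one admissible choice of decreasing learning rate; all names are illustrative):

```python
import numpy as np

def variable_increment_with_margin(Y, b=0.1, max_epochs=1000):
    """Correct with a(k+1) = a(k) + eta(k) * y^k whenever aᵗ(k) y^k
    fails to exceed the margin b."""
    Y = np.asarray(Y, dtype=float)
    a = np.zeros(Y.shape[1])
    k = 0
    for _ in range(max_epochs):
        corrected = False
        for y in Y:
            k += 1
            if a @ y <= b:                # fails to exceed the margin
                a = a + (1.0 / k) * y     # eta(k) = 1/k
                corrected = True
        if not corrected:
            return a
    return a
```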

Some Direct Generalizations: original gradient descent algorithm for Jp

Some Direct Generalizations: Winnow Algorithm

Weights a+ and a- are associated with each of the two categories to be learned.

Advantages: convergence is faster than in the Perceptron, given a proper setting of the learning rate.

Each constituent weight value does not overshoot its final value.

The benefit is pronounced when there are a large number of irrelevant or redundant features.
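The slide gives no update equations; the sketch below is one common balanced-Winnow formulation with multiplicative promotion and demotion of the two weight vectors on each mistake. Treat it as an assumption about the intended algorithm rather than the slide's exact form; the parameter alpha and all names are illustrative.

```python
import numpy as np

def balanced_winnow(Y, alpha=1.1, max_epochs=1000):
    """Keep two positive weight vectors a_plus and a_minus and classify a
    (normalized) sample y by the sign of (a_plus - a_minus)ᵗ y.
    On a mistake, scale each component multiplicatively: a_plus toward y,
    a_minus away from y (assumed formulation, not the slide's exact one)."""
    Y = np.asarray(Y, dtype=float)
    d = Y.shape[1]
    a_plus, a_minus = np.ones(d), np.ones(d)
    for _ in range(max_epochs):
        mistakes = 0
        for y in Y:
            if (a_plus - a_minus) @ y <= 0:   # misclassified
                a_plus *= alpha ** y          # promote
                a_minus *= alpha ** (-y)      # demote
                mistakes += 1
        if mistakes == 0:
            break
    return a_plus, a_minus
```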