
Linear Discriminant Functions

Linear discriminant functions and decision surfaces

• Definition: a linear discriminant function is a linear combination of the components of x,

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$   (1)

  where w is the weight vector and w0 is the bias.

• A two-category classifier with a discriminant function of the form (1) uses the following rule:

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0, i.e. decide ω1 if $\mathbf{w}^t \mathbf{x} > -w_0$ and ω2 otherwise. If g(x) = 0, x can be assigned to either class.

– The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω1 from points assigned to the category ω2

– When g(x) is linear, the decision surface is a hyperplane

– Algebraic measure of the distance from x to the hyperplane (interesting result!)

  $\mathbf{x} = \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|}$

  where $\mathbf{x}_p$ is the normal projection of x onto the hyperplane H and r is the signed distance from x to H. Since $g(\mathbf{x}_p) = 0$ and $\mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2$:

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0 = g(\mathbf{x}_p) + r \, \|\mathbf{w}\| \quad\Rightarrow\quad r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$

  In particular, the distance from the origin to H is $d(\mathbf{0}, H) = \frac{w_0}{\|\mathbf{w}\|}$.

– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface (illustrated in the sketch below)

– The orientation of the surface is determined by the normal vector w and the location of the surface is determined by the bias
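To make the distance result above concrete, here is a minimal numpy sketch; the particular weight vector, bias, and sample point are made-up values, not from the lecture:

```python
import numpy as np

# Hypothetical linear discriminant g(x) = w^t x + w0 (made-up values).
w = np.array([3.0, 4.0])    # normal vector of the hyperplane H
w0 = -5.0                   # bias

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    g = w @ x + w0
    return g / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x, w, w0))        # r = g(x) / ||w||  -> 1.0 here
print(abs(w0) / np.linalg.norm(w))      # distance from the origin to H: |w0| / ||w||
```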

The multi-category case

• We define c linear discriminant functions

  $g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \qquad i = 1, \ldots, c$

  and assign x to ωi if gi(x) > gj(x) ∀ j ≠ i; in case of ties, the classification is undefined

• In this case, the classifier is a “linear machine”

• A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in the region Ri (see the sketch below)
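A minimal sketch of such a linear machine in numpy; the weight matrix W and biases below are illustrative values of my own, not taken from the lecture:

```python
import numpy as np

# Hypothetical linear machine for c = 3 classes in a 2-D feature space.
W = np.array([[ 1.0,  0.0],     # w_1
              [-1.0,  1.0],     # w_2
              [ 0.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.5, -0.5])

def linear_machine(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) = w_i^t x + w_i0 is largest.
    Ties are broken arbitrarily by argmax (the lecture leaves them undefined)."""
    g = W @ x + w0
    return int(np.argmax(g)) + 1    # class index in {1, ..., c}

print(linear_machine(np.array([2.0, 1.0]), W, w0))   # -> 1 for this example
```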

• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:

  $g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad\Leftrightarrow\quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$

• wi – wj is normal to Hij and

  $d(\mathbf{x}, H_{ij}) = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$

Generalized Linear Discriminant Functions

• Decision boundaries which separate between classes may not always be linear

• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces

• A popular approach to generalizing the concept of linear decision functions is to consider a generalized decision function as:

  $g(\mathbf{x}) = w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_N f_N(\mathbf{x}) + w_{N+1}$   (1)

  where the fi(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ Rn

• Introducing $f_{N+1}(\mathbf{x}) = 1$ we get:

  $g(\mathbf{x}) = \sum_{i=1}^{N+1} w_i f_i(\mathbf{x}) = \mathbf{w}^T \mathbf{y}$

  where $\mathbf{w} = (w_1, w_2, \ldots, w_N, w_{N+1})^T$ and $\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_N(\mathbf{x}), f_{N+1}(\mathbf{x}))^T$

• This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N + 1)-dimensional space (N + 1 > n)

• g(x) maintains its non-linearity characteristics in Rn

• The most commonly used generalized decision function is the g(x) for which the fi(x) (1 ≤ i ≤ N) are polynomial terms in the components of x:

  $g(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_i x_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \beta_{ijk} x_i x_j x_k + \cdots$

• Quadratic decision functions for a 2-dimensional feature space:

  $g(\mathbf{x}) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$

  here: $\mathbf{w} = (w_1, w_2, \ldots, w_6)^T$ and $\mathbf{y} = (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2,\ 1)^T$
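A small sketch of the 2-D quadratic mapping x → y described above, so the quadratic discriminant becomes a dot product in the 6-dimensional y-space (the weights here are made-up values):

```python
import numpy as np

def quadratic_features(x):
    """Map a 2-D pattern x = (x1, x2) to y = (x1^2, x1*x2, x2^2, x1, x2, 1)."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

# g(x) = w^T y is then linear in the 6-D y-space (weights chosen arbitrarily).
w = np.array([1.0, -2.0, 1.0, 0.0, 0.0, -0.25])
x = np.array([0.5, 1.0])
print(w @ quadratic_features(x))
```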

• For patterns x ∈ Rn, the most general quadratic decision function is given by:

  $g(\mathbf{x}) = \sum_{i=1}^{n} w_{ii} x_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1}$   (2)

• The number of terms on the right-hand side is:

  $\ell = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$

  This is the total number of weights, which are the free parameters of the problem (verified numerically in the sketch below):

  – n = 3 ⇒ 10-dimensional
  – n = 10 ⇒ 66-dimensional
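The count ℓ = (n+1)(n+2)/2 can be double-checked by enumerating the distinct monomials of degree at most 2 directly; a quick sketch (function names are mine):

```python
from itertools import combinations_with_replacement

def num_quadratic_terms(n):
    """Count the distinct monomials of degree <= 2 in n variables, constant included.

    Treating the constant '1' as an extra symbol, every such monomial is a product
    of two symbols drawn with replacement from {x1, ..., xn, 1}.
    """
    symbols = [f"x{i}" for i in range(1, n + 1)] + ["1"]
    return sum(1 for _ in combinations_with_replacement(symbols, 2))

for n in (3, 10):
    print(n, num_quadratic_terms(n), (n + 1) * (n + 2) // 2)   # -> 3 10 10, then 10 66 66
```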

• In the case of polynomial decision functions of order m, a typical fi(x) is given by:

  $f_i(\mathbf{x}) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}$

  where $1 \le i_1, i_2, \ldots, i_m \le n$ and $e_i$, $1 \le i \le m$, is 0 or 1.

– It is a polynomial of degree between 0 and m. To avoid repetitions, we require $i_1 \le i_2 \le \cdots \le i_m$

  $g^m(\mathbf{x}) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \ldots i_m} x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(\mathbf{x})$

  where $g^m(\mathbf{x})$ is the most general polynomial decision function of order m

Example 1: Let n = 3 and m = 2, then:

  $g^2(\mathbf{x}) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2} x_{i_1} x_{i_2} + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$

  $= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$

Example 2: Let n = 2 and m = 3, then:

  $g^3(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + g^2(\mathbf{x}) = w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(\mathbf{x})$

  where $g^2(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2} x_{i_1} x_{i_2} + g^1(\mathbf{x}) = w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$
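A hedged sketch of the ordered-index construction i1 ≤ i2 ≤ … ≤ im used in the examples above, building the feature vector y of g^m(x) (function and variable names are mine):

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_features(x, m):
    """All monomials x_{i1} x_{i2} ... x_{ik} with i1 <= ... <= ik for k = m, m-1, ..., 1,
    plus the constant term of g^1, i.e. the feature vector y for g^m(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    feats = []
    for k in range(m, 0, -1):                        # degree-m terms, then those of g^{m-1}, ...
        for idx in combinations_with_replacement(range(n), k):
            feats.append(np.prod(x[list(idx)]))
    feats.append(1.0)                                # constant term
    return np.array(feats)

# Example 2 above (n = 2, m = 3): 4 cubic + 3 quadratic + 2 linear + 1 constant = 10 terms
print(len(polynomial_features([0.5, -1.0], 3)))      # -> 10
```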

• Augmentation: absorb the bias into the weight vector by appending a constant component 1 to every pattern

• Incorporate labels into data: replace each sample xi by yi = αi xi, so that a correctly classified sample always satisfies aᵗyi > 0

• Margin: require aᵗyi > b for some b > 0 rather than merely aᵗyi > 0 (these three steps are sketched in code after this list)
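A minimal sketch of these three preprocessing steps; the sample data, the candidate weight vector, and the margin value b are made up for illustration:

```python
import numpy as np

# Made-up 2-D samples and labels alpha_i in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
alpha = np.array([1, 1, -1, -1])

# Augmentation: prepend a constant 1 so the bias is absorbed into the weight vector a.
X_aug = np.hstack([np.ones((len(X), 1)), X])      # rows are (1, x1, x2)

# Incorporate labels: y_i = alpha_i * x_i, so correct classification means a^t y_i > 0.
Y = alpha[:, None] * X_aug

# Margin: require a^t y_i > b instead of just a^t y_i > 0.
b = 0.1
a = np.array([0.0, 1.0, 1.0])                     # some candidate weight vector
print(Y @ a > b)                                  # which samples satisfy the margin
```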

Learning Linear Classifiers: Basic Ideas

Directly “fit” a linear boundary in feature space.

• 1) Define an error function for classification
  – Number of errors?
  – Total distance of mislabeled points from the boundary?
  – Least-squares error
  – Within/between class scatter

• 2) Optimize the error function on training data
  – Gradient descent
  – Modified gradient descent
  – Various search procedures
  – How much data is needed?

Two-category case

• Given sample points x1, x2, …, xn with true category labels α1, α2, …, αn:

  αi = +1 if point xi is from class ω1
  αi = −1 if point xi is from class ω2

• Decisions are made according to:

  if aᵗxi > 0, class ω1 is chosen
  if aᵗxi < 0, class ω2 is chosen

• These decisions are wrong when aᵗxi is negative and xi belongs to class ω1 (or positive and xi belongs to ω2). Let yi = αi xi. Then aᵗyi > 0 when xi is correctly labelled, and aᵗyi < 0 otherwise.

• Separating vector (solution vector)
  – Find a such that aᵗyi > 0 for all i
  – aᵗy = 0 defines a hyperplane with a as its normal


• Problem: the solution vector is not unique

• Add additional constraints: a margin, i.e. a required distance away from the boundary, aᵗyi > b

Margins in data space


Larger margins promote uniqueness for underconstrained problems

Finding a solution vector by minimizing an error criterion

• What error function? How do we weight the points? All the points or only the error points?

• Only error points: the Perceptron Criterion

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$

Perceptron Criterion with Margin

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < b\}$
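A small sketch of the perceptron criterion (with an optional margin b) evaluated on label-normalized samples yi; the data and names are illustrative only:

```python
import numpy as np

def perceptron_criterion(a, Y, b=0.0):
    """J_P(a): sum of -(a^t y_i) over the samples with a^t y_i < b.

    Y holds the label-normalized, augmented samples y_i as rows.
    b = 0 gives the plain perceptron criterion; b > 0 adds a margin.
    """
    scores = Y @ a
    bad = scores < b                    # misclassified (or inside the margin)
    return -np.sum(scores[bad]), bad

Y = np.array([[ 1.0, 1.0, 2.0],
              [ 1.0, 2.0, 0.5],
              [-1.0, 1.0, 1.0],
              [-1.0, 2.0, 0.0]])        # made-up normalized samples
a = np.array([0.0, 1.0, -1.0])
print(perceptron_criterion(a, Y))       # criterion value and the misclassified mask
```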

Minimizing Error via Gradient Descent

• The min (max) problem:

  $\min_{\mathbf{x}} f(\mathbf{x})$

• But we learned in calculus how to solve that kind of question!

Motivation

• Not exactly…

• Functions: $f : \mathbb{R}^n \to \mathbb{R}$

• High-order polynomials, e.g. the Taylor expansion of sin x:

  $x - \frac{1}{6} x^3 + \frac{1}{120} x^5 - \frac{1}{5040} x^7$

• What about functions that don’t have an analytic representation: a “black box”, e.g.

  $f(x, y) = \cos\left(\tfrac{1}{2} x\right) \cos\left(\tfrac{1}{2} y\right) \, x$

Directional Derivatives: first, the one-dimensional derivative:

  $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

Directional Derivatives: Along the Axes…

  $\frac{\partial f(x, y)}{\partial x}, \qquad \frac{\partial f(x, y)}{\partial y}$

Directional Derivatives: In a general direction…

  $\mathbf{v} \in \mathbb{R}^2, \qquad \|\mathbf{v}\| = 1$

  $\frac{\partial f(x, y)}{\partial \mathbf{v}}$

Directional Derivatives

  $\frac{\partial f(x, y)}{\partial x}, \qquad \frac{\partial f(x, y)}{\partial y}$

The Gradient: Definition in $\mathbb{R}^2$

  $f : \mathbb{R}^2 \to \mathbb{R}$

  $\nabla f(x, y) := \left( \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y} \right)$

In the plane

The Gradient: Definition

  $f : \mathbb{R}^n \to \mathbb{R}$

  $\nabla f(x_1, \ldots, x_n) := \left( \frac{\partial f}{\partial x_1},\ \ldots,\ \frac{\partial f}{\partial x_n} \right)$

The Gradient Properties

• The gradient defines a (hyper)plane approximating the function infinitesimally:

  $\Delta z = \frac{\partial f}{\partial x} \, \Delta x + \frac{\partial f}{\partial y} \, \Delta y$

The Gradient Properties

• Computing directional derivatives: we want the rate of change along a direction v at a point p

• By the chain rule (important for later use), for $\|\mathbf{v}\| = 1$:

  $\frac{\partial f}{\partial \mathbf{v}}(p) = \left\langle (\nabla f)_p,\ \mathbf{v} \right\rangle$

The Gradient Properties

  $\frac{\partial f}{\partial \mathbf{v}}$ is maximal when choosing $\mathbf{v} = \frac{1}{\|(\nabla f)_p\|} (\nabla f)_p$

  and minimal when choosing $\mathbf{v} = \frac{-1}{\|(\nabla f)_p\|} (\nabla f)_p$

  (intuition: the gradient points in the direction of greatest change)
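A numeric sanity check of this claim on an assumed test function f(x, y) = x² + 3y² (not from the slides): among unit vectors, the directional derivative is largest along the normalized gradient.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])     # gradient of f(x, y) = x^2 + 3 y^2

p = np.array([1.0, -2.0])
g = grad_f(p)

# Directional derivative along a unit vector v is <grad f(p), v>.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
derivs = dirs @ g

best = dirs[np.argmax(derivs)]
print(best)                              # approximately grad f(p) / ||grad f(p)||
print(g / np.linalg.norm(g))
```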

The Gradient Properties

• Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function around p. If f has a local minimum (maximum) at p, then

  $(\nabla f)_p = \mathbf{0}$

  (a necessary condition for a local min (max))

The Gradient Properties: Intuition

The Gradient Properties

Formally: for any $\mathbf{v} \in \mathbb{R}^n \setminus \{\mathbf{0}\}$ we get:

  $0 = \left. \frac{d\, f(p + t \mathbf{v})}{dt} \right|_{t=0} = \left\langle (\nabla f)_p,\ \mathbf{v} \right\rangle \quad\Rightarrow\quad (\nabla f)_p = \mathbf{0}$

The Gradient Properties

• We found the best INFINITESIMAL DIRECTION at each point

• We are looking for the minimum using a “blind man” procedure

• How can we get to the minimum using this knowledge?

Steepest Descent

• Algorithm

  Initialization: $\mathbf{x}_0 \in \mathbb{R}^n$

  loop while $\|\nabla f(\mathbf{x}_i)\| > \varepsilon$
      compute the search direction $\mathbf{h}_i = \nabla f(\mathbf{x}_i)$
      update $\mathbf{x}_{i+1} = \mathbf{x}_i - \lambda \, \mathbf{h}_i$
  end loop
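A minimal numpy sketch of the loop above, with a fixed step size λ and an assumed quadratic test function (both made up for illustration):

```python
import numpy as np

def steepest_descent(grad, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """x_{i+1} = x_i - lam * grad(x_i), stopping once ||grad(x_i)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad(x)                      # search direction h_i = grad f(x_i)
        if np.linalg.norm(h) <= eps:
            break
        x = x - lam * h
    return x

# Test function f(x) = (x1 - 1)^2 + 2 (x2 + 3)^2, whose minimum is at (1, -3).
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(steepest_descent(grad, x0=[0.0, 0.0]))    # -> approximately [1, -3]
```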

Setting the Learning Rate Intelligently

Minimizing this approximation:

Minimizing the Perceptron Criterion

Perceptron Criterion

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$

  $\nabla_{\mathbf{a}} J_P(\mathbf{a}) = \nabla_{\mathbf{a}} \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{y}_i$

  which gives the update rule:

  $\mathbf{a}(k+1) = \mathbf{a}(k) + \eta_k \sum_{\mathbf{y}_i \in Y_k} \mathbf{y}_i$

  where $Y_k$ is the set of samples misclassified at step k

Batch Perceptron Update
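Since the update box itself is not reproduced here, this is a hedged sketch of the batch perceptron rule a(k+1) = a(k) + η Σ yi derived above, with a constant learning rate; the data and function names are my own:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: a <- a + eta * sum of the currently misclassified y_i.

    Y holds the label-normalized, augmented samples y_i as rows.
    The loop terminates early only if the samples are linearly separable.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]        # Y_k (<= 0 so the all-zero start vector counts)
        if len(misclassified) == 0:
            return a                         # separating vector found
        a = a + eta * misclassified.sum(axis=0)
    return a                                 # may not separate after max_iter

# Two separable classes in 2-D, then augmented and label-normalized as before.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.0]])
alpha = np.array([1, 1, -1, -1])
Y = alpha[:, None] * np.hstack([np.ones((4, 1)), X])
print(batch_perceptron(Y))                   # a separating vector a
```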

Simplest Case

When does the perceptron rule converge?

Four Learning Criteria

Different error functions as a function of the weights in a linear classifier:

At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures.

At the upper right is the Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient descent.

The lower left is squared error (Eq. 32), which has nice analytic properties and is useful even when the patterns are not linearly separable.

The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region in hopes of improving generalization of the resulting classifier.

Three Different Issues

1) Class boundary expressiveness
2) Error definition
3) Learning method

Fisher’s Linear Discriminant

Fisher’s Linear Discriminant: Intuition

We are looking for a good projection

Write this in terms of w

So,

Next,

Define the scatter:

Then,

Thus the denominator can be written:

Rayleigh Quotient

Maximizing is equivalent to solving:

With solution:
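The equations themselves are not reproduced above, so as a hedged sketch: for the standard two-class Fisher discriminant, maximizing the Rayleigh quotient J(w) = (wᵗ S_B w) / (wᵗ S_W w) leads to the generalized eigenvalue problem S_B w = λ S_W w, whose two-class solution is w ∝ S_W⁻¹(m1 - m2). A minimal numpy version with made-up data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher linear discriminant: w proportional to S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix (standard result)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Made-up 2-D data for two classes.
rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X2 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
print(fisher_direction(X1, X2))     # projection direction maximizing class separation
```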
