
Linear Discriminant Functions

Linear discriminant functions and decision surfaces

• Definition: a linear discriminant function is a linear combination of the components of x,
  $g(x) = w^T x + w_0$   (1)
  where w is the weight vector and $w_0$ the bias.
• A two-category classifier with a discriminant function of the form (1) uses the following rule:
  Decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ if $g(x) < 0$
  ⇔ Decide $\omega_1$ if $w^T x > -w_0$ and $\omega_2$ otherwise.
  If $g(x) = 0$, x may be assigned to either class.
– The equation $g(x) = 0$ defines the decision surface that separates points assigned to category $\omega_1$ from points assigned to category $\omega_2$.
– When g(x) is linear, the decision surface is a hyperplane.
– Algebraic measure of the distance from x to the hyperplane (interesting result!):
  Write $x = x_p + r\,\frac{w}{\|w\|}$, where $x_p$ is the projection of x onto the hyperplane H. Since $g(x_p) = 0$ and $w^T w = \|w\|^2$,
  $g(x) = w^T x + w_0 = w^T\!\left(x_p + r\,\frac{w}{\|w\|}\right) + w_0 = g(x_p) + r\,\frac{w^T w}{\|w\|} = r\,\|w\|$
  $\Rightarrow r = \frac{g(x)}{\|w\|}$
  In particular, the distance from the origin to H is $d([0,0],H) = \frac{w_0}{\|w\|}$.
– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface.
– The orientation of the surface is determined by the normal vector w, and its location is determined by the bias $w_0$.

– The multi-category case
• We define c linear discriminant functions
  $g_i(x) = w_i^T x + w_{i0}$,  $i = 1, \dots, c$
  and assign x to $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$; in case of ties, the classification is undefined.
• In this case, the classifier is a "linear machine".
• A linear machine divides the feature space into c decision regions, with $g_i(x)$ being the largest discriminant when x lies in region $R_i$.
• For two contiguous regions $R_i$ and $R_j$, the boundary that separates them is a portion of the hyperplane $H_{ij}$ defined by
  $g_i(x) = g_j(x) \;\Leftrightarrow\; (w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0$
• $w_i - w_j$ is normal to $H_{ij}$, and
  $d(x, H_{ij}) = \frac{g_i(x) - g_j(x)}{\|w_i - w_j\|}$

Generalized Linear Discriminant Functions

• Decision boundaries that separate classes are not always linear.
• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces.
• A popular approach to generalizing linear decision functions is to consider a generalized decision function
  $g(x) = w_1 f_1(x) + w_2 f_2(x) + \dots + w_N f_N(x) + w_{N+1}$   (2)
  where the $f_i(x)$, $1 \le i \le N$, are scalar functions of the pattern x, $x \in R^n$ (Euclidean space).
• Introducing $f_{N+1}(x) = 1$ we get
  $g(x) = \sum_{i=1}^{N+1} w_i f_i(x) = w^T y$
  where $w = (w_1, w_2, \dots, w_N, w_{N+1})^T$ and $y = (f_1(x), f_2(x), \dots, f_N(x), f_{N+1}(x))^T$.
• This representation of g(x) implies that any decision function defined by equation (2) can be treated as linear in the (N+1)-dimensional space (N+1 > n).
• g(x) retains its non-linear character in $R^n$.

• The most commonly used generalized decision function is the g(x) for which the $f_i(x)$ ($1 \le i \le N$) are polynomials:
  $g(x) = w_0 + \sum_i w_i x_i + \sum_i \sum_j \beta_{ij} x_i x_j + \sum_i \sum_j \sum_k \gamma_{ijk} x_i x_j x_k + \dots$
• Quadratic decision function for a 2-dimensional feature space:
  $g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$
  Here $w = (w_1, w_2, \dots, w_6)^T$ and $y = (x_1^2,\, x_1 x_2,\, x_2^2,\, x_1,\, x_2,\, 1)^T$.
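As a concrete illustration of the quadratic case above, here is a minimal Python/NumPy sketch that maps a 2-D pattern x to $y = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)$ and evaluates $g(x) = w^T y$. The weight values, the test point, and the function names are made up for illustration; they are not from the slides.

```python
import numpy as np

def quadratic_features(x):
    """Map a 2-D pattern x = (x1, x2) to y = (x1^2, x1*x2, x2^2, x1, x2, 1)."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

def g(x, w):
    """Generalized discriminant: linear in y even though it is quadratic in x."""
    return w @ quadratic_features(x)

# Hypothetical weights w = (w1, ..., w6); decide omega_1 if g(x) > 0, omega_2 if g(x) < 0.
w = np.array([1.0, -0.5, 1.0, 0.0, 0.0, -4.0])
x = np.array([3.0, 1.0])
value = g(x, w)
print("g(x) =", value, "->", "omega_1" if value > 0 else "omega_2")
```

The point of the sketch is the representation: the decision boundary $g(x) = 0$ is a conic section in the original 2-D space, yet the classifier itself is just a dot product in the 6-dimensional feature space.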
• For patterns $x \in R^n$, the most general quadratic decision function is given by
  $g(x) = \sum_{i=1}^{n} \alpha_{ii} x_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \alpha_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1}$   (3)
• The number of terms on the right-hand side of (3) is
  $\ell = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$
  This is the total number of weights, which are the free parameters of the problem.
  – n = 3 ⇒ 10-dimensional
  – n = 10 ⇒ 66-dimensional

• In the case of polynomial decision functions of order m, a typical $f_i(x)$ is given by
  $f_i(x) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}$
  where $1 \le i_1, i_2, \dots, i_m \le n$ and each $e_i$, $1 \le i \le m$, is 0 or 1.
  – It is a polynomial with a degree between 0 and m. To avoid repetitions, require $i_1 \le i_2 \le \dots \le i_m$.
  $g^m(x) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \dots i_m}\, x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(x)$
  where $g^m(x)$ is the most general polynomial decision function of order m.

Example 1: Let n = 3 and m = 2. Then
  $g^2(x) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2} x_{i_1} x_{i_2} + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$
  $= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$

Example 2: Let n = 2 and m = 3. Then
  $g^3(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + g^2(x)$
  $= w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(x)$
  where $g^2(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2} x_{i_1} x_{i_2} + g^1(x)$
  $= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$

• Augmentation
• Incorporate labels into the data
• Margin

Learning Linear Classifiers

Basic ideas: directly "fit" a linear boundary in feature space.
• 1) Define an error function for classification
  – Number of errors?
  – Total distance of mislabeled points from the boundary?
  – Least squares error
  – Within/between class scatter
• 2) Optimize the error function on the training data
  – Gradient descent
  – Modified gradient descent
  – Various search procedures
  – How much data is needed?

Two-category case
• Given sample points $x_1, x_2, \dots, x_n$ with true category labels $\alpha_1, \alpha_2, \dots, \alpha_n$:
  $\alpha_i = 1$ if point $x_i$ is from class $\omega_1$
  $\alpha_i = -1$ if point $x_i$ is from class $\omega_2$
• Decisions are made according to:
  if $a^T x_i > 0$, class $\omega_1$ is chosen
  if $a^T x_i < 0$, class $\omega_2$ is chosen
• These decisions are wrong when $a^T x_i$ is negative and $x_i$ belongs to class $\omega_1$ (and symmetrically for $\omega_2$). Let $y_i = \alpha_i x_i$. Then $a^T y_i > 0$ when $x_i$ is correctly labelled, and negative otherwise.

• Separating vector (solution vector)
  – Find a such that $a^T y_i > 0$ for all i.
  – $a^T y = 0$ defines a hyperplane with a as its normal.

• Problem: the solution vector is not unique.
• Add an additional constraint, a margin: require a distance b away from the boundary, $a^T y_i > b$.

Margins in data space: larger margins promote uniqueness for under-constrained problems.

Finding a solution vector by minimizing an error criterion
– What error function?
– How do we weight the points: all the points, or only the error points?

Only the error points: the Perceptron Criterion (illustrated in the sketch below)
  $J_P(a) = \sum_{y \in Y} -(a^T y)$,  $Y = \{ y_i \mid a^T y_i < 0 \}$
Perceptron Criterion with margin
  $J_P(a) = \sum_{y \in Y} -(a^T y)$,  $Y = \{ y_i \mid a^T y_i < b \}$
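To make the normalization $y_i = \alpha_i x_i$ and the perceptron criterion concrete, here is a minimal NumPy sketch. The toy data, the candidate weight vector, and the function names are illustrative assumptions, not from the slides; each sample is augmented with a trailing 1 so that the bias is absorbed into a.

```python
import numpy as np

def normalize(X, labels):
    """Augment each sample with a trailing 1 and multiply by its label alpha_i (+1/-1),
    so that a correctly classified sample satisfies a^T y_i > 0."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # augmentation: x -> (x, 1)
    return labels[:, None] * X_aug                      # y_i = alpha_i * augmented x_i

def perceptron_criterion(a, Y, b=0.0):
    """J_P(a) = sum of -(a^T y) over samples with a^T y < b (b = 0 gives the plain criterion)."""
    margins = Y @ a
    violating = margins < b
    return -np.sum(margins[violating]), violating

# Toy 2-D data: two points per class, labels alpha_i in {+1, -1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
labels = np.array([1, 1, -1, -1])
Y = normalize(X, labels)

a = np.array([1.0, -1.0, 0.0])                  # candidate (augmented) weight vector
J, violated = perceptron_criterion(a, Y, b=0.0)
print("J_P(a) =", J, " misclassified:", int(violated.sum()))
```

Note that $J_P(a) = 0$ exactly when a is a separating vector, which is why it is a natural objective for the gradient-based search described next.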
Minimizing Error via Gradient Descent

• The min (max) problem: $\min_x f(x)$
• But we learned in calculus how to solve that kind of problem!

Motivation
• Not exactly. We now have functions $f : R^n \to R$, not just high-order polynomials such as
  $x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7$
• What about functions that have no analytic presentation: a "black box"?
  Example: $f(x, y) = \cos\!\left(\tfrac{1}{2}x\right)\cos\!\left(\tfrac{1}{2}y\right)x$

Directional derivatives
• First, the one-dimensional derivative.
• Along the axes: $\frac{\partial f(x,y)}{\partial x}$ and $\frac{\partial f(x,y)}{\partial y}$.
• In a general direction $v \in R^2$ with $\|v\| = 1$: $\frac{\partial f(x,y)}{\partial v}$.

The Gradient: definition in $R^2$
  $f : R^2 \to R$, $\nabla f(x, y) := \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$ (in the plane)

The Gradient: definition
  $f : R^n \to R$, $\nabla f(x_1, \dots, x_n) := \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$

The Gradient: properties
• The gradient defines the (hyper)plane approximating the function infinitesimally:
  $\Delta z = \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y$
• Computing directional derivatives by the chain rule (important for later use): for $\|v\| = 1$, the rate of change along v at a point p is
  $\frac{\partial f}{\partial v}(p) = \langle (\nabla f)_p, v \rangle$
• $\frac{\partial f}{\partial v}$ is maximal for $v = \frac{(\nabla f)_p}{\|(\nabla f)_p\|}$ and minimal for $v = -\frac{(\nabla f)_p}{\|(\nabla f)_p\|}$
  (intuition: the gradient points in the direction of greatest change).
• Let $f : R^n \to R$ be smooth around p. If f has a local minimum (maximum) at p, then $(\nabla f)_p = 0$ (necessary condition for a local min (max)).
• Formally: for any $v \in R^n \setminus \{0\}$,
  $0 = \frac{d}{dt} f(p + t v)\Big|_{t=0} = \langle (\nabla f)_p, v \rangle \;\Rightarrow\; (\nabla f)_p = 0$
• We have found the best infinitesimal direction at each point, and we are looking for the minimum by a "blind man" procedure. How can we reach the minimum using this knowledge?

Steepest Descent
• Algorithm:
  Initialization: $x_0 \in R^n$
  loop while $\|\nabla f(x_i)\| > \varepsilon$
    compute the search direction $h_i = \nabla f(x_i)$
    update $x_{i+1} = x_i - \eta\, h_i$
  end loop

Setting the learning rate intelligently: choose $\eta$ by minimizing an approximation of f along the search direction.

Minimizing the Perceptron Criterion
  $J_P(a) = \sum_{y \in Y} -(a^T y)$,  $Y = \{ y_i \mid a^T y_i < 0 \}$
  $\nabla_a J_P(a) = \sum_{y \in Y} -y$
  which gives the update rule
  $a(k+1) = a(k) + \eta_k \sum_{y \in Y_k} y$
  where $Y_k$ is the set of samples misclassified at step k.

Batch Perceptron Update

Simplest case

When does the perceptron rule converge?

Four learning criteria: different error functions as a function of the weights in a linear classifier.
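Continuing the assumptions of the previous sketch (NumPy, the same toy data, samples already normalized to $y_i = \alpha_i \cdot$ augmented $x_i$), here is a minimal sketch of the batch perceptron update $a(k+1) = a(k) + \eta \sum_{y \in Y_k} y$ with a fixed learning rate. It treats $a^T y \le 0$ as misclassified so that the all-zero starting vector still gets updated, and it terminates early only when a separating vector is found.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: a(k+1) = a(k) + eta * sum of the misclassified y_i.
    Y holds the normalized, augmented samples y_i as rows.
    The early exit is reached only if the data are linearly separable."""
    a = np.zeros(Y.shape[1])
    for k in range(max_iter):
        misclassified = Y[Y @ a <= 0]        # samples with a^T y <= 0
        if misclassified.size == 0:          # J_P(a) = 0: separating vector found
            return a, k
        a = a + eta * misclassified.sum(axis=0)
    return a, max_iter                       # may not have converged (e.g. non-separable data)

# Reusing the toy data from the previous sketch:
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
labels = np.array([1, 1, -1, -1])
Y = labels[:, None] * np.hstack([X, np.ones((len(X), 1))])
a, iterations = batch_perceptron(Y)
print("solution vector a =", a, "found after", iterations, "iteration(s)")
```

On linearly separable data the loop stops with $J_P(a) = 0$; on non-separable data it simply runs out of iterations, which is one concrete way to see why the convergence question in the slide title matters.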