
Linear Discriminant Functions

Linear discriminant functions and decision surfaces

• Definition: a linear discriminant function is a linear combination of the components of x,

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$   (1)

  where w is the weight vector and w0 is the bias.

• A two-category classifier with a discriminant function of the form (1) uses the following rule:

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0, i.e. decide ω1 if $\mathbf{w}^t \mathbf{x} > -w_0$ and ω2 otherwise. If g(x) = 0, x can be assigned to either class.

– The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω1 from points assigned to the category ω2

– When g(x) is linear, the decision surface is a hyperplane

– Algebraic measure of the distance from x to the hyperplane (interesting result!)

  $\mathbf{x} = \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|}$

  where $\mathbf{x}_p$ is the normal projection of x onto the hyperplane H and r is the signed distance from x to H. Since $g(\mathbf{x}_p) = 0$ and $\mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2$:

  $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0 = g(\mathbf{x}_p) + r \, \|\mathbf{w}\| \quad\Rightarrow\quad r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}$

  In particular, the distance from the origin to H is $d(\mathbf{0}, H) = \frac{w_0}{\|\mathbf{w}\|}$.

– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface (illustrated in the sketch below)

– The orientation of the surface is determined by the normal vector w and the location of the surface is determined by the bias
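To make the distance result above concrete, here is a minimal numpy sketch; the particular weight vector, bias, and sample point are made-up values, not from the lecture:

```python
import numpy as np

# Hypothetical linear discriminant g(x) = w^t x + w0 (made-up values).
w = np.array([3.0, 4.0])    # normal vector of the hyperplane H
w0 = -5.0                   # bias

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    g = w @ x + w0
    return g / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x, w, w0))        # r = g(x) / ||w||  -> 1.0 here
print(abs(w0) / np.linalg.norm(w))      # distance from the origin to H: |w0| / ||w||
```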

The multi-category case

• We define c linear discriminant functions

  $g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \qquad i = 1, \ldots, c$

  and assign x to ωi if gi(x) > gj(x) ∀ j ≠ i; in case of ties, the classification is undefined

• In this case, the classifier is a “linear machine”

• A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in the region Ri (see the sketch below)
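A minimal sketch of such a linear machine in numpy; the weight matrix W and biases below are illustrative values of my own, not taken from the lecture:

```python
import numpy as np

# Hypothetical linear machine for c = 3 classes in a 2-D feature space.
W = np.array([[ 1.0,  0.0],     # w_1
              [-1.0,  1.0],     # w_2
              [ 0.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.5, -0.5])

def linear_machine(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) = w_i^t x + w_i0 is largest.
    Ties are broken arbitrarily by argmax (the lecture leaves them undefined)."""
    g = W @ x + w0
    return int(np.argmax(g)) + 1    # class index in {1, ..., c}

print(linear_machine(np.array([2.0, 1.0]), W, w0))   # -> 1 for this example
```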

• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:

  $g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad\Leftrightarrow\quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$

• wi – wj is normal to Hij and

  $d(\mathbf{x}, H_{ij}) = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$

Generalized Linear Discriminant Functions

• Decision boundaries which separate between classes may not always be linear

• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces

• A popular approach to generalizing the concept of linear decision functions is to consider a generalized decision function as:

  $g(\mathbf{x}) = w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_N f_N(\mathbf{x}) + w_{N+1}$   (1)

  where the fi(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ Rn

• Introducing $f_{N+1}(\mathbf{x}) = 1$ we get:

  $g(\mathbf{x}) = \sum_{i=1}^{N+1} w_i f_i(\mathbf{x}) = \mathbf{w}^T \mathbf{y}$

  where $\mathbf{w} = (w_1, w_2, \ldots, w_N, w_{N+1})^T$ and $\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_N(\mathbf{x}), f_{N+1}(\mathbf{x}))^T$

• This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N + 1)-dimensional space (N + 1 > n)

• g(x) maintains its non-linearity characteristics in Rn

• The most commonly used generalized decision function is the g(x) for which the fi(x) (1 ≤ i ≤ N) are polynomial terms in the components of x:

  $g(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} x_i x_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \beta_{ijk} x_i x_j x_k + \cdots$

• Quadratic decision functions for a 2-dimensional feature space:

  $g(\mathbf{x}) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$

  here: $\mathbf{w} = (w_1, w_2, \ldots, w_6)^T$ and $\mathbf{y} = (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2,\ 1)^T$
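A small sketch of the 2-D quadratic mapping x → y described above, so the quadratic discriminant becomes a dot product in the 6-dimensional y-space (the weights here are made-up values):

```python
import numpy as np

def quadratic_features(x):
    """Map a 2-D pattern x = (x1, x2) to y = (x1^2, x1*x2, x2^2, x1, x2, 1)."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

# g(x) = w^T y is then linear in the 6-D y-space (weights chosen arbitrarily).
w = np.array([1.0, -2.0, 1.0, 0.0, 0.0, -0.25])
x = np.array([0.5, 1.0])
print(w @ quadratic_features(x))
```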

• For patterns x ∈ Rn, the most general quadratic decision function is given by:

  $g(\mathbf{x}) = \sum_{i=1}^{n} w_{ii} x_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1}$   (2)

• The number of terms on the right-hand side is:

  $\ell = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$

  This is the total number of weights, which are the free parameters of the problem (verified numerically in the sketch below):

  – n = 3 ⇒ 10-dimensional
  – n = 10 ⇒ 66-dimensional
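The count ℓ = (n+1)(n+2)/2 can be double-checked by enumerating the distinct monomials of degree at most 2 directly; a quick sketch (function names are mine):

```python
from itertools import combinations_with_replacement

def num_quadratic_terms(n):
    """Count the distinct monomials of degree <= 2 in n variables, constant included.

    Treating the constant '1' as an extra symbol, every such monomial is a product
    of two symbols drawn with replacement from {x1, ..., xn, 1}.
    """
    symbols = [f"x{i}" for i in range(1, n + 1)] + ["1"]
    return sum(1 for _ in combinations_with_replacement(symbols, 2))

for n in (3, 10):
    print(n, num_quadratic_terms(n), (n + 1) * (n + 2) // 2)   # -> 3 10 10, then 10 66 66
```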

• In the case of polynomial decision functions of order m, a typical fi(x) is given by:

  $f_i(\mathbf{x}) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}$

  where $1 \le i_1, i_2, \ldots, i_m \le n$ and $e_i$, $1 \le i \le m$, is 0 or 1.

– It is a polynomial of degree between 0 and m. To avoid repetitions, we require $i_1 \le i_2 \le \cdots \le i_m$

  $g^m(\mathbf{x}) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \ldots i_m} x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(\mathbf{x})$

  where $g^m(\mathbf{x})$ is the most general polynomial decision function of order m

Example 1: Let n = 3 and m = 2, then:

  $g^2(\mathbf{x}) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2} x_{i_1} x_{i_2} + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$

  $= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$

Example 2: Let n = 2 and m = 3, then:

  $g^3(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + g^2(\mathbf{x}) = w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(\mathbf{x})$

  where $g^2(\mathbf{x}) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2} x_{i_1} x_{i_2} + g^1(\mathbf{x}) = w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$
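A hedged sketch of the ordered-index construction i1 ≤ i2 ≤ … ≤ im used in the examples above, building the feature vector y of g^m(x) (function and variable names are mine):

```python
from itertools import combinations_with_replacement
import numpy as np

def polynomial_features(x, m):
    """All monomials x_{i1} x_{i2} ... x_{ik} with i1 <= ... <= ik for k = m, m-1, ..., 1,
    plus the constant term of g^1, i.e. the feature vector y for g^m(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    feats = []
    for k in range(m, 0, -1):                        # degree-m terms, then those of g^{m-1}, ...
        for idx in combinations_with_replacement(range(n), k):
            feats.append(np.prod(x[list(idx)]))
    feats.append(1.0)                                # constant term
    return np.array(feats)

# Example 2 above (n = 2, m = 3): 4 cubic + 3 quadratic + 2 linear + 1 constant = 10 terms
print(len(polynomial_features([0.5, -1.0], 3)))      # -> 10
```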

• Augmentation: absorb the bias into the weight vector by appending a constant component 1 to every pattern

• Incorporate labels into data: replace each sample xi by yi = αi xi, so that a correctly classified sample always satisfies aᵗyi > 0

• Margin: require aᵗyi > b for some b > 0 rather than merely aᵗyi > 0 (these three steps are sketched in code after this list)
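A minimal sketch of these three preprocessing steps; the sample data, the candidate weight vector, and the margin value b are made up for illustration:

```python
import numpy as np

# Made-up 2-D samples and labels alpha_i in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
alpha = np.array([1, 1, -1, -1])

# Augmentation: prepend a constant 1 so the bias is absorbed into the weight vector a.
X_aug = np.hstack([np.ones((len(X), 1)), X])      # rows are (1, x1, x2)

# Incorporate labels: y_i = alpha_i * x_i, so correct classification means a^t y_i > 0.
Y = alpha[:, None] * X_aug

# Margin: require a^t y_i > b instead of just a^t y_i > 0.
b = 0.1
a = np.array([0.0, 1.0, 1.0])                     # some candidate weight vector
print(Y @ a > b)                                  # which samples satisfy the margin
```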

Learning Linear Classifiers: Basic Ideas

Directly “fit” a linear boundary in feature space.

• 1) Define an error function for classification
  – Number of errors?
  – Total distance of mislabeled points from the boundary?
  – Least-squares error
  – Within/between class scatter

• 2) Optimize the error function on training data
  – Gradient descent
  – Modified gradient descent
  – Various search procedures
  – How much data is needed?

Two-category case

• Given sample points x1, x2, …, xn with true category labels α1, α2, …, αn:

  αi = +1 if point xi is from class ω1
  αi = −1 if point xi is from class ω2

• Decisions are made according to:

  if aᵗxi > 0, class ω1 is chosen
  if aᵗxi < 0, class ω2 is chosen

• These decisions are wrong when aᵗxi is negative and xi belongs to class ω1 (or positive and xi belongs to ω2). Let yi = αi xi. Then aᵗyi > 0 when xi is correctly labelled, and aᵗyi < 0 otherwise.

• Separating vector (solution vector)
  – Find a such that aᵗyi > 0 for all i
  – aᵗy = 0 defines a hyperplane with a as its normal


• Problem: the solution vector is not unique

• Add additional constraints: a margin, i.e. a required distance away from the boundary, aᵗyi > b

Margins in data space


Larger margins promote uniqueness for underconstrained problems

Finding a solution vector by minimizing an error criterion

• What error function? How do we weight the points? All the points or only the error points?

• Only error points: the Perceptron Criterion

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$

Perceptron Criterion with Margin

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < b\}$
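A small sketch of the perceptron criterion (with an optional margin b) evaluated on label-normalized samples yi; the data and names are illustrative only:

```python
import numpy as np

def perceptron_criterion(a, Y, b=0.0):
    """J_P(a): sum of -(a^t y_i) over the samples with a^t y_i < b.

    Y holds the label-normalized, augmented samples y_i as rows.
    b = 0 gives the plain perceptron criterion; b > 0 adds a margin.
    """
    scores = Y @ a
    bad = scores < b                    # misclassified (or inside the margin)
    return -np.sum(scores[bad]), bad

Y = np.array([[ 1.0, 1.0, 2.0],
              [ 1.0, 2.0, 0.5],
              [-1.0, 1.0, 1.0],
              [-1.0, 2.0, 0.0]])        # made-up normalized samples
a = np.array([0.0, 1.0, -1.0])
print(perceptron_criterion(a, Y))       # criterion value and the misclassified mask
```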

Minimizing Error via Gradient Descent

• The min (max) problem:

  $\min_{\mathbf{x}} f(\mathbf{x})$

• But we learned in calculus how to solve that kind of question!

Motivation

• Not exactly…

• Functions: $f : \mathbb{R}^n \to \mathbb{R}$

• High-order polynomials, e.g. the Taylor expansion of sin x:

  $x - \frac{1}{6} x^3 + \frac{1}{120} x^5 - \frac{1}{5040} x^7$

• What about functions that don’t have an analytic representation: a “black box”, e.g.

  $f(x, y) = \cos\left(\tfrac{1}{2} x\right) \cos\left(\tfrac{1}{2} y\right) \, x$

Directional Derivatives: first, the one-dimensional derivative:

  $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

Directional Derivatives: Along the Axes…

  $\frac{\partial f(x, y)}{\partial x}, \qquad \frac{\partial f(x, y)}{\partial y}$

Directional Derivatives: In a general direction…

  $\mathbf{v} \in \mathbb{R}^2, \qquad \|\mathbf{v}\| = 1$

  $\frac{\partial f(x, y)}{\partial \mathbf{v}}$

Directional Derivatives

  $\frac{\partial f(x, y)}{\partial x}, \qquad \frac{\partial f(x, y)}{\partial y}$

The Gradient: Definition in $\mathbb{R}^2$

  $f : \mathbb{R}^2 \to \mathbb{R}$

  $\nabla f(x, y) := \left( \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y} \right)$

In the plane

The Gradient: Definition

  $f : \mathbb{R}^n \to \mathbb{R}$

  $\nabla f(x_1, \ldots, x_n) := \left( \frac{\partial f}{\partial x_1},\ \ldots,\ \frac{\partial f}{\partial x_n} \right)$

The Gradient Properties

• The gradient defines a (hyper)plane approximating the function infinitesimally:

  $\Delta z = \frac{\partial f}{\partial x} \, \Delta x + \frac{\partial f}{\partial y} \, \Delta y$

The Gradient Properties

• Computing directional derivatives: we want the rate of change along a direction v at a point p

• By the chain rule (important for later use), for $\|\mathbf{v}\| = 1$:

  $\frac{\partial f}{\partial \mathbf{v}}(p) = \left\langle (\nabla f)_p,\ \mathbf{v} \right\rangle$

The Gradient Properties

  $\frac{\partial f}{\partial \mathbf{v}}$ is maximal when choosing $\mathbf{v} = \frac{1}{\|(\nabla f)_p\|} (\nabla f)_p$

  and minimal when choosing $\mathbf{v} = \frac{-1}{\|(\nabla f)_p\|} (\nabla f)_p$

  (intuition: the gradient points in the direction of greatest change)
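A numeric sanity check of this claim on an assumed test function f(x, y) = x² + 3y² (not from the slides): among unit vectors, the directional derivative is largest along the normalized gradient.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])     # gradient of f(x, y) = x^2 + 3 y^2

p = np.array([1.0, -2.0])
g = grad_f(p)

# Directional derivative along a unit vector v is <grad f(p), v>.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
derivs = dirs @ g

best = dirs[np.argmax(derivs)]
print(best)                              # approximately grad f(p) / ||grad f(p)||
print(g / np.linalg.norm(g))
```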

The Gradient Properties

• Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function around p. If f has a local minimum (maximum) at p, then

  $(\nabla f)_p = \mathbf{0}$

  (a necessary condition for a local min (max))

The Gradient Properties: Intuition

The Gradient Properties

Formally: for any $\mathbf{v} \in \mathbb{R}^n \setminus \{\mathbf{0}\}$ we get:

  $0 = \left. \frac{d\, f(p + t \mathbf{v})}{dt} \right|_{t=0} = \left\langle (\nabla f)_p,\ \mathbf{v} \right\rangle \quad\Rightarrow\quad (\nabla f)_p = \mathbf{0}$

The Gradient Properties

• We found the best INFINITESIMAL DIRECTION at each point

• We are looking for the minimum using a “blind man” procedure

• How can we get to the minimum using this knowledge?

Steepest Descent

• Algorithm

  Initialization: $\mathbf{x}_0 \in \mathbb{R}^n$

  loop while $\|\nabla f(\mathbf{x}_i)\| > \varepsilon$
      compute the search direction $\mathbf{h}_i = \nabla f(\mathbf{x}_i)$
      update $\mathbf{x}_{i+1} = \mathbf{x}_i - \lambda \, \mathbf{h}_i$
  end loop
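A minimal numpy sketch of the loop above, with a fixed step size λ and an assumed quadratic test function (both made up for illustration):

```python
import numpy as np

def steepest_descent(grad, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """x_{i+1} = x_i - lam * grad(x_i), stopping once ||grad(x_i)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad(x)                      # search direction h_i = grad f(x_i)
        if np.linalg.norm(h) <= eps:
            break
        x = x - lam * h
    return x

# Test function f(x) = (x1 - 1)^2 + 2 (x2 + 3)^2, whose minimum is at (1, -3).
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(steepest_descent(grad, x0=[0.0, 0.0]))    # -> approximately [1, -3]
```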

Setting the Learning Rate Intelligently

Minimizing this approximation:

Minimizing the Perceptron Criterion

Perceptron Criterion

  $J_P(\mathbf{a}) = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i, \qquad Y = \{\mathbf{y}_i \mid \mathbf{a}^t \mathbf{y}_i < 0\}$

  $\nabla_{\mathbf{a}} J_P(\mathbf{a}) = \nabla_{\mathbf{a}} \sum_{\mathbf{y}_i \in Y} -\,\mathbf{a}^t \mathbf{y}_i = \sum_{\mathbf{y}_i \in Y} -\,\mathbf{y}_i$

  which gives the update rule:

  $\mathbf{a}(k+1) = \mathbf{a}(k) + \eta_k \sum_{\mathbf{y}_i \in Y_k} \mathbf{y}_i$

  where $Y_k$ is the set of samples misclassified at step k

Batch Perceptron Update
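Since the update box itself is not reproduced here, this is a hedged sketch of the batch perceptron rule a(k+1) = a(k) + η Σ yi derived above, with a constant learning rate; the data and function names are my own:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: a <- a + eta * sum of the currently misclassified y_i.

    Y holds the label-normalized, augmented samples y_i as rows.
    The loop terminates early only if the samples are linearly separable.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]        # Y_k (<= 0 so the all-zero start vector counts)
        if len(misclassified) == 0:
            return a                         # separating vector found
        a = a + eta * misclassified.sum(axis=0)
    return a                                 # may not separate after max_iter

# Two separable classes in 2-D, then augmented and label-normalized as before.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.0]])
alpha = np.array([1, 1, -1, -1])
Y = alpha[:, None] * np.hstack([np.ones((4, 1)), X])
print(batch_perceptron(Y))                   # a separating vector a
```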

Simplest Case

When does the perceptron rule converge?

Four Learning Criteria

Different error functions as a function of the weights in a linear classifier:

At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures.

At the upper right is the Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient descent.

The lower left is squared error (Eq. 32), which has nice analytic properties and is useful even when the patterns are not linearly separable.

The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region in hopes of improving generalization of the resulting classifier.

Three Different Issues

1) Class boundary expressiveness
2) Error definition
3) Learning method

Fisher’s Linear Discriminant

Fisher’s Linear Discriminant: Intuition

We are looking for a good projection

Write this in terms of w

So,

Next,

Define the scatter:

Then,

Thus the denominator can be written:

Rayleigh Quotient

Maximizing is equivalent to solving:

With solution:
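The equations themselves are not reproduced above, so as a hedged sketch: for the standard two-class Fisher discriminant, maximizing the Rayleigh quotient J(w) = (wᵗ S_B w) / (wᵗ S_W w) leads to the generalized eigenvalue problem S_B w = λ S_W w, whose two-class solution is w ∝ S_W⁻¹(m1 - m2). A minimal numpy version with made-up data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher linear discriminant: w proportional to S_W^{-1} (m1 - m2),
    where S_W is the within-class scatter matrix (standard result)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Made-up 2-D data for two classes.
rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X2 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
print(fisher_direction(X1, X2))     # projection direction maximizing class separation
```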
