Support Vector Machines (SVM): Discussion Overview

1. Importance of SVMs
2. Overview of Mathematical Techniques Employed
3. Margin Geometry
4. SVM Training Methodology
5. Overlapping Distributions
6. Dealing with Multiple Classes
7. SVM and Computational Learning Theory
8. Relevance Vector Machines

1. Importance of SVM
• SVM is a discriminative method that brings together:
  1. computational learning theory
  2. previously known methods in linear discriminant functions
  3. optimization theory
• Also called sparse kernel machines
  • Kernel methods predict based on linear combinations of a kernel function evaluated at the training points, e.g., the Parzen window
  • Sparse because not all pairs of training points need be used
• Also called maximum margin classifiers
• Widely used for solving problems in classification, regression and novelty detection

2. Mathematical Techniques Used

1. The linearly separable case is considered, since with an appropriate nonlinear mapping φ to a sufficiently high dimension, two categories are always separable by a hyperplane
2. To handle non-linear separability:
  • Preprocess the data to represent it in a much higher-dimensional space than the original feature space
  • The kernel trick reduces the computational overhead

3. Support Vectors and Margin

• Support vectors are those nearest patterns, at distance b from the hyperplane
• SVM finds the hyperplane with maximum distance (margin distance b) from the nearest training patterns

(Figure: three support vectors are shown as solid dots.)

Margin Maximization

• Why maximize the margin?
• Motivation is found in computational learning theory, or statistical learning theory (PAC learning, VC dimension)
• Insight given as follows (Tong and Koller, 2000):
  • Model the distribution for each class using Parzen density estimators with Gaussian kernels having common parameter σ²
  • Instead of the optimum boundary, determine the best hyperplane relative to the learned density model
  • As σ² → 0 the optimum hyperplane has maximum margin
  • The hyperplane becomes independent of data points that are not support vectors

Distance from an Arbitrary Point to the Plane
• Hyperplane: g(x) = w^t x + w_0, where w is the weight vector and w_0 is the bias
• Lemma: the distance from x to the plane is r = g(x) / ||w||
• Proof: write x = x_p + r w/||w||, where x_p is the projection of x onto the plane, so that g(x_p) = 0. Then
$$g(x) = w^t\left(x_p + r\frac{w}{\|w\|}\right) + w_0 = w^t x_p + w_0 + r\frac{w^t w}{\|w\|} = g(x_p) + r\|w\| = r\|w\|$$
so r = g(x)/||w||. QED

Corollary: the distance from the origin to the plane is
$$r = \frac{g(0)}{\|w\|} = \frac{w_0}{\|w\|}, \qquad \text{since } g(0) = w^t 0 + w_0 = w_0$$
Thus w_0 = 0 implies that the plane passes through the origin.
(Figure: the plane g(x) = w^t x + w_0 = 0, with x = x_p + r w/||w||; the regions g(x) > 0 and g(x) < 0 lie on either side.)
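A quick numerical sanity check (my own addition; the values of w, w_0 and x are arbitrary) of the lemma and corollary: the signed quantity g(x)/||w|| equals the distance from x to its projection x_p onto the plane.

```python
import numpy as np

w = np.array([3.0, -4.0])   # arbitrary weight vector (||w|| = 5)
w0 = 2.0                    # arbitrary bias
x = np.array([1.0, 2.0])    # arbitrary point

g = w @ x + w0                      # g(x) = w^t x + w_0
r = g / np.linalg.norm(w)           # claimed signed distance to the plane

# Project x onto the plane g = 0 and measure the distance directly
x_p = x - r * w / np.linalg.norm(w)
print("g(x_p), should be ~0:", w @ x_p + w0)
print("signed r =", r, "  ||x - x_p|| =", np.linalg.norm(x - x_p))
```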

Choosing a Margin
• Augmented space: g(y) = a^t y, by choosing a_0 = w_0 and y_0 = 1, i.e., the plane passes through the origin

• For each of the patterns, let z_k = ±1 according to whether pattern k is in class ω_1 or ω_2

• Thus if g(y)=0 is a separating hyperplane then

z_k g(y_k) > 0, k = 1, ..., n
• Since the distance of a point y to the hyperplane g(y) = a^t y = 0 is g(y)/||a||, we could require that the hyperplane be such that all points are at least a distance b from it, i.e.,

$$\frac{z_k\, g(y_k)}{\|a\|} \ge b$$

SVM Margin Geometry

(Figure: margin geometry in the (y_1, y_2) plane, showing the hyperplanes g(y) = a^t y = 0, g(y) = 1 and g(y) = -1. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes, which has length 2/||a||.)

Statement of the Optimization Problem

• The goal is to find the weight vector a that satisfies
$$\frac{z_k\, g(y_k)}{\|a\|} \ge b, \qquad k = 1, \ldots, n$$
while maximizing b
• To ensure uniqueness we impose the constraint b ||a|| = 1, i.e., b = 1/||a||
• This implies that we also require ||a||² to be minimized
• Support vectors are the (transformed) training patterns for which equality holds in the constraint above
• This is called a quadratic optimization problem since we are trying to minimize a quadratic function subject to a set of linear inequality constraints

4. SVM Training Methodology

1. Training is formulated as an optimization problem
  • The dual problem is stated to reduce computational complexity
  • The kernel trick is used to reduce computation
2. Determination of the model parameters corresponds to a convex optimization problem
  • The solution is straightforward (a local solution is a global optimum)
3. Makes use of Lagrange multipliers

Joseph-Louis Lagrange (1736-1813)

• French mathematician
• Born in Turin, Italy
• Succeeded Euler at the Berlin Academy
• Narrowly escaped execution during the French Revolution, thanks to Lavoisier, who was himself guillotined
• Made key contributions to calculus and dynamics

SVM Training: Optimization Problem

Optimize
$$\arg\min_{a,b}\ \frac{1}{2}\|a\|^2$$
subject to the constraints
$$z_k a^t y_k \ge 1, \qquad k = 1, \ldots, n$$
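For concreteness (this is my own sketch, not part of the slides), the primal problem can be handed directly to a generic constrained optimizer; below it is solved with scipy's SLSQP for a tiny linearly separable dataset in augmented form. The data, the starting point and the solver choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in augmented form y = (1, x1, x2),
# with labels z in {+1, -1} (assumed for illustration)
Y = np.array([[1.0,  2.0,  2.0],
              [1.0,  2.5,  1.5],
              [1.0, -2.0, -1.0],
              [1.0, -1.5, -2.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])

# Primal objective: (1/2)||a||^2
objective = lambda a: 0.5 * np.dot(a, a)

# One inequality constraint per pattern: z_k a^t y_k - 1 >= 0
constraints = [{"type": "ineq", "fun": lambda a, k=k: z[k] * np.dot(a, Y[k]) - 1.0}
               for k in range(len(z))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
a = res.x
print("weight vector a:", a)
print("margin b = 1/||a|| =", 1.0 / np.linalg.norm(a))
# Support vectors are the patterns where z_k a^t y_k is (numerically) equal to 1
print("constraint values z_k a^t y_k:", z * (Y @ a))
```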

• This can be cast as an unconstrained problem by introducing Lagrange undetermined multipliers α_k, with one multiplier for each constraint

• The Lagrange function is
$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k a^t y_k - 1 \right]$$

Optimization of the Lagrange Function

• The Lagrange function is

$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k a^t y_k - 1 \right]$$

• We seek to minimize L(a, α)
  • with respect to the weight vector a, and
  • maximize it with respect to the undetermined multipliers α_k ≥ 0
• The last term represents the goal of classifying the points correctly
• The Karush-Kuhn-Tucker construction shows that this can be recast as a maximization problem, which is computationally better

Dual Optimization Problem

• The problem is reformulated as one of maximizing
$$L(\alpha) = \sum_{k=1}^{n} \alpha_k - \frac{1}{2}\sum_{k=1}^{n}\sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j\, k(y_j, y_k)$$
• Subject to the constraints

$$\sum_{k=1}^{n} \alpha_k z_k = 0, \qquad \alpha_k \ge 0,\ k = 1, \ldots, n$$
given the training data

• where the kernel function is defined by

$$k(y_j, y_k) = y_j^t y_k = \phi(x_j)^t \phi(x_k)$$

Solution of the Dual Problem

• Implementation:
  • Solved using quadratic programming
  • Alternatively, since it needs only inner products of the training data, it can be implemented using kernel functions
    • which is a crucial property for generalizing to the non-linear case
• The solution is given by

$$a = \sum_{k} \alpha_k z_k y_k$$
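An illustrative sketch (mine, with an assumed toy dataset): scikit-learn's SVC solves this dual problem internally via libsvm, and the weight vector can be reconstructed from the resulting dual coefficients using a = Σ_k α_k z_k y_k.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
              [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]])
z = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin SVM; the dual above is what libsvm solves
clf = SVC(kernel="linear", C=1e6).fit(X, z)

# dual_coef_ holds alpha_k * z_k for the support vectors only
alpha_z = clf.dual_coef_.ravel()
sv = clf.support_vectors_

# Reconstruct the weight vector: a = sum_k alpha_k z_k y_k
w = alpha_z @ sv
print("support vectors:\n", sv)
print("w from dual coefficients:", w)
print("w from sklearn:          ", clf.coef_.ravel())   # should match
```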

Summary of SVM Optimization Problems

(Figure: a summary of the SVM optimization problems, shown in a different notation from these slides, with the quadratic term of the objective highlighted.)

Kernel Function: Key Property

• If the kernel function is chosen with the property K(x, y) = φ(x)·φ(y), then the computational expense of the increased dimensionality is avoided.

• The polynomial kernel K(x, y) = (x·y)^d can be shown (next slide) to correspond to a map φ into the space spanned by all products of exactly d input coordinates.

A Polynomial Kernel Function

Suppose x = (x_1, x_2) is the input vector. The feature space mapping is:

$$\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)^t$$
Then the inner product is the same:
$$\phi(x)^t\phi(y) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)(y_1^2,\ \sqrt{2}\,y_1 y_2,\ y_2^2)^t = (x_1 y_1 + x_2 y_2)^2$$
The polynomial kernel function that computes the same value is

$$K(x, y) = (x \cdot y)^2 = \left((x_1, x_2)(y_1, y_2)^t\right)^2 = (x_1 y_1 + x_2 y_2)^2$$
or K(x, y) = φ(x)·φ(y)
• Computing the inner product φ(x)·φ(y) directly needs six feature values and 3 × 3 = 9 multiplications
• The kernel function K(x, y) needs 2 multiplications and a squaring

Another Polynomial (Quadratic) Kernel Function

• K(x, y) = (x·y + 1)²
• This one maps d = 2, p = 2 into a six-dimensional space
• It contains all monomials of x up to degree 2, including the lower-order terms

$$K(x, y) = \phi(x)^t \phi(y), \quad \text{where} \quad \phi(x) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ 1\right)^t$$

• The inner product needs 36 multiplications
• The kernel function needs 4 multiplications
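As a quick numerical check (my own addition; the test vectors are arbitrary), the sketch below verifies for both quadratic kernels that the explicit feature-space inner product φ(x)·φ(y) equals the kernel value computed directly in the input space.

```python
import numpy as np

def phi_d2(x):
    """Feature map for K(x, y) = (x . y)^2 with x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def phi_d2_plus1(x):
    """Feature map for K(x, y) = (x . y + 1)^2 with x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, 1.0])

x = np.array([1.5, -0.5])   # arbitrary test vectors
y = np.array([0.3, 2.0])

# (x . y)^2 computed in feature space and in input space
print(phi_d2(x) @ phi_d2(y), (x @ y) ** 2)

# (x . y + 1)^2 computed in feature space and in input space
print(phi_d2_plus1(x) @ phi_d2_plus1(y), (x @ y + 1) ** 2)
```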

Non-Linear Case

• Map with a function φ(·) to a sufficiently high dimension
• So that data from the two categories can always be separated by a hyperplane

• Assume each pattern xk has been transformed to

y_k = φ(x_k), for k = 1, ..., n
• First choose the non-linear φ functions
  • to map the input vector to a higher-dimensional feature space
• The dimensionality of the space can be arbitrarily high, limited only by computational resources

Mapping into a Higher-Dimensional Feature Space

• Map each input point x using the mapping

$$y = \Phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}$$

Points on a 1-d line are mapped onto a curve in 3-d.

• Linear separation in the 3-d space is possible. The linear discriminant function in 3-d has the form

g(x) = a_1 y_1 + a_2 y_2 + a_3 y_3

Pattern Transformation using Kernels
• Problem with high-dimensional mappings:
  • Very many parameters
  • A polynomial of degree p over d variables leads to O(d^p) variables in the feature space
  • Example: if d = 50 and p = 2 we need a feature space of size 2500
• Solution:
  • The dual optimization problem needs only inner products (see the sketch below)

• Each pattern x_k is transformed into a pattern y_k, where y_k = Φ(x_k)

• Dimensionality of mapped space can be arbitrarily high
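To make the preceding two slides concrete, here is a small sketch (my own, with an assumed toy dataset) using the mapping Φ(x) = (1, x, x²) from above: data that are not linearly separable on the 1-d line become linearly separable after the mapping, and a linear SVM, which only ever uses inner products of the mapped patterns, separates them.

```python
import numpy as np
from sklearn.svm import SVC

# 1-d data that is not linearly separable on the line:
# class +1 for |x| <= 1, class -1 for |x| > 1 (assumed toy example)
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.7, 1.8, 2.5])
z = np.where(np.abs(x) <= 1.0, 1, -1)

# Map each point to 3-d via Phi(x) = (1, x, x^2)
Y = np.column_stack([np.ones_like(x), x, x**2])

# A linear SVM separates the mapped data perfectly ...
clf3d = SVC(kernel="linear", C=1e6).fit(Y, z)
print("training accuracy in 3-d:", clf3d.score(Y, z))        # 1.0

# ... whereas no single threshold separates the original 1-d data
clf1d = SVC(kernel="linear", C=1e6).fit(x.reshape(-1, 1), z)
print("training accuracy in 1-d:", clf1d.score(x.reshape(-1, 1), z))  # < 1.0
```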

Example of SVM Results
• Two classes in two dimensions
• Synthetic data
• The figure shows contours of constant g(x)
  • obtained from an SVM with a Gaussian kernel function
• The decision boundary is shown
• The margin boundaries are shown
• The support vectors are shown

• Shows the sparsity of the SVM

Demo: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

svm-toy.exe
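In the same spirit as the svm-toy demo, here is a minimal scikit-learn sketch (not part of the original slides; the synthetic data and parameter values are my own choices) that fits an SVM with a Gaussian (RBF) kernel to two overlapping 2-d classes and reports how sparse the solution is.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic 2-d classes (assumed toy data)
X = np.vstack([rng.normal([-1.0, -1.0], 0.8, size=(100, 2)),
               rng.normal([+1.0, +1.0], 0.8, size=(100, 2))])
z = np.array([-1] * 100 + [+1] * 100)

# Gaussian (RBF) kernel SVM, as in the figure above
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, z)

# Sparsity: only a fraction of the 200 training points become support vectors
print("support vectors per class:", clf.n_support_)
print("fraction of training points used:", clf.support_vectors_.shape[0] / len(X))

# g(x) values give the decision-boundary and margin contours shown in the figure
print("decision function at the origin:", clf.decision_function([[0.0, 0.0]]))
```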

SVM for the XOR Problem

• XOR: binary-valued features x_1, x_2
• Not solved by a linear discriminant
• The function φ maps the input x = [x_1, x_2] into the six-dimensional feature space
$$y = \left[1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2\right]$$
(Figure: the input space, showing the hyperbolas corresponding to x_1 x_2 = ±1, and a feature sub-space, showing the lines corresponding to √2 x_1 x_2 = ±1.)

SVM for XOR: Maximization Problem

• We seek to maximize
$$\sum_{k=1}^{4} \alpha_k - \frac{1}{2}\sum_{k=1}^{4}\sum_{j=1}^{4} \alpha_k \alpha_j z_j z_k\, y_j^t y_k$$
• Subject to the constraints

α_1 − α_2 + α_3 − α_4 = 0

0 ≤ α_k, k = 1, 2, 3, 4

• From problem symmetry, at the solution

α_1 = α_3 and α_2 = α_4

SVM for XOR: Maximization Problem

• Can use iterative gradient descent, or analytical techniques for a small problem
• The solution is a* = (1/8, 1/8, 1/8, 1/8)
• The last term of the optimization problem implies that all four points are support vectors (unusual, and due to the symmetric nature of XOR)
• The final discriminant function is g(x_1, x_2) = x_1 x_2
• The decision hyperplane is defined by g(x_1, x_2) = 0
• The margin is given by b = 1/||a|| = √2
(Figure: in the feature sub-space, the hyperplanes corresponding to √2 x_1 x_2 = ±1; in the input space, the hyperbolas corresponding to x_1 x_2 = ±1.)
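As an unofficial numerical check of this solution (not in the original slides), the sketch below fits an SVM to the four XOR points with scikit-learn's polynomial kernel set to (x·y + 1)², which matches the six-dimensional feature map above, and confirms that all four points become support vectors; the dual coefficients should come out as ±1/8.

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR patterns; labels z_k = x1 * x2, matching g(x1, x2) = x1 x2
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
z = np.array([1, -1, -1, 1])

# Polynomial kernel (gamma * x.y + coef0)^degree = (x.y + 1)^2,
# which corresponds to the six-dimensional feature map on the slide
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, z)

print("support vectors:\n", clf.support_vectors_)          # all four points
print("dual coefficients (alpha_k * z_k):", clf.dual_coef_)  # should be +/- 1/8
print("decision values g(x):", clf.decision_function(X))     # close to +/- 1
```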

5. Overlapping Class Distributions

• We have assumed that the training data are linearly separable in the mapped space y
• The resulting SVM gives exact separation in the input space x, although the decision boundary will be nonlinear
• In practice the class-conditional distributions will overlap
  • in which case exact separation of the training data leads to poor generalization
• Therefore we need to allow the SVM to misclassify some training points

ν-SVM Applied to Non-Separable Data

• Support vectors are indicated by circles in the figure
• This is achieved by introducing slack variables
  • with one slack variable per training data point
• Maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary
• ν is an upper bound on the fraction of margin errors (points that lie on the wrong side of the margin boundary); see the sketch below
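A small sketch of the ν-SVM (my own illustration; the synthetic data and parameter values are assumed): scikit-learn's NuSVC exposes ν directly, and the fraction of support vectors it returns is bounded below by ν, while the fraction of margin errors is bounded above by ν.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)

# Overlapping 2-d classes (assumed toy data)
X = np.vstack([rng.normal([-0.5, -0.5], 1.0, size=(200, 2)),
               rng.normal([+0.5, +0.5], 1.0, size=(200, 2))])
z = np.array([-1] * 200 + [+1] * 200)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, z)
    frac_sv = clf.support_vectors_.shape[0] / len(X)
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f} (>= nu)")
```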

6. Multiclass SVMs (One-Versus-Rest)

• SVM is fundamentally a two-class classifier
• Several methods have been suggested for combining multiple two-class classifiers
• The most widely used approach is one-versus-rest
  • also recommended by Vapnik

• An SVM is trained for each class C_k, using the data from class C_k as the positive examples and the data from the remaining K−1 classes as the negative examples
• Disadvantages:
  • an input can be assigned to multiple classes simultaneously
  • the training sets are imbalanced (e.g., 90% of the examples belong to one class and 10% to the other), so the symmetry of the original problem is lost

Multiclass SVMs (One-Versus-One)

• Train K(K−1)/2 different two-class SVMs on all possible pairs of classes
• Classify test points according to which class has the highest number of votes
  • this again leads to ambiguities in classification
• For large K this requires significantly more training time than one-versus-rest
  • and also more computation time for evaluation
  • this can be alleviated by organizing the classifiers into a directed acyclic graph (DAGSVM)
(Both strategies are illustrated in the sketch below.)
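An illustrative sketch (mine; the dataset and kernel settings are assumptions) of the two strategies using scikit-learn's wrappers: one-versus-rest trains K binary SVMs, one-versus-one trains K(K−1)/2.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 10 classes, so K = 10

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.001)).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma=0.001)).fit(X, y)

print("one-vs-rest: binary SVMs trained =", len(ovr.estimators_))   # K = 10
print("one-vs-one:  binary SVMs trained =", len(ovo.estimators_))   # K(K-1)/2 = 45
# Note: SVC itself already uses a one-vs-one decomposition internally
# for multiclass problems.
```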

7. SVM and Computational Learning Theory

• SVM is historically motivated and analyzed using a theoretical framework called computational learning theory
  • also called the Probably Approximately Correct (PAC) learning framework
• The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization
• A key quantity in PAC learning is the Vapnik-Chervonenkis (VC) dimension, which provides a measure of the complexity of a space of functions

All dichotomies of 3 points in 2 dimensions are linearly separable

VC Dimension of Hyperplanes in R^2

The VC dimension provides a measure of the complexity of a class of decision functions

Hyperplanes in R^d ⇒ VC dimension = d + 1
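A small illustrative sketch (my own; the point configurations and the use of a large-C linear SVC as a separability test are assumptions): it checks that every dichotomy of 3 points in general position in R² is linearly separable, while 4 points admit a dichotomy (the XOR labelling) that is not, consistent with a VC dimension of d + 1 = 3.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def shatters(points):
    """Return True if a linear classifier realizes every +/-1 labelling."""
    n = len(points)
    for labels in itertools.product([-1, 1], repeat=n):
        if len(set(labels)) == 1:
            continue  # trivial single-class labellings need no separator
        clf = SVC(kernel="linear", C=1e9).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # contains the XOR labelling

print("3 points shattered:", shatters(three))   # True
print("4 points shattered:", shatters(four))    # False: the XOR dichotomy fails
```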

Fraction of Dichotomies that are Linearly Separable

• When the number of points is no more than the dimensionality plus one (n ≤ d + 1), all dichotomies are linearly separable
• The fraction of dichotomies of n points in d dimensions that are linearly separable is
$$f(n, d) = \begin{cases} 1 & n \le d + 1 \\[4pt] \dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d + 1 \end{cases}$$
• At n = 2(d + 1), called the capacity of the hyperplane, nearly one half of the dichotomies are still linearly separable
• The hyperplane is not over-determined until the number of samples is several times the dimensionality
(Figure: f(n, d) plotted against n/(d + 1); at n/(d + 1) = 2 the fraction is 0.5.)
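A short computational check (my own addition) of this formula: it evaluates f(n, d) with Python's math.comb and confirms that at the capacity n = 2(d + 1) exactly half of the dichotomies are linearly separable.

```python
from math import comb

def f(n, d):
    """Fraction of dichotomies of n points in general position in d dimensions
    that are linearly separable, as in the formula on the slide."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

for d in (1, 2, 5, 10):
    n = 2 * (d + 1)            # the capacity of a hyperplane in d dimensions
    print(f"d={d}: f({n}, {d}) = {f(n, d)}")   # 0.5 in every case
```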

Capacity of a Line when d = 2
(Figure: dichotomies of points in the plane; some separable cases and some non-separable cases are shown.)
• At the capacity, n = 2(d + 1) = 6, half of the dichotomies are linearly separable, i.e., f(6, 2) = 0.5

VC dimension = d + 1 = 3

Possible Method of Training an SVM

• Based on a modification of the training rule given below

Instead of all misclassified samples, use the worst-classified samples.
• Support vectors are the worst-classified samples
  • support vectors are training samples that define the optimal separating hyperplane
  • they are the most difficult to classify
  • they are the patterns most informative for the classification task
  • the worst-classified pattern at any stage is the one on the wrong side of the decision boundary farthest from the boundary
  • at the end of the training period such a pattern will be one of the support vectors
• Finding the worst-case pattern is computationally expensive
  • for each update we need to search through the entire training set to find the worst-classified sample
  • only used for small problems
  • the more commonly used method is different

Generalization Error of SVM

• If there are n training patterns, the expected value of the generalization error rate is bounded according to
$$\mathcal{E}_n[P(\text{error})] \le \frac{\mathcal{E}_n[\text{No. of support vectors}]}{n}$$
  i.e., the expected error rate is at most the expected number of support vectors divided by n (illustrated in the sketch below)
  • where the expectation is over all training sets of size n (drawn from the distributions describing the categories)
  • this also means that the error rate on the support vectors will be n times the error rate on the total sample
• Leave-one-out bound:
  • if we have n points in the training set, train the SVM on n − 1 of them and test on the single remaining point
  • if the left-out point is a support vector there may be an error; if it is not, it is classified correctly
• If we find a transformation φ that separates the data well, then
  – the expected number of support vectors is small
  – the expected error rate is small
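An informal illustration (my own; synthetic data and parameters are assumed) of the bound: the fraction of support vectors on the full training set is compared with the actual leave-one-out error. The bound holds in expectation over training sets, so on a single dataset this is only indicative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Overlapping 2-d classes (assumed toy data)
X = np.vstack([rng.normal([-1.0, -1.0], 1.0, size=(60, 2)),
               rng.normal([+1.0, +1.0], 1.0, size=(60, 2))])
z = np.array([-1] * 60 + [+1] * 60)
n = len(X)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0)

# Bound estimate: (number of support vectors) / n on the full training set
n_sv = clf.fit(X, z).support_vectors_.shape[0]
print("support-vector bound on LOO error:", n_sv / n)

# Actual leave-one-out error for comparison
loo_acc = cross_val_score(clf, X, z, cv=LeaveOneOut()).mean()
print("actual leave-one-out error:       ", 1.0 - loo_acc)
```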

8. Relevance Vector Machines
• Addresses several limitations of SVMs:
  • The SVM does not provide a posteriori probabilities; relevance vector machines provide such an output
  • Extension of the SVM to multiple classes is problematic
  • There is a complexity parameter C (or ν) that must be found using a hold-out method
  • Predictions are linear combinations of kernel functions that are centered on training data points and that must be positive definite
