
Kernel Methods

Sargur Srihari
[email protected]

Topics in Kernel Methods
1. Linear Models vs Memory-based Models
2. Stored Sample Methods
3. Kernel Functions
   • Dual Representations
   • Constructing Kernels
4. Extension to Symbolic Inputs
5. Fisher Kernel

Linear Models vs Memory-based Models
• Linear parametric models for regression and classification have the form y(x, w)
• During learning we obtain either a maximum likelihood estimate of w or a posterior distribution over w
• The training data is then discarded
• Prediction is based only on the vector w
  • This is true of neural networks as well
• Memory-based methods keep the training samples (or a subset of them) and use them during the prediction phase

Memory-Based Methods
• Training data points are used in prediction
• Examples of such methods:
  • Parzen probability density model
    • A linear combination of kernel functions centered on each training data point
  • Nearest-neighbor classification
• These are memory-based methods
• They require a metric to be defined
• Fast to train, slow to predict

Kernel Functions
• Linear models can be re-cast into an equivalent dual representation in which predictions are based on kernel functions evaluated at the training points
• The kernel function is given by
  k(x, x') = \phi(x)^T \phi(x')
  where φ(x) is a fixed nonlinear mapping (basis function)
• The kernel is a symmetric function of its arguments: k(x, x') = k(x', x)
• The kernel can be interpreted as the similarity of x and x'
• The simplest case is the identity mapping in feature space, φ(x) = x, in which case k(x, x') = x^T x'
  • Called the linear kernel

Kernel Trick
• Formulation in terms of inner products allows well-known algorithms to be extended by using the kernel trick
• Basic idea of the kernel trick:
  • If an input vector x appears only in the form of scalar products, then we can replace those scalar products with some other choice of kernel
• Used widely:
  • in support vector machines
  • in developing a nonlinear variant of PCA
  • in the kernel Fisher discriminant

Other Forms of Kernel Functions
• Function of the difference between arguments: k(x, x') = k(x − x')
  • Called a stationary kernel since it is invariant to translation
• Radial basis functions: k(x, x') = k(‖x − x'‖)
  • Depends only on the magnitude of the distance between the arguments
• Note that the kernel function is a scalar while x is M-dimensional
• For these to be valid kernel functions they must be shown to have the property k(x, x') = φ(x)^T φ(x') (a minimal numeric check of this identity is sketched below)
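The following sketch (not part of the original slides) illustrates the defining property numerically: it builds a feature vector φ(x) from a handful of Gaussian basis functions and evaluates k(x, x') as an explicit inner product φ(x)^T φ(x'). The choice of Gaussian basis functions, their centers and width, and the test points are arbitrary assumptions made only for illustration.

```python
# Minimal sketch: a kernel defined through an explicit set of basis functions,
# k(x, x') = phi(x)^T phi(x').  Centers, width, and test points are arbitrary.
import numpy as np

def phi(x, centers, s=1.0):
    """Feature vector: one Gaussian basis function per center, evaluated at scalar x."""
    return np.exp(-(x - centers) ** 2 / (2 * s ** 2))

def k(x, x_prime, centers, s=1.0):
    """Kernel as the inner product of the two feature vectors."""
    return phi(x, centers, s) @ phi(x_prime, centers, s)

centers = np.linspace(-1.0, 1.0, 5)                     # M = 5 basis functions
print(k(0.3, -0.2, centers))                            # similarity of x = 0.3 and x' = -0.2
print(k(0.3, -0.2, centers) == k(-0.2, 0.3, centers))   # symmetric in its arguments: True
```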
Dual Representation
• Linear models for regression/classification can be reformulated in terms of a dual representation
  • in which the kernel function arises naturally
  • Plays an important role in SVMs
• Consider a linear regression model whose parameters are obtained by minimizing the regularized sum-of-squares error
  J(w) = \frac{1}{2} \sum_{n=1}^{N} \{ w^T \phi(x_n) - t_n \}^2 + \frac{\lambda}{2} w^T w
  where w = (w_0, …, w_{M−1})^T, φ = (φ_0, …, φ_{M−1})^T is the set of M basis functions (feature vector), we have N samples {x_1, …, x_N}, and λ is the regularization coefficient
• The minimum is obtained by setting the gradient of J(w) with respect to w equal to zero

Solution for w: a linear combination of φ(x_n)
• Equating the derivative of J(w) with respect to w to zero we get
  w = -\frac{1}{\lambda} \sum_{n=1}^{N} \{ w^T \phi(x_n) - t_n \} \phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^T a
• The solution for w is a linear combination of the vectors φ(x_n), whose coefficients are functions of w, where
  • Φ is the design matrix whose nth row is given by φ(x_n)^T:
    \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}  is an N × M matrix
  • The vector a = (a_1, …, a_N)^T has the definition
    a_n = -\frac{1}{\lambda} \{ w^T \phi(x_n) - t_n \}

Transformation from w to a
• Thus we have w = Φ^T a
• Instead of working with the parameter vector w we can reformulate the least-squares algorithm in terms of the parameter vector a
  • giving rise to the dual representation
• Although the definition a_n = -\frac{1}{\lambda} \{ w^T \phi(x_n) - t_n \} still includes w, it can be eliminated by the use of the kernel function

Gram Matrix and Kernel Function
• The Gram matrix K = ΦΦ^T is an N × N matrix with elements
  K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)
  where the kernel function is k(x, x') = φ(x)^T φ(x')
• Gram matrix definition: given N vectors, it is the matrix of all inner products
  K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{pmatrix}
• Notes:
  • Φ is N × M and K is N × N
  • K is a matrix of similarities of pairs of samples (thus it is symmetric)

Error Function in Terms of Gram Matrix
• The error function is
  J(w) = \frac{1}{2} \sum_{n=1}^{N} \{ w^T \phi(x_n) - t_n \}^2 + \frac{\lambda}{2} w^T w
• Substituting w = Φ^T a into J(w) gives
  J(a) = \frac{1}{2} a^T \Phi\Phi^T \Phi\Phi^T a - a^T \Phi\Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi\Phi^T a
  where t = (t_1, …, t_N)^T
• Written in terms of the Gram matrix, the error function is
  J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a
• Solving for a by combining w = Φ^T a and a_n = -\frac{1}{\lambda} \{ w^T \phi(x_n) - t_n \}:
  a = (K + \lambda I_N)^{-1} t
• The solution for a is a linear combination of the elements of φ(x), with coefficients expressed entirely in terms of the kernel k(x, x'), from which we can recover the original formulation in terms of the parameters w

Prediction Function
• Prediction for a new input x
• We can write a = (K + λ I_N)^{−1} t by combining w = Φ^T a and a_n = -\frac{1}{\lambda} \{ w^T \phi(x_n) - t_n \}
• Substituting back into the linear regression model,
  y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t
  where k(x) has elements k_n(x) = k(x_n, x)
• The prediction is a linear combination of the target values from the training set (a sketch of this dual-form solution is given below)
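A minimal sketch of the dual-form solution and prediction above (dual, or kernel, ridge regression), assuming a Gaussian (RBF) kernel and synthetic one-dimensional data; the kernel width, the regularization value, and the data are arbitrary choices for illustration, not part of the slides.

```python
# Minimal sketch of a = (K + lambda*I)^{-1} t and y(x) = k(x)^T (K + lambda*I)^{-1} t,
# assuming an RBF kernel.  Data, kernel width, and lambda are arbitrary.
import numpy as np

def rbf_gram(X1, X2, s=0.5):
    """Gram matrix with K[n, m] = k(x_n, x_m) = exp(-||x_n - x_m||^2 / (2 s^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * s**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                   # N = 30 training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)    # noisy targets

lam = 0.1
K = rbf_gram(X, X)                                     # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), t)       # dual parameters a

X_new = np.array([[0.5], [2.0]])
y_new = rbf_gram(X_new, X) @ a                         # k(x)^T a for each new input
print(y_new)                                           # predictions, roughly sin(0.5), sin(2.0)
```

Note that the prediction uses only kernel evaluations against the stored training inputs, never the feature vectors themselves.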
Advantage of Dual Representation
• The solution for a is entirely in terms of the kernel k(x, x')
• From a we can recover w using w = Φ^T a
• In the parametric formulation, the ML solution is w = (\Phi^T \Phi)^{-1} \Phi^T t
  • We invert an M × M matrix, since Φ is N × M
• In the dual formulation, the solution is a = (K + \lambda I_N)^{-1} t
  • We are inverting an N × N matrix
  • An apparent disadvantage, since typically N >> M
• Advantage of the dual: we work with the kernel k(x, x'), so we
  • avoid working with a feature vector φ(x), and
  • avoid the problems associated with the very high or infinite dimensionality of φ(x)

Constructing Kernels
• To exploit kernel substitution we need valid kernels
• Two methods:
  • (1) from φ(x) to k(x, x')
  • (2) from k(x, x') to φ(x)
• First method: choose φ(x) and use it to find the corresponding kernel
  k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')
  where the φ_i(x) are basis functions
• A one-dimensional input space is shown next

Kernel Functions from Basis Functions
[Figure: for a one-dimensional input space, polynomial, Gaussian, and logistic-sigmoid basis functions φ_i(x), and the corresponding kernel functions k(x, x') = φ(x)^T φ(x'); the red cross marks the fixed argument x']

Method 2: Direct Construction of Kernels
• The kernel must correspond to a scalar product in some (perhaps infinite-dimensional) feature space
• Consider the kernel function k(x, z) = (x^T z)^2. In a 2-D space,
  k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
          = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
          = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T
          = \phi(x)^T \phi(z)
• The feature mapping takes the form \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T
  • It comprises all second-order terms with a specific weighting
• The inner product needs the computation of 6 feature values and 9 multiplications
  • The kernel function k(x, z) needs only 2 multiplies and a squaring
• For (x^T z + c)^2 we get constant, linear, and second-order terms
• For (x^T z + c)^M we get all powers of x up to order M (monomials)

Testing Whether a Function Is a Valid Kernel
• Without constructing φ(x), a necessary and sufficient condition for k(x, x') to be a valid kernel is that
  • the Gram matrix K, with elements k(x_n, x_m), is positive semi-definite for all possible choices of the set {x_n}
• Positive semi-definite is not the same thing as a matrix whose elements are non-negative
  • It means z^T K z \geq 0 for all real vectors z, i.e., \sum_n \sum_m K_{nm} z_n z_m \geq 0 for any real numbers z_n, z_m
• Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional space
• New kernels can be constructed from simpler kernels as building blocks

Techniques for Constructing Kernels
• Given valid kernels k_1(x, x') and k_2(x, x'), the following new kernels will also be valid (a numeric sketch of the (x^T z)^2 expansion and of checking a composite kernel is given after this list):
  1. k(x, x') = c k_1(x, x'), where c > 0 is a constant
  2. k(x, x') = f(x) k_1(x, x') f(x'), where f(·) is any function
  3. k(x, x') = q(k_1(x, x')), where q(·) is a polynomial with non-negative coefficients
  4. k(x, x') = exp(k_1(x, x'))
  5. k(x, x') = k_1(x, x') + k_2(x, x')
  6. k(x, x') = k_1(x, x') k_2(x, x')
  7. k(x, x') = k_3(f(x), f(x')), where f(x) is a function from x to R^M and k_3 is a valid kernel in R^M
  8. k(x, x') = x^T A x', where A is a symmetric positive semi-definite matrix
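The sketch below (not from the slides) checks two of the points above numerically: (i) that (x^T z)^2 equals φ(x)^T φ(z) for the stated feature map, and (ii) that a kernel assembled from valid kernels using the sum and product rules still yields a positive semi-definite Gram matrix on a random point set. The random points and the particular base kernels are arbitrary assumptions for the example.

```python
# Minimal sketch: verify (x^T z)^2 = phi(x)^T phi(z) with phi(x) = (x1^2, sqrt(2) x1 x2, x2^2),
# then check that a kernel combined via the sum and product rules gives a PSD Gram matrix.
import numpy as np

def phi(x):
    """Feature map whose inner product reproduces (x^T z)^2 in 2-D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))        # both print 1.0

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))            # arbitrary 2-D points

def gram(k, X):
    """Gram matrix of kernel k over the rows of X."""
    return np.array([[k(a, b) for b in X] for a in X])

k1 = lambda a, b: (a @ b) ** 2                          # quadratic polynomial kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2))         # RBF kernel
k_comb = lambda a, b: k1(a, b) + k1(a, b) * k2(a, b)    # rules 5 and 6 combined

eigvals = np.linalg.eigvalsh(gram(k_comb, X))
print(eigvals.min() >= -1e-10)              # True: Gram matrix is positive semi-definite
```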