Kernel Methods

Machine Learning Srihari Kernel Methods Sargur Srihari [email protected] 1 Machine Learning Srihari Topics in Kernel Methods 1. Linear Models vs Memory-based models 2. Stored Sample Methods 3. Kernel Functions • Dual Representations • Constructing Kernels 4. Extension to Symbolic Inputs 5. Fisher Kernel 2 Machine Learning Srihari Linear Models vs Memory-based models • Linear parametric models for regression and classification have the form y(x,w) • During learning we either get a maximum likelihood estimate of w or a posterior distribution of w • Training data is then discarded • Prediction based only on vector w • This is true of Neural networks as well • Memory-based methods keep the training samples (or a subset) and use them during the prediction phase 3 Machine Learning Srihari Memory-Based Methods • Training data points are used in prediction • Examples of such methods • Parzen probability density model • Linear combination of kernel functions centered on each training data point • Nearest neighbor classification • These are memory-based methods • They require a metric to be defined • Fast to train, slow to predict 4 Machine Learning Srihari Kernel Functions • Linear models can be re-cast • into equivalent dual where predictions are based on kernel functions evaluated at training points • Kernel function is given by k (x,x) = ϕ(x)T ϕ(x) • where ϕ(x) is a fixed nonlinear mapping (basis function) • Kernel is a symmetric function of its arguments k (x,x) = k (x,x) • Kernel can be interpreted as similarity of x and x • Simplest is identity mapping in feature space ϕ(x) = x • In which case k (x,x) = xTx • Called Linear Kernel 5 Machine Learning Srihari Kernel Trick • Formulated as inner product allows extending well-known algorithms • by using the kernel trick • Basic idea of kernel trick • If an input vector x appears only in the form of scalar products then we can replace scalar products with some other choice of kernel • Used widely • in support vector machines • in developing non-linear variant of PCA • In kernel Fisher discriminant 6 Machine Learning Srihari Other Forms of Kernel Functions • Function of difference between arguments k (x,x) = k (x-x) • Called stationary kernel since invariant to translation • Radial basis functions k (x,x) = k (||x-x||) • Depends on magnitude of distance between arguments • Note that the kernel function is scalar while x is M-dimensional For these to be valid kernel functions they should be shown to have the property k (x,x) = φ (x)T φ(x) 7 Machine Learning Srihari Dual Representation • Linear models for regression/classification can be reformulated in terms of dual representation • In which kernel function arises naturally • Plays important role in SVMs • Consider linear regression model • Parameters from minimizing sum-of-squares N 1 T 2 λ T J(w) = ∑{w φ(x n ) − tn} + w w 2 n 1 2 = ϕ is the set of M where w = (w ,..,w )T , = ( ,.. )T 0 M-1 φ φ0 φM-1 basis functions we have N samples {x1,..x N } or feature vector λ is the regularization coefficient • Minimum obtained by setting gradient of J(w) wrt w equal to zero € Machine Learning Srihari Solution for w: linear combination of φ (xn) • Equating derivative J(w) wrt w to zero we get N 1 T w = − ∑{w φ(x n ) − tn }φ(x n ) λ n=1 N = ∑anφ(x n ) n=1 = ΦT a • Solution for w: linear combination of φ (xn) whose coefficients€ are functions of w where th T • Φ is design matrix whose n row is given by φ (xn) % φ0 (x1) . φM −1(x1)( ' * ' . * Φ = ' φ0 (xn ) . φM −1(xn )* is a N × M matrix ' . * ' * &φ 0 (xN ) . φM −1(xN )) T • Vector a=(a1,..,aN) has the definition € 1 T an = − {w φ(xn )−tn } λ Machine Learning Srihari Transformation from w to a T • Thus we have w = Φ a • Instead of working with parameter vector w we can reformulate least squares algorithm in terms of parameter vector a • giving rise to dual representation 1 T an = − {w φ(xn )−tn } λ • Although the definition of a still includes w It can be eliminated by the use of the kernel function 10 Machine Learning Srihari Gram Matrix and Kernel Function • Gram matrix K=ΦΦT is N x N matrix • with elements T Knm= ϕ (xn) ϕ (xm)=k(xn,xm) • where kernel function k (x,x) = ϕ (x)T ϕ (x) ⎡ ⎤ ⎢ k(x ,x ) . k(x ,x ) ⎥ Gram Matrix Definition: ⎢ 1 1 1 N ⎥ ⎢ . ⎥ Given N vectors, it is the K = ⎢ ⎥ ⎢ . ⎥ matrix of all inner products ⎢ ⎥ ⎢ k(x ,x ) k(x ,x ) ⎥ ⎣ N 1 N N 1 ⎦ • Notes: • Φ is NxM and K is NxN Note: N M times M N • K is a matrix of similarities of pairs of samples (thus it is symmetric) 11 MachineError Learning Function in Terms of Gram Matrix Srihari • Error Function is 1 N 2 λ J( ) = T (x )−t + T w ∑{w φ n n } w w 2 n=1 2 • Substituting w = ΦTa into J(w) gives 1 1 λ J(w) = aT ΦΦT ΦΦTa −aT ΦΦT t + tT t + aT ΦΦTa 2 2 2 T where t = (t1,..,tN) • Error function is written in terms of Gram matrix as 1 1 λ J(a) = aTKKa −aTKt + tT t + aTKa 2 2 2 T 1 T • Solving for a by combining w=Φ a and an = − {w φ(x n ) − tn } λ -1 a =(K +λIN) t Solution for a can be expressed as a linear€ combination of elements of ϕ(x) whose coefficients are entirely in terms of kernel k(x,x) from which we can recover original formulation in terms of parameters w Machine Learning Prediction Function Srihari • Prediction for new input x -1 T • We can write a =(K +λIN) t by combining w =Φ a and 1 T an = − {w φ(x n ) − tn } λ • Substituting back into linear regression model, € y(x) = wTφ(x) = aT Φφ(x) = k(x)T (K + λI )−1 t where k(x) has elements k (x) = k(x ,x) N n n • Prediction is a linear combination of the target values from the training set. Machine Learning Srihari Advantage of Dual Representation • Solution for a is entirely in terms of k(x,x) • From a we can recover w using w = Φta w = (ΦT Φ)−1ΦT t • In parametric formulation, solution is ML • We invert M x M matrix, since Φ is N x M -1 • In dual formulation, solution is a =(K +λIN) t • We are inverting an N x N matrix • An apparent disadvantage, since N >>M • Advantage of dual: work with kernel k(x,x), so: • avoid working with a feature vector ϕ(x) and • problems associated with very high or infinite dimensionality of x 14 Machine Learning Srihari Constructing Kernels • To exploit kernel substitution need valid kernels • Two methods: • (1) from ϕ (x) to k(x,x’) • (2) from k(x,x’) to ϕ (x) • First Method • Choose ϕ (x) and use it to find corresponding kernel k(x,x ') = φ(x)T φ(x ') M = ( ) ( ') ∑φi x φi x i=1 • where ϕi(x) are basis functions • One-dimensional input space is shown next 15 Machine Learning Srihari Kernel Functions from basis functions One-dimensional input space Polynomials Gaussian Logistic Sigmoid Basis Functions ϕi(x) Kernel Functions k(x,x) = ϕ (x)Tϕ (x’) Red cross is x 16 Machine Learning Srihari Method 2: Direct Construction of Kernels • Kernel has to correspond to a scalar product in some (perhaps infinite dimensional) space • Consider kernel function k(x,z) = (xTz)2 T 2 2 k(x,z) = (x z) = (x1z1 + x2z2) • In 2-D space 2 2 2 2 = x1z1 + 2x1z1x2z2 + x2z2 2 2 2 2 T = (x1 , 2x1x2,x2 )(z1 , 2z1z2,z2 ) = φ( )T φ( ) x z • Feature mapping takes the form φ(x) = (x 2, 2x x ,x 2) 1 1 2 2 • Comprises of all second order terms with a specific weighting • Inner product needs computing 6 feature values, 9 multiplications • Kernel function k(x,z) has 2 multiplies and a squaring • For(xTz+c)2 we get constant, linear, second order terms • For (xTz+c)M we get all powers of x (monomials) 17 Machine Learning Srihari Testing whether a function is a valid kernel • Without constructing ϕ(x), Necessary and sufficient condition for k(x,x) to be a kernel is • Gram matrix K, with elements k(xn,xm) is positive {xn}semi-definite for all possible choices of the set • Positive semi-definite is not the same thing as a matrix whose elements are non- negative zT z 0 for non - zero vectors z with real entries • It means K ≥ i.e.,∑∑Knm zn zm ≥ 0 for any real numbers zn ,zm n m • Mercers theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional € space • New kernels can be constructed from simpler kernels as building blocks 18 Machine Learning Srihari Techniques for Constructing Kernels • Given valid kernels k1(x,x) and k2(x,x) the following new kernels will be valid 1. k(x,x) =ck (x,x) 1 Where 2. k(x,x)=f(x)k1(x,x)f(x) f (.) is any function 3. k(x,x)=q(k1(x,x)) q(.) is a polynomial with non-negative coefficients 4. k(x,x)=exp(k1(x,x)) 5. k(x,x)=k1(x,x)+k2(x,x) 6. k(x,x)=k1(x,x)k2(x,x) M 7. k(x,x)=k3(f(x).f(x)) f(x) is a function from x to R k is a valid kernel in RM T 3 8.

Kernel Methods

Large-Scale Kernel Ranksvm

A Kernel Method for Multi-Labelled Classification

Density Based Data Clustering

Kernel and Image

A User's Guide to Support Vector Machines

The Forgetron: a Kernel-Based Perceptron on a Fixed Budget

Math 120 Homework 3 Solutions

Kernel Methods Through the Roof: Handling Billions of Points Efﬁciently

Discrete Topological Transformations for Image Processing Michel Couprie, Gilles Bertrand

Kernel Methodsmethods Simple Idea of Data Fitting

Abelian Categories

An Algorithmic Introduction to Clustering