
Statistical Learning and Kernel Methods

Bernhard Schölkopf

Microsoft Research Limited,

1 Guildhall Street, Cambridge CB2 3NH, UK

[email protected]

http://research.microsoft.com/bsc

February 29, 2000

Technical Report

MSR-TR-2000-23

Microsoft Research

Microsoft Corporation

One Microsoft Way

Redmond, WA 98052

Lecture notes for a course to be taught at the Interdisciplinary College 2000, Gunne, Germany, March 2000.

Abstract

We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.

Contents

1 An Introductory Example

2 Learning Pattern Recognition from Examples

3 Hyperplane Classifiers

4 Support Vector Classifiers

5 Support Vector Regression

6 Further Developments

7 Kernels

8 Representing Similarities in Linear Spaces

9 Examples of Kernels

10 Representing Dissimilarities in Linear Spaces

1 An Introductory Example

Suppose we are given empirical data
\[
(x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}. \tag{1}
\]
Here, the domain $\mathcal{X}$ is some nonempty set that the patterns $x_i$ are taken from; the $y_i$ are called labels or targets.

Unless stated otherwise, indices $i$ and $j$ will always be understood to run over the training set, i.e. $i, j = 1, \dots, m$.

Note that we have not made any assumptions on the domain $\mathcal{X}$ other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern $x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$. By this we mean, loosely speaking, that we choose $y$ such that $(x, y)$ is in some sense similar to the training examples. To this end, we need similarity measures in $\mathcal{X}$ and in $\{\pm 1\}$. The latter is easy, as two target values can only be identical or different. For the former, we require a similarity measure
\[
k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2}
\]
i.e., a function that, given two examples $x$ and $x'$, returns a real number characterizing their similarity. For reasons that will become clear later, the function $k$ is called a kernel [13, 1, 8].

A type of similarity measure that is of particular mathematical appeal are dot products. For instance, given two vectors $x, x' \in \mathbb{R}^N$, the canonical dot product is defined as
\[
(x \cdot x') := \sum_{i=1}^{N} x_i\, x'_i. \tag{3}
\]
Here, $x_i$ denotes the $i$-th entry of $x$.

The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors $x$ and $x'$, provided they are normalized to length 1. Moreover, it allows computation of the length of a vector $x$ as $\sqrt{x \cdot x}$, and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to embed them into some dot product space $F$, which need not be identical to $\mathbb{R}^N$. To this end, we use a map
\[
\Phi : \mathcal{X} \to F, \qquad x \mapsto \mathbf{x}. \tag{4}
\]
The space $F$ is called a feature space. To summarize, embedding the data into $F$ has three benefits.

1. It lets us define a similarity measure from the dot product in $F$,
\[
k(x, x') := (\mathbf{x} \cdot \mathbf{x}') = (\Phi(x) \cdot \Phi(x')). \tag{5}
\]

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping $\Phi$ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map $\Phi$ to change the representation into one that is more suitable for a given problem and learning algorithm.

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,
\[
\mathbf{c}_1 = \frac{1}{m_1} \sum_{\{i : y_i = +1\}} \mathbf{x}_i, \tag{6}
\]
\[
\mathbf{c}_2 = \frac{1}{m_2} \sum_{\{i : y_i = -1\}} \mathbf{x}_i, \tag{7}
\]
where $m_1$ and $m_2$ are the number of examples with positive and negative labels, respectively. We then assign a new point $\mathbf{x}$ to the class whose mean is closer to it. This geometrical construction can be formulated in terms of dot products. Half-way in between $\mathbf{c}_1$ and $\mathbf{c}_2$ lies the point $\mathbf{c} := (\mathbf{c}_1 + \mathbf{c}_2)/2$. We compute the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{c}$ and $\mathbf{x}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} := \mathbf{c}_1 - \mathbf{c}_2$ connecting the class means, in other words
\[
y = \operatorname{sgn}\left((\mathbf{x} - \mathbf{c}) \cdot \mathbf{w}\right)
= \operatorname{sgn}\left((\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2) \cdot (\mathbf{c}_1 - \mathbf{c}_2)\right)
= \operatorname{sgn}\left((\mathbf{x} \cdot \mathbf{c}_1) - (\mathbf{x} \cdot \mathbf{c}_2) + b\right). \tag{8}
\]
Here, we have defined the offset
\[
b := \frac{1}{2}\left(\|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2\right). \tag{9}
\]

It will prove instructive to rewrite this expression in terms of the patterns $x_i$ in the input domain $\mathcal{X}$. To this end, note that we do not have a dot product in $\mathcal{X}$; all we have is the similarity measure $k$ (cf. (5)). Therefore, we need to rewrite everything in terms of the kernel $k$ evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function
\[
y = \operatorname{sgn}\left(\frac{1}{m_1} \sum_{\{i : y_i = +1\}} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{m_2} \sum_{\{i : y_i = -1\}} (\mathbf{x} \cdot \mathbf{x}_i) + b\right)
= \operatorname{sgn}\left(\frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i) + b\right). \tag{10}
\]
Similarly, the offset becomes
\[
b := \frac{1}{2}\left(\frac{1}{m_2^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_1^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j)\right). \tag{11}
\]
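As a concrete illustration, here is a minimal NumPy sketch of the decision function (10) together with the offset (11); the Gaussian similarity measure and all function names are merely illustrative choices.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), one possible similarity measure
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def mean_classifier(X_train, y_train, x_new, kernel=rbf_kernel):
    """Assign x_new to the class whose feature-space mean is closer, cf. (10), (11).

    X_train: (m, d) array of patterns; y_train: array of +1/-1 labels.
    """
    pos, neg = X_train[y_train == +1], X_train[y_train == -1]
    # kernel expansions centered on the training examples, cf. (10)
    s_pos = np.mean([kernel(x_new, xi) for xi in pos])
    s_neg = np.mean([kernel(x_new, xi) for xi in neg])
    # offset (11): half the difference of the average within-class similarities
    b = 0.5 * (np.mean([kernel(xi, xj) for xi in neg for xj in neg])
               - np.mean([kernel(xi, xj) for xi in pos for xj in pos]))
    return np.sign(s_pos - s_neg + b)
```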

Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence $b = 0$), and that $k$ can be viewed as a density, i.e. it is positive and has integral 1,
\[
\int_{\mathcal{X}} k(x, x')\, dx = 1 \quad \text{for all } x' \in \mathcal{X}. \tag{12}
\]
In order to state this assumption, we have to require that we can define an integral on $\mathcal{X}$.

If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,
\[
p_1(x) := \frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i), \tag{13}
\]
\[
p_2(x) := \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i). \tag{14}
\]
Given some point $x$, the label is then simply computed by checking which of the two, $p_1(x)$ or $p_2(x)$, is larger, which directly leads to (10). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain, it is represented by a kernel expansion. It is example-based in the sense that the kernels are centered on the training examples, i.e. one of the two arguments of the kernels is always a training example. The main point where the more sophisticated techniques to be discussed later will deviate from (10) is in the selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function. Namely, it will no longer be the case that all training examples appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform. In the feature space representation, this statement corresponds to saying that we will study all normal vectors $\mathbf{w}$ of decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (10)). The hyperplane will then only depend on a subset of training examples, called support vectors.

2 Learning Pattern Recognition from Examples

With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting [27, 28], following the introduction of [19].

In two-class pattern recognition, we seek to estimate a function
\[
f : \mathcal{X} \to \{\pm 1\} \tag{15}
\]
based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution $P(x, y)$. Our goal is to learn a function that will correctly classify unseen examples $(x, y)$, i.e. we want $f(x) = y$ for examples $(x, y)$ that were also generated from $P(x, y)$.

If we put no restriction on the class of functions that we choose our estimate $f$ from, however, even a function which does well on the training data, e.g. by satisfying $f(x_i) = y_i$ for all $i = 1, \dots, m$, need not generalize well to unseen examples. To see this, note that for each function $f$ and any test set $(\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathbb{R}^N \times \{\pm 1\}$ satisfying $\{\bar{x}_1, \dots, \bar{x}_{\bar{m}}\} \cap \{x_1, \dots, x_m\} = \emptyset$, there exists another function $f^*$ such that $f^*(x_i) = f(x_i)$ for all $i = 1, \dots, m$, yet $f^*(\bar{x}_i) \neq f(\bar{x}_i)$ for all $i = 1, \dots, \bar{m}$. As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only minimizing the training error (or empirical risk),
\[
R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left| f(x_i) - y_i \right|, \tag{16}
\]
does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution $P(x, y)$,
\[
R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y). \tag{17}
\]

Statistical learning theory [31, 27, 28, 29], or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that $f$ is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [27]. The best-known capacity concept of VC theory is the VC dimension, defined as the largest number $h$ of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if $h < m$ is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least $1 - \eta$, the bound
\[
R(\alpha) \leq R_{\mathrm{emp}}(\alpha) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) \tag{18}
\]
holds, where the confidence term $\phi$ is defined as
\[
\phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}. \tag{19}
\]
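As a small illustration, the confidence term (19) and the right-hand side of (18) can be evaluated numerically; the values of $h$, $m$ and $\eta$ used below are arbitrary.

```python
import math

def vc_confidence(h, m, eta):
    """Confidence term (19), assuming h < m and 0 < eta < 1."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

def risk_bound(r_emp, h, m, eta):
    """Right-hand side of the VC bound (18); it is void whenever it exceeds 1."""
    return r_emp + vc_confidence(h, m, eta)

# e.g. zero training error, VC dimension 100, 10000 examples, confidence 95%
print(risk_bound(0.0, h=100, m=10000, eta=0.05))
```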

Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the Growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory [28]. Alternative capacity concepts that can be used to formulate bounds include the fat shattering dimension [2].

The bound (18) deserves some further explanatory remarks. Suppose we wanted to learn a "dependency" where $P(x, y) = P(x) \cdot P(y)$, i.e. where the pattern $x$ contains no information about the label $y$, with uniform $P(y)$. Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other). However, in order to reproduce the random labellings, this machine will necessarily require a large VC dimension $h$. Thus, the confidence term (19), increasing monotonically with $h$, will be large, and the bound (18) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (18) can hold independent of assumptions about the underlying distribution $P(x, y)$: it always holds (provided that $h < m$), but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (18), the function space must be restricted such that the capacity (e.g. VC dimension) is small enough in relation to the available amount of data.

3 Hyperplane Classifiers

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced previously). As described in the previous section, to design learning algorithms, one needs to come up with a class of functions whose capacity can be computed.

[32] and [30] considered the class of hyperplanes
\[
(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbb{R}^N, \; b \in \mathbb{R}, \tag{20}
\]
corresponding to decision functions
\[
f(\mathbf{x}) = \operatorname{sgn}\left((\mathbf{w} \cdot \mathbf{x}) + b\right), \tag{21}
\]
and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing $f$ from empirical data. It is based on two facts. First, among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes,
\[
\max_{\mathbf{w}, b} \; \min \left\{ \|\mathbf{x} - \mathbf{x}_i\| : \mathbf{x} \in \mathbb{R}^N, \; (\mathbf{w} \cdot \mathbf{x}) + b = 0, \; i = 1, \dots, m \right\}. \tag{22}
\]
Second, the capacity decreases with increasing margin.

To construct this Optimal Hyperplane (cf. Figure 1), one solves the following optimization problem:
\[
\text{minimize} \quad \tau(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 \tag{23}
\]
\[
\text{subject to} \quad y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq 1, \quad i = 1, \dots, m. \tag{24}
\]
This constrained optimization problem is dealt with by introducing Lagrange multipliers $\alpha_i \geq 0$ and a Lagrangian
\[
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i \left((\mathbf{x}_i \cdot \mathbf{w}) + b\right) - 1 \right). \tag{25}
\]

The Lagrangian $L$ has to be minimized with respect to the primal variables $\mathbf{w}$ and $b$ and maximized with respect to the dual variables $\alpha_i$ (i.e. a saddle point has to be found). Let us try to get some intuition for this. If a constraint (24) is violated, then $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 < 0$, in which case $L$ can be increased by increasing the corresponding $\alpha_i$. At the same time, $\mathbf{w}$ and $b$ will have to change such that $L$ decreases. To prevent $\alpha_i (y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1)$ from becoming arbitrarily large, the change in $\mathbf{w}$ and $b$ will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities, i.e. for which $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 > 0$, the corresponding $\alpha_i$ must be 0: this is the value of $\alpha_i$ that maximizes $L$. The latter is the statement of the Karush-Kuhn-Tucker complementarity conditions of optimization theory [6].

The condition that at the saddle point, the derivatives of $L$ with respect to the primal variables must vanish,
\[
\frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \tag{26}
\]

Figure 1: A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector $\mathbf{w}$ and a threshold $b$ such that $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) > 0$ ($i = 1, \dots, m$). Rescaling $\mathbf{w}$ and $b$ such that the points closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, we obtain a canonical form $(\mathbf{w}, b)$ of the hyperplane, satisfying $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals $2/\|\mathbf{w}\|$. This can be seen by considering two points $\mathbf{x}_1, \mathbf{x}_2$ on opposite sides of the margin, i.e. $(\mathbf{w} \cdot \mathbf{x}_1) + b = 1$, $(\mathbf{w} \cdot \mathbf{x}_2) + b = -1$, and projecting them onto the hyperplane normal vector $\mathbf{w}/\|\mathbf{w}\|$.

leads to
\[
\sum_{i=1}^{m} \alpha_i y_i = 0 \tag{27}
\]
and
\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i. \tag{28}
\]
The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose $\alpha_i$ is non-zero, called Support Vectors. By the Karush-Kuhn-Tucker complementarity conditions
\[
\alpha_i \left[ y_i \left((\mathbf{x}_i \cdot \mathbf{w}) + b\right) - 1 \right] = 0, \quad i = 1, \dots, m, \tag{29}
\]
the Support Vectors lie on the margin (cf. Figure 1). All remaining examples of the training set are irrelevant: their constraint (24) does not play a role in the optimization, and they do not appear in the expansion (28). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 1) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (27) and (28) into $L$, one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem [e.g. 6]: find multipliers $\alpha_i$ which
\[
\text{maximize} \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{30}
\]
\[
\text{subject to} \quad \alpha_i \geq 0, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{31}
\]
The hyperplane decision function can thus be written as
\[
f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i (\mathbf{x} \cdot \mathbf{x}_i) + b \right), \tag{32}
\]
where $b$ is computed using (29).
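To make the construction concrete, here is a minimal sketch that solves the dual (30)-(31) numerically for a small separable data set and recovers $\mathbf{w}$ and $b$ via (28) and (29). It relies on scipy's general-purpose constrained optimizer rather than a dedicated quadratic programming code, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve the Wolfe dual (30)-(31) for a linearly separable two-class set."""
    m = len(y)
    K = X @ X.T                                  # dot products (x_i . x_j)
    Q = (y[:, None] * y[None, :]) * K

    def neg_W(a):                                # minimize -W(alpha), cf. (30)
        return -(a.sum() - 0.5 * a @ Q @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
    bounds = [(0, None)] * m                           # alpha_i >= 0
    res = minimize(neg_W, np.zeros(m), bounds=bounds, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X                          # expansion (28)
    sv = np.argmax(alpha)                        # index of a support vector
    b = y[sv] - w @ X[sv]                        # from the KKT condition (29)
    return w, b, alpha
```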

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics. Also there, often only a subset of the constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant; the walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [9]: If we assume that each support vector $\mathbf{x}_i$ exerts a perpendicular force of size $\alpha_i$ and sign $y_i$ on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability. The constraint (27) states that the forces on the sheet sum to zero; and (28) implies that the torques also sum to zero, via $\sum_i \mathbf{x}_i \times y_i \alpha_i\, \mathbf{w}/\|\mathbf{w}\| = \mathbf{w} \times \mathbf{w}/\|\mathbf{w}\| = 0$.

There are theoretical arguments supporting the good generalization performance of the optimal hyperplane [31, 27, 35, 4]. In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming problem.

4 Support Vector Classifiers

We now have all the tools to describe support vector machines [28, 19, 26]. Everything in the last section was formulated in a dot product space. We think of this space as the feature space $F$ described in Section 1. To express the formulas in terms of the input patterns living in $\mathcal{X}$, we thus need to employ (5), which expresses the dot product of bold face feature vectors $\mathbf{x}, \mathbf{x}'$ in terms of the kernel $k$ evaluated on input patterns $x, x'$,
\[
k(x, x') = (\mathbf{x} \cdot \mathbf{x}'). \tag{33}
\]

Figure 2: The idea of SV machines: map the training data into a higher-dimensional feature space via $\Phi$, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

This can be done since all feature vectors only occurred in dot products. The weight vector (cf. (28)) then becomes an expansion in feature space, and will thus typically no longer correspond to the image of a single vector from input space. We thus obtain decision functions of the more general form (cf. (32))
\[
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i \left(\Phi(x) \cdot \Phi(x_i)\right) + b \right)
= \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i\, k(x, x_i) + b \right), \tag{34}
\]
and the following quadratic program (cf. (30)):
\[
\text{maximize} \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \tag{35}
\]
\[
\text{subject to} \quad \alpha_i \geq 0, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{36}
\]

In practice, a separating hyperplane may not exist, e.g. if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (24), one introduces slack variables [10, 28, 22]
\[
\xi_i \geq 0, \quad i = 1, \dots, m, \tag{37}
\]
in order to relax the constraints to
\[
y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq 1 - \xi_i, \quad i = 1, \dots, m. \tag{38}
\]
A classifier which generalizes well is then found by controlling both the classifier capacity (via $\|\mathbf{w}\|$) and the sum of the slacks $\sum_i \xi_i$. The latter is done as it can be shown to provide an upper bound on the number of training errors, which leads to a convex optimization problem.

Figure 3: Example of a Support Vector classifier found by using a radial basis function kernel $k(x, x') = \exp(-\|x - x'\|^2)$. Both coordinate axes range from $-1$ to $+1$. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (24). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the argument $\sum_{i=1}^{m} y_i \alpha_i\, k(x, x_i) + b$ of the decision function (34).

One possible realization of a soft margin classifier is minimizing the objective function
\[
\tau(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \tag{39}
\]
subject to the constraints (37) and (38), for some value of the constant $C > 0$ determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors, $\boldsymbol{\xi} = (\xi_1, \dots, \xi_m)$. Incorporating kernels, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (35), subject to the constraints
\[
0 \leq \alpha_i \leq C, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{40}
\]
The only difference from the separable case is the upper bound $C$ on the Lagrange multipliers $\alpha_i$. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (34). The threshold $b$ can be computed by exploiting the fact that for all SVs $x_i$ with $\alpha_i < C$, the slack variable $\xi_i$ is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence
\[
\sum_{j=1}^{m} y_j \alpha_j\, k(x_i, x_j) + b = y_i. \tag{41}
\]

Figure 4: In SV regression, a tube with radius $\varepsilon$ is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables $\xi$) is determined by minimizing (46).

Another possible realization of a soft margin variant of the optimal hyperplane uses the $\nu$-parametrization [22]. In it, the parameter $C$ is replaced by a parameter $\nu \in [0, 1]$ which can be shown to lower and upper bound the number of examples that will be SVs and that will come to lie on the wrong side of the hyperplane, respectively. It uses a primal objective function with the error term $\frac{1}{\nu m} \sum_i \xi_i - \rho$, and separation constraints
\[
y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq \rho - \xi_i, \quad i = 1, \dots, m. \tag{42}
\]
The margin parameter $\rho$ is a variable of the optimization problem. The dual can be shown to consist of maximizing the quadratic part of (35), subject to $0 \leq \alpha_i \leq 1/(\nu m)$, $\sum_i \alpha_i y_i = 0$ and the additional constraint $\sum_i \alpha_i = 1$.
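Both the $C$-parametrized soft margin machine and the $\nu$-parametrized variant are implemented in common software; the following is a brief sketch assuming scikit-learn, with arbitrary toy data and parameter values.

```python
import numpy as np
from sklearn.svm import SVC, NuSVC

# toy two-class data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])

# C-parametrized soft margin SV classifier with an RBF kernel, cf. (35), (40)
clf_c = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# nu-parametrized variant: nu bounds the fractions of margin errors and of SVs
clf_nu = NuSVC(kernel="rbf", nu=0.1, gamma=0.5).fit(X, y)

print(len(clf_c.support_), len(clf_nu.support_))   # numbers of support vectors
print(clf_c.predict([[0.0, 0.0]]))
```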

5 Support Vector Regression

The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation [28], an analogue of the margin is constructed in the space of the target values $y$ (note that in regression, we have $y \in \mathbb{R}$) by using Vapnik's $\varepsilon$-insensitive loss function (Figure 4),
\[
|y - f(x)|_\varepsilon := \max\{0, \; |y - f(x)| - \varepsilon\}. \tag{43}
\]

To estimate a linear function
\[
f(x) = (\mathbf{w} \cdot \mathbf{x}) + b \tag{44}
\]
with precision $\varepsilon$, one minimizes
\[
\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon. \tag{45}
\]

Written as a constrained optimization problem, this reads:
\[
\text{minimize} \quad \tau(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \tag{46}
\]
\[
\text{subject to} \quad \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) - y_i \leq \varepsilon + \xi_i, \tag{47}
\]
\[
\phantom{\text{subject to}} \quad y_i - \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \leq \varepsilon + \xi_i^*, \tag{48}
\]
\[
\phantom{\text{subject to}} \quad \xi_i, \xi_i^* \geq 0 \tag{49}
\]
for all $i = 1, \dots, m$. Note that according to (47) and (48), any error smaller than $\varepsilon$ does not require a nonzero $\xi_i$ or $\xi_i^*$, and hence does not enter the objective function (46).

Generalization to kernel-based regression estimation is carried out in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for $C > 0$ and $\varepsilon \geq 0$ chosen a priori,
\[
\text{maximize} \quad W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = -\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, k(x_i, x_j) \tag{50}
\]
\[
\text{subject to} \quad 0 \leq \alpha_i, \alpha_i^* \leq C, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) = 0. \tag{51}
\]

The regression estimate takes the form
\[
f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i)\, k(x_i, x) + b, \tag{52}
\]
where $b$ is computed using the fact that (47) becomes an equality with $\xi_i = 0$ if $0 < \alpha_i < C$, and (48) becomes an equality with $\xi_i^* = 0$ if $0 < \alpha_i^* < C$.
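For illustration, the $\varepsilon$-insensitive regression of (46)-(52) can be run with scikit-learn's SVR; the toy data and parameter choices below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

# toy 1-D regression data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# epsilon-insensitive SV regression with an RBF kernel, cf. (46)-(52)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)

# only points outside (or on) the epsilon-tube become support vectors
print("number of SVs:", len(svr.support_))
print("prediction at x=0:", svr.predict([[0.0]]))
```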

Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector $(\mathbf{w}, \boldsymbol{\xi})$ (cf. (46)). There are multiple degrees of freedom for constructing it, including some freedom how to penalize, or regularize, different parts of the vector, and some freedom how to use the kernel trick. For instance, more general loss functions can be used for $\boldsymbol{\xi}$, leading to problems that can still be solved efficiently [24]. Moreover, norms other than the 2-norm $\|\cdot\|$ can be used to regularize the solution. Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of $\mathbf{w}$ [17].

Figure 5: Architecture of SV machines. The input $x$ and the Support Vectors $x_i$ are nonlinearly mapped (by $\Phi$) into a feature space $F$, where dot products are computed. By the use of the kernel $k$, these two layers are in practice computed in one single step. The results are linearly combined by weights $\upsilon_i$, found by solving a quadratic program (in pattern recognition, $\upsilon_i = y_i \alpha_i$; in regression estimation, $\upsilon_i = \alpha_i^* - \alpha_i$). The linear combination is fed into the function $\sigma$ (in pattern recognition, $\sigma(x) = \operatorname{sgn}(x + b)$; in regression estimation, $\sigma(x) = x + b$).

Finally, the algorithm can be modified such that $\varepsilon$ need not be specified a priori. Instead, one specifies an upper bound $0 \leq \nu \leq 1$ on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding $\varepsilon$ is computed automatically. This is achieved by using as primal objective function
\[
\frac{1}{2} \|\mathbf{w}\|^2 + C \left( \nu m \varepsilon + \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon \right) \tag{53}
\]
instead of (45), and treating $\varepsilon \geq 0$ as a parameter that we minimize over [22].

6 Further Developments

Having described the basics of SV machines, we now summarize some empirical findings and theoretical developments which were to follow.

By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines, they empirically lead to very similar classification accuracies and SV sets [18]. In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e. the type of classifier) used.

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with the best available classifiers on both OCR and object recognition tasks [7, 9, 17].

Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (35), the quadratic part contained at least all SVs: the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be $m \times m$, where $m$ is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables $\xi_i$ will become nonzero, and all the corresponding examples will become SVs. For this case, a decomposition algorithm was proposed [14], which is based on the observation that not only can we leave out the non-SV examples (i.e. the $x_i$ with $\alpha_i = 0$) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e. $\alpha_i = C$). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding sub-problems. SMO [15, 25, 20] explores an extreme case, where the sub-problems are chosen so small that one can solve them analytically. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on the optimization problem, see [19].

On the theoretical side, the least understood part of the SV algorithm initially was the precise role of the kernel, and how a certain kernel choice would influence the generalization ability. In that respect, the connection to regularization theory provided some insight. For kernel-based function expansions, one can show that given a regularization operator $P$ mapping the functions of the learning machine into some dot product space, the problem of minimizing the regularized risk
\[
R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2} \|Pf\|^2 \tag{54}
\]
(with a regularization parameter $\lambda \geq 0$) can be written as a constrained optimization problem. For particular choices of the loss function, it further reduces to a SV type quadratic programming problem. The latter thus is not specific to SV machines, but is common to a much wider class of approaches. What gets lost in the general case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specific feature of SV machines is due to the fact that the type of regularization and the class of functions that the estimate is chosen from are intimately related [11, 23]: the SV algorithm is equivalent to minimizing the regularized risk on the set of functions
\[
f(x) = \sum_i \alpha_i\, k(x_i, x) + b, \tag{55}
\]
provided that $k$ and $P$ are interrelated by
\[
k(x_i, x_j) = \left( (Pk)(x_i, \cdot) \cdot (Pk)(x_j, \cdot) \right). \tag{56}
\]
To this end, $k$ is chosen as a Green's function of $P^*P$, for in that case, the right hand side of (56) equals
\[
\left( k(x_i, \cdot) \cdot (P^*Pk)(x_j, \cdot) \right) = \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right) = k(x_i, x_j). \tag{57}
\]
For instance, a Gaussian RBF kernel thus corresponds to regularization with a functional containing a specific differential operator.

In SV machines, the kernel thus plays a dual role: firstly, it determines the class of functions (55) that the solution is taken from; secondly, via (56), the kernel determines the type of regularization that is used.

We conclude this section by noticing that the kernel method for computing dot products in feature spaces is not restricted to SV machines. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis [21], and a number of developments have followed this example.
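As an illustration of this remark, here is a brief sketch applying kernel principal component analysis, as implemented in scikit-learn, to a toy data set; the kernel and its parameter are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# toy data: two concentric rings, which have no useful linear structure
rng = np.random.RandomState(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.hstack([np.ones(100), 3 * np.ones(100)]) + 0.1 * rng.randn(200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# PCA carried out implicitly in the feature space of an RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0)
Z = kpca.fit_transform(X)
print(Z.shape)   # (200, 2): nonlinear features extracted via the kernel trick
```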

7 Kernels

We now take a closer look at the issue of the similarity measure, or kernel, $k$. In this section, we think of $\mathcal{X}$ as a subset of the vector space $\mathbb{R}^N$, $N \in \mathbb{N}$, endowed with the canonical dot product (3).

7.1 Product Features

Suppose we are given patterns $x \in \mathbb{R}^N$ where most information is contained in the $d$-th order products (monomials) of entries $[x]_j$ of $x$,
\[
[x]_{j_1} \cdot \ldots \cdot [x]_{j_d}, \tag{58}
\]
where $j_1, \dots, j_d \in \{1, \dots, N\}$. In that case, we might prefer to extract these product features, and work in the feature space $F$ of all products of $d$ entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels.

For instance, in $\mathbb{R}^2$, we can collect all monomial feature extractors of degree 2 in the nonlinear map
\[
\Phi : \mathbb{R}^2 \to F = \mathbb{R}^3, \tag{59}
\]
\[
([x]_1, [x]_2) \mapsto ([x]_1^2, \; [x]_2^2, \; [x]_1 [x]_2). \tag{60}
\]

This approach works fine for small toy examples, but it fails for realistically sized problems: for $N$-dimensional input patterns, there exist
\[
N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!} \tag{61}
\]
different monomials (58), comprising a feature space $F$ of dimensionality $N_F$. For instance, already $16 \times 16$ pixel input images and a monomial degree $d = 5$ yield a dimensionality of $10^{10}$.

In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space $\mathbb{R}^N$. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.

The following section describes how dot products in polynomial feature spaces can be computed efficiently.

7.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form $(\Phi(x) \cdot \Phi(x'))$, we employ kernel representations of the form
\[
k(x, x') = (\Phi(x) \cdot \Phi(x')), \tag{62}
\]
which allow us to compute the value of the dot product in $F$ without having to carry out the map $\Phi$. This method was used by [8] to extend the Generalized Portrait hyperplane classifier of [31] to nonlinear Support Vector machines. In [1], $F$ is termed the linearization space, and used in the context of the potential function classification method to express the dot product between elements of $F$ in terms of elements of the input space.

What does $k$ look like for the case of polynomial features? We start by giving an example [28] for $N = d = 2$. For the map
\[
C_2 : ([x]_1, [x]_2) \mapsto ([x]_1^2, \; [x]_2^2, \; [x]_1 [x]_2, \; [x]_2 [x]_1), \tag{63}
\]
dot products in $F$ take the form
\[
(C_2(x) \cdot C_2(x')) = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = (x \cdot x')^2, \tag{64}
\]

i.e. the desired kernel $k$ is simply the square of the dot product in input space. The same works for arbitrary $N, d \in \mathbb{N}$ [8]: as a straightforward generalization of a result proved in the context of polynomial approximation [16, Lemma 2.1], we have:

Proposition 1 Define $C_d$ to map $x \in \mathbb{R}^N$ to the vector $C_d(x)$ whose entries are all possible $d$-th degree ordered products of the entries of $x$. Then the corresponding kernel computing the dot product of vectors mapped by $C_d$ is
\[
k(x, x') = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d. \tag{65}
\]

Proof. We directly compute
\[
(C_d(x) \cdot C_d(x')) = \sum_{j_1, \dots, j_d = 1}^{N} [x]_{j_1} \cdot \ldots \cdot [x]_{j_d} \cdot [x']_{j_1} \cdot \ldots \cdot [x']_{j_d} \tag{66}
\]
\[
= \left( \sum_{j=1}^{N} [x]_j \cdot [x']_j \right)^{\!d} = (x \cdot x')^d. \tag{67}
\]

Instead of ordered products, we can use unordered ones to obtain a map $\Phi_d$ which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in $C_d$ by scaling the respective entries of $\Phi_d$ with the square roots of their numbers of occurrence. Then, by this definition of $\Phi_d$, and (65),
\[
(\Phi_d(x) \cdot \Phi_d(x')) = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d. \tag{68}
\]
For instance, if $n$ of the $j_i$ in (58) are equal, and the remaining ones are different, then the coefficient in the corresponding component of $\Phi_d$ is $\sqrt{(d - n + 1)!}$ [for the general case, cf. 23]. For $\Phi_2$, this simply means that [28]
\[
\Phi_2(x) = ([x]_1^2, \; [x]_2^2, \; \sqrt{2}\,[x]_1 [x]_2). \tag{69}
\]

If $x$ represents an image with the entries being pixel values, we can use the kernel $(x \cdot x')^d$ to work in the space spanned by products of any $d$ pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern $\Phi_d(x)$. Using kernels of the form (65), we take into account higher-order statistics without the combinatorial explosion (cf. (61)) of time and memory complexity which goes along already with moderately high $N$ and $d$.

To conclude this section, note that it is possible to modify (65) such that it maps into the space of all monomials up to degree $d$, defining [28]
\[
k(x, x') = \left( (x \cdot x') + 1 \right)^d. \tag{70}
\]
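A small numerical check of (68) and (69) for $N = d = 2$: the explicit map $\Phi_2$ and the polynomial kernel (65) yield the same dot product, without ever forming the feature vectors when the kernel is used.

```python
import numpy as np

def phi2(x):
    # unordered degree-2 monomial map (69) for x in R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi2(x) @ phi2(xp)       # explicit feature-space dot product
rhs = (x @ xp) ** 2            # kernel evaluation (65) with d = 2
print(np.allclose(lhs, rhs))   # True: the kernel avoids the explicit map
```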

8 Representing Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the kernel. Given some kernel function, can we construct a feature space such that the kernel computes the dot product in that feature space? This question has been brought to the attention of the community by [1, 8, 28]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [5]; indeed, a large part of the material in the present section is based on that work.

There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data. The results in the current section, in contrast, hold for data drawn from domains which need no additional structure other than them being nonempty sets $\mathcal{X}$. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available [17, 12, 34].

We start with some basic definitions and results.

Definition 2 (Gram matrix) Given a kernel $k$ and patterns $x_1, \dots, x_m \in \mathcal{X}$, the $m \times m$ matrix
\[
K_{ij} := k(x_i, x_j) \tag{71}
\]
is called the Gram matrix (or kernel matrix) of $k$ with respect to $x_1, \dots, x_m$.

Definition 3 (Positive matrix) An $m \times m$ matrix $K_{ij}$ satisfying
\[
\sum_{i,j} c_i \bar{c}_j K_{ij} \geq 0 \tag{72}
\]
for all $c_i \in \mathbb{C}$ is called positive. (The bar in $\bar{c}_j$ denotes complex conjugation.)

Definition 4 (Positive definite kernel) Let $\mathcal{X}$ be a nonempty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ which for all $m \in \mathbb{N}$, $x_i \in \mathcal{X}$ gives rise to a positive Gram matrix is called a positive definite kernel. Often, we shall refer to it simply as a kernel.

The term kernel stems from the first use of this type of function in the study of integral operators. A function $k$ which gives rise to an operator $T$ via
\[
(Tf)(x) = \int_{\mathcal{X}} k(x, x')\, f(x')\, dx' \tag{73}
\]
is called the kernel of $T$. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (72) only occurs if $c_1 = \ldots = c_m = 0$. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive. In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel.

The definitions for positive definite kernels and positive matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated.
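To make Definitions 2 and 3 concrete, here is a short sketch that builds the Gram matrix (71) for a set of points and tests positivity (72) via the eigenvalues of the real, symmetric matrix; the kernel and the tolerance are arbitrary choices.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Gram matrix (71) of a kernel with respect to the points xs."""
    return np.array([[kernel(xi, xj) for xj in xs] for xi in xs])

def is_positive(K, tol=1e-10):
    """Check (72) for a real symmetric Gram matrix via its eigenvalues."""
    return np.all(np.linalg.eigvalsh(K) >= -tol)

rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
xs = np.random.RandomState(0).randn(20, 3)
print(is_positive(gram_matrix(rbf, xs)))   # True for a positive definite kernel
```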

Positive definiteness implies positivity on the diagonal,
\[
k(x_1, x_1) \geq 0 \quad \text{for all } x_1 \in \mathcal{X} \tag{74}
\]
(use $m = 1$ in (72)), and symmetry, i.e.
\[
k(x_i, x_j) = \overline{k(x_j, x_i)}. \tag{75}
\]
Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is analogous, i.e. $K_{ij} = \overline{K_{ji}}$.

Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in $\mathbb{C} \setminus \mathbb{R}$. However, it is not sufficient to require that (72) hold for real coefficients $c_i$. If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric,
\[
k(x_i, x_j) = k(x_j, x_i). \tag{76}
\]
It can be shown that whenever $k$ is a complex-valued positive definite kernel, its real part is a real-valued positive definite kernel.

Kernels can be regarded as generalized dot products. Indeed, any dot product can be shown to be a kernel; however, linearity does not carry over from dot products to general kernels. Another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:

Proposition 5 If $k$ is a positive definite kernel, and $x_1, x_2 \in \mathcal{X}$, then
\[
|k(x_1, x_2)|^2 \leq k(x_1, x_1) \cdot k(x_2, x_2). \tag{77}
\]

Proof. For the sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The $2 \times 2$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$ is positive. Hence both its eigenvalues are nonnegative, and so is their product, $K$'s determinant, i.e.
\[
0 \leq K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12} \overline{K_{12}} = K_{11} K_{22} - |K_{12}|^2. \tag{78}
\]
Substituting $k(x_i, x_j)$ for $K_{ij}$, we get the desired inequality.

We are now in a position to construct the feature space associated with a kernel $k$.

We define a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into $\mathbb{C}$, denoted as $\mathbb{C}^{\mathcal{X}}$, via
\[
\Phi : \mathcal{X} \to \mathbb{C}^{\mathcal{X}}, \qquad x \mapsto k(\cdot, x). \tag{79}
\]
Here, $\Phi(x) = k(\cdot, x)$ denotes the function that assigns the value $k(x', x)$ to $x' \in \mathcal{X}$.

We have thus turned each pattern into a function on the domain $\mathcal{X}$. In a sense, a pattern is now represented by its similarity to all other points in the input domain $\mathcal{X}$. This seems a very rich representation, but it will turn out that the kernel allows the computation of the dot product in that representation.

We shall now construct a dot product space containing the images of the input patterns under $\Phi$. To this end, we first need to endow it with the linear structure of a vector space. This is done by forming linear combinations of the form
\[
f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i). \tag{80}
\]
Here, $m \in \mathbb{N}$, $\alpha_i \in \mathbb{C}$ and $x_i \in \mathcal{X}$ are arbitrary.

Next, we define a dot product between $f$ and another function
\[
g(\cdot) = \sum_{j=1}^{m'} \beta_j\, k(\cdot, x'_j) \tag{81}
\]
(where $m' \in \mathbb{N}$, $\beta_j \in \mathbb{C}$ and $x'_j \in \mathcal{X}$) as
\[
\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \bar{\alpha}_i \beta_j\, k(x_i, x'_j). \tag{82}
\]

To see that this is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that
\[
\langle f, g \rangle = \sum_{j=1}^{m'} \beta_j\, \overline{f(x'_j)}, \tag{83}
\]
using $k(x'_j, x_i) = \overline{k(x_i, x'_j)}$. The latter, however, does not depend on the particular expansion of $f$. Similarly, for $g$, note that
\[
\langle f, g \rangle = \sum_{i=1}^{m} \bar{\alpha}_i\, g(x_i). \tag{84}
\]
The last two equations also show that $\langle \cdot, \cdot \rangle$ is antilinear in the first argument and linear in the second one. It is symmetric, as $\langle f, g \rangle = \overline{\langle g, f \rangle}$. Moreover, given functions $f_1, \dots, f_n$ and coefficients $\gamma_1, \dots, \gamma_n \in \mathbb{C}$, we have
\[
\sum_{i,j=1}^{n} \bar{\gamma}_i \gamma_j \langle f_i, f_j \rangle = \left\langle \sum_i \gamma_i f_i, \; \sum_j \gamma_j f_j \right\rangle \geq 0, \tag{85}
\]
hence $\langle \cdot, \cdot \rangle$ is actually a positive definite kernel on our function space.

For the last step in proving that it even is a dot product, we will use the following interesting property of $\Phi$, which follows directly from the definition: for all functions (80), we have
\[
\langle k(\cdot, x), f \rangle = f(x), \tag{86}
\]
i.e. $k$ is the representer of evaluation. In particular,
\[
\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'). \tag{87}
\]
By virtue of these properties, positive kernels $k$ are also called reproducing kernels [3, 5, 33, 17].

By (86) and Proposition 5, we have
\[
|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle. \tag{88}
\]
Therefore, $\langle f, f \rangle = 0$ directly implies $f = 0$, which is the last property that was left to prove in order to establish that $\langle \cdot, \cdot \rangle$ is a dot product.

One can complete the space of functions (80) in the norm corresponding to the dot product, i.e. add the limit points of sequences that are convergent in that norm, and thus obtain a Hilbert space $H$, usually called a reproducing kernel Hilbert space. (A Hilbert space is defined as a complete dot product space; completeness means that all sequences in $H$ which are convergent in the norm corresponding to the dot product actually have their limits in $H$, too.)

The case of real-valued kernels is included in the above; in that case, $H$ can be chosen as a real Hilbert space.

9 Examples of Kernels

Besides (65), [8] and [28] suggest the usage of Gaussian radial basis function kernels [1],
\[
k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\,\sigma^2} \right), \tag{89}
\]
and sigmoid kernels,
\[
k(x, x') = \tanh\left( \kappa\,(x \cdot x') + \Theta \right). \tag{90}
\]
Note that all these kernels have the convenient property of unitary invariance, i.e. $k(x, x') = k(Ux, Ux')$ if $U^\top = U^{-1}$ (if we consider complex numbers, then $U^*$ instead of $U^\top$ has to be used).

The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies $k(x, x) = 1$ for all $x \in \mathcal{X}$, each mapped example has unit length, $\|\Phi(x)\| = 1$. In addition, as $k(x, x') > 0$ for all $x, x' \in \mathcal{X}$, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence
\[
\cos\left( \angle(\Phi(x), \Phi(x')) \right) = (\Phi(x) \cdot \Phi(x')) = k(x, x') > 0, \tag{91}
\]
which amounts to saying that the enclosed angle between any two mapped examples is smaller than $\pi/2$.
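A brief numerical illustration of these two properties: for the Gaussian kernel (89), the Gram matrix has unit diagonal and strictly positive entries, i.e. all mapped points have unit length and enclose pairwise angles below $\pi/2$.

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    # Gaussian RBF kernel (89)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

xs = np.random.RandomState(2).randn(10, 4)
K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])

print(np.allclose(np.diag(K), 1.0))   # unit length of all mapped points
print(np.all(K > 0))                  # pairwise angles below pi/2: same orthant
```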

The examples given so far apply to the case of vectorial data. Let us at least give one example where $\mathcal{X}$ is not a vector space.

Example 6 (Similarity of probabilistic events) If $\mathcal{A}$ is a $\sigma$-algebra, and $P$ a probability measure on $\mathcal{A}$, then
\[
k(A, B) = P(A \cap B) - P(A)\,P(B) \tag{92}
\]
is a positive definite kernel.

Further examples include kernels for string matching, as proposed by [34, 12].

10 Representing Dissimilarities in Linear Spaces

We now move on to a larger class of kernels. It is interesting in several regards. First, it will turn out that some kernel algorithms work with this larger class of kernels, rather than only with positive definite kernels. Second, their relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.

The following definition differs from Definition 3 only in the additional constraint on the sum of the $c_i$.

Definition 7 (Conditionally positive matrix) A symmetric $m \times m$ matrix $K_{ij}$ ($m \geq 2$) satisfying
\[
\sum_{i,j=1}^{m} c_i \bar{c}_j K_{ij} \geq 0 \tag{93}
\]
for all $c_i \in \mathbb{C}$ with
\[
\sum_{i=1}^{m} c_i = 0 \tag{94}
\]
is called conditionally positive.

Definition 8 (Conditionally positive definite kernel) Let $\mathcal{X}$ be a nonempty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ which for all $m \geq 2$, $x_i \in \mathcal{X}$ gives rise to a conditionally positive Gram matrix is called a conditionally positive definite kernel.

The definitions for the real-valued case look exactly the same. Note that symmetry is required, also in the complex case. Due to the additional constraint on the coefficients $c_i$, it does not follow automatically anymore.

It is trivially true that whenever $k$ is positive definite, it is also conditionally positive definite. However, the latter is strictly weaker: if $k$ is conditionally positive definite, and $b \in \mathbb{C}$, then $k + b$ is also conditionally positive definite. To see this, simply apply the definition to get, for $\sum_i c_i = 0$,
\[
\sum_{i,j} c_i \bar{c}_j \left( k(x_i, x_j) + b \right) = \sum_{i,j} c_i \bar{c}_j\, k(x_i, x_j) + b \left| \sum_i c_i \right|^2 = \sum_{i,j} c_i \bar{c}_j\, k(x_i, x_j) \geq 0. \tag{95}
\]

A standard example of a conditionally positive definite kernel which is not positive definite is
\[
k(x, x') = -\|x - x'\|^2, \tag{96}
\]
where $x, x' \in \mathcal{X}$, and $\mathcal{X}$ is a dot product space. To see this, simply compute, for some pattern set $x_1, \dots, x_m$,
\[
\sum_{i,j} c_i c_j\, k(x_i, x_j) = -\sum_{i,j} c_i c_j \|x_i - x_j\|^2 \tag{97}
\]
\[
= -\sum_{i,j} c_i c_j \left( \|x_i\|^2 + \|x_j\|^2 - 2 (x_i \cdot x_j) \right)
= -\sum_i c_i \sum_j c_j \|x_j\|^2 - \sum_j c_j \sum_i c_i \|x_i\|^2 + 2 \sum_{i,j} c_i c_j (x_i \cdot x_j)
= 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) \geq 0, \tag{98}
\]
where the last line follows from (94) and the fact that $k(x, x') = (x \cdot x')$ is a positive definite kernel. Note that without (94), (97) can also be negative (e.g., put $c_1 = \ldots = c_m = 1$), hence the kernel is not a positive definite one.
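The computation (97)-(98) can be checked numerically; the sketch below shows that the quadratic form is nonnegative for coefficients satisfying (94), and can become negative without that constraint.

```python
import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(8, 3)
K = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # kernel (96)

c = rng.randn(8)
c -= c.mean()                 # enforce the constraint (94): coefficients sum to zero
print(c @ K @ c >= -1e-10)    # True: conditionally positive, cf. (97)-(98)

c_bad = np.ones(8)            # without (94) the form can go negative
print(c_bad @ K @ c_bad)      # a negative number
```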

Without proof, we add that in fact,
\[
k(x, x') = -\|x - x'\|^\beta \tag{99}
\]
is conditionally positive definite for $0 < \beta \leq 2$.

Let us consider the kernel (96), which can be considered the canonical conditionally positive kernel on a dot product space, and see how it is related to the dot product. Clearly, the distance induced by the norm is invariant under translations, i.e.
\[
\|x - x'\| = \|(x - x_0) - (x' - x_0)\| \tag{100}
\]
for all $x, x', x_0 \in \mathcal{X}$. In other words, even complete knowledge of $\|x - x'\|$ for all $x, x' \in \mathcal{X}$ would not completely determine the underlying dot product, the reason being that the dot product is not invariant under translations. Therefore, one needs to fix an origin $x_0$ when going from the distance measure to the dot product. To this end, we need to write the dot product of $x - x_0$ and $x' - x_0$ in terms of distances:
\[
\left( (x - x_0) \cdot (x' - x_0) \right) = (x \cdot x') + \|x_0\|^2 - (x \cdot x_0) - (x' \cdot x_0)
= \frac{1}{2} \left( -\|x - x'\|^2 + \|x - x_0\|^2 + \|x' - x_0\|^2 \right). \tag{101}
\]

By construction, this will always result in a positive definite kernel: the dot product is a positive definite kernel, and we have only translated the inputs. We have thus established the connection between (96) and a class of positive definite kernels corresponding to the dot product in different coordinate systems, related to each other by translations. In fact, a similar connection holds for a wide class of kernels:

Proposition 9 Let $x_0 \in \mathcal{X}$, and let $k$ be a symmetric kernel on $\mathcal{X} \times \mathcal{X}$, satisfying $k(x_0, x_0) \leq 0$. Then
\[
\tilde{k}(x, x') := k(x, x') - k(x, x_0) - k(x', x_0) \tag{102}
\]
is positive definite if and only if $k$ is conditionally positive definite.

This result can be generalized to $k(x_0, x_0) > 0$. In this case, we simply need to add $k(x_0, x_0)$ on the right hand side of (102). This is necessary, for otherwise, we would have $\tilde{k}(x_0, x_0) < 0$, contradicting (74). Without proof, we state that it is also sufficient.

Using this result, one can prove another interesting connection between the two classes of kernels:

Proposition 10 A kernel $k$ is conditionally positive definite if and only if $\exp(tk)$ is positive definite for all $t > 0$.

Positive definite kernels of the form $\exp(tk)$ ($t > 0$) have the interesting property that their $n$-th root ($n \in \mathbb{N}$) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into $\mathbb{R}_0^+$ is a conditionally positive definite kernel.

Conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the support vector machine: maximization of the margin of separation between two classes of data is independent of the origin's position. This can be seen from the dual optimization problem (36): the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$ projects out the same subspace as (94) in the definition of conditionally positive matrices [17, 23].

We have seen that positive definite kernels and conditionally positive definite kernels are closely related to each other. The former can be represented as dot products in Hilbert spaces. The latter, it turns out, essentially correspond to distance measures associated with norms in Hilbert spaces:

Proposition 11 Let $k$ be a real-valued conditionally positive definite kernel on $\mathcal{X}$, satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$. Then there exists a Hilbert space $H$ of real-valued functions on $\mathcal{X}$, and a mapping $\Phi : \mathcal{X} \to H$, such that
\[
k(x, x') = -\|\Phi(x) - \Phi(x')\|^2. \tag{103}
\]
There exist generalizations to the case where $k(x, x) \neq 0$ and where $k$ maps into $\mathbb{C}$. In these cases, the representation looks slightly more complicated.

The significance of this proposition is that using conditionally positive definite kernels, we can thus generalize all algorithms based on distances to corresponding algorithms operating in feature spaces. This is an analogue of the kernel trick for distances rather than dot products, i.e. dissimilarities rather than similarities.
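As a small illustration of this kernel trick for distances: for a positive definite kernel, squared feature-space distances can be evaluated as $\|\Phi(x) - \Phi(x')\|^2 = k(x, x) + k(x', x') - 2\,k(x, x')$, without ever computing $\Phi$ explicitly; the kernel below is an arbitrary choice.

```python
import numpy as np

def feature_space_distance(kernel, x, xp):
    """Squared distance ||Phi(x) - Phi(x')||^2 expressed through the kernel only."""
    return kernel(x, x) + kernel(xp, xp) - 2 * kernel(x, xp)

# with the Gaussian kernel (89), feature-space squared distances lie in [0, 2)
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
x, xp = np.array([0.0, 0.0]), np.array([3.0, 1.0])
print(feature_space_distance(rbf, x, xp))
```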

Acknowledgements. Thanks to A. Smola and R. Williamson for discussions, and to C. Watkins for pointing out, in his NIPS'99 SVM workshop talk, that distances and dot products differ in the way they deal with the origin.

References

[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[4] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54, Cambridge, MA, 1999. MIT Press.

[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.

[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

[7] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.

[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.

[9] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.

[10] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.

[12] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.

[13] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.

[14] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276-285, New York, 1997. IEEE.

[15] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.

[16] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.

[17] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin.

[18] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.

[19] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.

[20] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999.

[21] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[22] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.

[23] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.

[24] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.

[25] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.

[26] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

[27] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. English translation: Springer Verlag, New York, 1982.

[28] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[29] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[30] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[31] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.

[32] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[33] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.

[34] C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, 2000. MIT Press.

[35] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.