Statistical Learning and Kernel Methods
Bernhard Schölkopf
Microsoft Research Limited,
1 Guildhall Street, Cambridge CB2 3NH, UK
http://research.microsoft.com/bsc
February 29, 2000
Technical Report
MSR-TR-2000-23
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Lecture notes for a course to be taught at the Interdisciplinary College 2000,
Gunne, Germany, March 2000.
Abstract
We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.
Contents
1 An Introductory Example
2 Learning Pattern Recognition from Examples
3 Hyperplane Classifiers
4 Support Vector Classifiers
5 Support Vector Regression
6 Further Developments
7 Kernels
8 Representing Similarities in Linear Spaces
9 Examples of Kernels
10 Representing Dissimilarities in Linear Spaces
1 An Introductory Example
Suppose we are given empirical data
\[
(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}. \tag{1}
\]
Here, the domain $\mathcal{X}$ is some nonempty set that the patterns $x_i$ are taken from; the $y_i$ are called labels or targets.
Unless stated otherwise, indices $i$ and $j$ will always be understood to run over the training set, i.e., $i, j = 1, \ldots, m$.
Note that we have not made any assumptions on the domain $\mathcal{X}$ other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern $x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$. By this we mean, loosely speaking, that we choose $y$ such that $(x, y)$ is in some sense similar to the training examples. To this end, we need similarity measures in $\mathcal{X}$ and in $\{\pm 1\}$. The latter is easy, as two target values can only be identical or different.
For the former, we require a similarity measure
\[
k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2}
\]
i.e., a function that, given two examples $x$ and $x'$, returns a real number characterizing their similarity. For reasons that will become clear later, the function $k$ is called a kernel [13, 1, 8].
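For illustration, here is a minimal sketch (in Python; not from the text) of a similarity measure of the form (2) on a non-vectorial domain, emphasizing that $\mathcal{X}$ need only be a set. The example domain and the particular choice of $k$ are assumptions made here; this $k$ does happen to be a valid kernel, since it equals a dot product of indicator vectors.

```python
# Hypothetical example: X is the collection of finite sets of strings,
# and k(x, x') counts the elements the two sets share.
def k(x: frozenset, x_prime: frozenset) -> float:
    """Similarity measure k: X x X -> R, cf. Eq. (2)."""
    return float(len(x & x_prime))

print(k(frozenset({"a", "b", "c"}), frozenset({"b", "c", "d"})))  # -> 2.0
```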
A type of similarity measure that is of particular mathematical appeal are dot products. For instance, given two vectors $x, x' \in \mathbb{R}^N$, the canonical dot product is defined as
\[
(x \cdot x') := \sum_{i=1}^{N} x_i x'_i. \tag{3}
\]
Here, $x_i$ denotes the $i$-th entry of $x$.
The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors $x$ and $x'$, provided they are normalized to length 1. Moreover, it allows computation of the length of a vector $x$ as $\sqrt{(x \cdot x)}$, and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.
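To make these geometrical quantities concrete, the following minimal sketch (in Python with NumPy; the specific vectors are made-up examples) computes the canonical dot product of Eq. (3) together with the length, distance and angle derived from it.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([2.0, 0.0, 1.0])

dot = np.dot(x, x_prime)              # (x . x'), Eq. (3)
length_x = np.sqrt(np.dot(x, x))      # length of x
dist = np.linalg.norm(x - x_prime)    # distance = length of the difference vector
# cosine of the enclosed angle; this equals the dot product itself
# once both vectors are normalized to length 1
cos_angle = dot / (length_x * np.linalg.norm(x_prime))
```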
Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to embed them into some dot product space $F$, which need not be identical to $\mathbb{R}^N$. To this end, we use a map
\[
\Phi : \mathcal{X} \to F, \qquad x \mapsto \mathbf{x}. \tag{4}
\]
The space $F$ is called a feature space. To summarize, embedding the data into $F$ has three benefits.
1. It lets us define a similarity measure from the dot product in $F$,
\[
k(x, x') := (\mathbf{x} \cdot \mathbf{x}') = (\Phi(x) \cdot \Phi(x')). \tag{5}
\]
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping $\Phi$ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map to change the representation into one that is more suitable for a given problem and learning algorithm; the sketch following this list gives a simple example of such a map.
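As an illustration of Eqs. (4) and (5), here is a minimal sketch of one possible nonlinear map $\Phi$ from $\mathbb{R}^2$ into a feature space $F = \mathbb{R}^3$. The particular choice of monomials is an example anticipating the polynomial kernels discussed later in these notes; the test vectors are made-up.

```python
import numpy as np

def phi(x):
    """Map (x1, x2) to the monomials (x1^2, sqrt(2)*x1*x2, x2^2), cf. Eq. (4)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, x_prime):
    """Kernel k(x, x') := (Phi(x) . Phi(x')), cf. Eq. (5)."""
    return np.dot(phi(x), phi(x_prime))

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, 1.0])
# For this phi, the dot product in F can be computed in input space:
assert np.isclose(k(x, x_prime), np.dot(x, x_prime) ** 2)
```

Note the design point: although $\Phi$ is nonlinear, the dot product in $F$ can here be evaluated directly from the input-space dot product, without ever computing $\Phi$ explicitly.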
We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,
\[
\mathbf{c}_1 = \frac{1}{m_1} \sum_{\{i:\, y_i = +1\}} \mathbf{x}_i, \tag{6}
\]
\[
\mathbf{c}_2 = \frac{1}{m_2} \sum_{\{i:\, y_i = -1\}} \mathbf{x}_i, \tag{7}
\]
where $m_1$ and $m_2$ are the number of examples with positive and negative labels, respectively. We then assign a new point $\mathbf{x}$ to the class whose mean is closer to it. This geometrical construction can be formulated in terms of dot products. Half-way in between $\mathbf{c}_1$ and $\mathbf{c}_2$ lies the point $\mathbf{c} := (\mathbf{c}_1 + \mathbf{c}_2)/2$. We compute the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{c}$ and $\mathbf{x}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} := \mathbf{c}_1 - \mathbf{c}_2$ connecting the class means; in other words,
\[
\begin{aligned}
y &= \operatorname{sgn}\left((\mathbf{x} - \mathbf{c}) \cdot \mathbf{w}\right) \\
  &= \operatorname{sgn}\left(\left(\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2\right) \cdot (\mathbf{c}_1 - \mathbf{c}_2)\right) \\
  &= \operatorname{sgn}\left((\mathbf{x} \cdot \mathbf{c}_1) - (\mathbf{x} \cdot \mathbf{c}_2) + b\right).
\end{aligned} \tag{8}
\]
Here, we have defined the offset
\[
b := \frac{1}{2}\left(\|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2\right), \tag{9}
\]
which results from expanding the dot product with $\mathbf{w}$ in the second line of (8).
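The following minimal sketch implements the mean-based classifier of Eqs. (6)-(8) in NumPy, taking $\Phi$ to be the identity for simplicity (i.e., working directly in input space); the toy training set is made-up.

```python
import numpy as np

def fit_means(X, y):
    """Class means c1, c2 of Eqs. (6) and (7)."""
    c1 = X[y == +1].mean(axis=0)
    c2 = X[y == -1].mean(axis=0)
    return c1, c2

def predict(x, c1, c2):
    """Decision function y = sgn((x . c1) - (x . c2) + b), Eq. (8)."""
    b = 0.5 * (np.dot(c2, c2) - np.dot(c1, c1))  # offset b of Eq. (9)
    return np.sign(np.dot(x, c1) - np.dot(x, c2) + b)

X_train = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y_train = np.array([+1, +1, -1, -1])
c1, c2 = fit_means(X_train, y_train)
print(predict(np.array([1.5, 0.5]), c1, c2))  # -> 1.0
```

Since the decision function accesses the patterns only through dot products, replacing each dot product by a kernel $k(x, x')$ would run the same algorithm implicitly in a feature space $F$.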