
Statistical Learning and Kernel Methods

Bernhard Schölkopf

Microsoft Research Limited,

1 Guildhall Street, Cambridge CB2 3NH, UK

[email protected]

http://research.microsoft.com/bsc

February 29, 2000

Technical Report

MSR-TR-2000-23

Microsoft Research

Microsoft Corporation

One Microsoft Way

Redmond, WA 98052

Lecture notes for a course to be taught at the Interdisciplinary College 2000, Gunne, Germany, March 2000.

Abstract

We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.

Contents

1 An Introductory Example

2 Learning Pattern Recognition from Examples

3 Hyperplane Classifiers

4 Support Vector Classifiers

5 Support Vector Regression

6 Further Developments

7 Kernels

8 Representing Similarities in Linear Spaces

9 Examples of Kernels

10 Representing Dissimilarities in Linear Spaces

1 An Introductory Example

Suppose we are given empirical data
\[
(x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}. \tag{1}
\]
Here, the domain $\mathcal{X}$ is some nonempty set that the patterns $x_i$ are taken from; the $y_i$ are called labels or targets.

Unless stated otherwise, indices $i$ and $j$ will always be understood to run over the training set, i.e. $i, j = 1, \dots, m$.

Note that we have not made any assumptions on the domain $\mathcal{X}$ other than it being a set. In order to study the problem of learning, we need additional structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern $x \in \mathcal{X}$, we want to predict the corresponding $y \in \{\pm 1\}$. By this we mean, loosely speaking, that we choose $y$ such that $(x, y)$ is in some sense similar to the training examples. To this end, we need similarity measures in $\mathcal{X}$ and in $\{\pm 1\}$. The latter is easy, as two target values can only be identical or different. For the former, we require a similarity measure
\[
k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x'), \tag{2}
\]
i.e., a function that, given two examples $x$ and $x'$, returns a real number characterizing their similarity. For reasons that will become clear later, the function $k$ is called a kernel [13, 1, 8].

A type of similarity measure that is of particular mathematical appeal are dot products. For instance, given two vectors $x, x' \in \mathbb{R}^N$, the canonical dot product is defined as
\[
(x \cdot x') := \sum_{i=1}^{N} x_i\, x'_i. \tag{3}
\]
Here, $x_i$ denotes the $i$-th entry of $x$.

The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors $x$ and $x'$, provided they are normalized to length 1. Moreover, it allows computation of the length of a vector $x$ as $\sqrt{x \cdot x}$, and of the distance between two vectors as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity measure, we therefore first need to embed them into some dot product space $F$, which need not be identical to $\mathbb{R}^N$. To this end, we use a map
\[
\Phi : \mathcal{X} \to F, \qquad x \mapsto \mathbf{x}. \tag{4}
\]
The space $F$ is called a feature space. To summarize, embedding the data into $F$ has three benefits.

1. It lets us define a similarity measure from the dot product in $F$,
\[
k(x, x') := (\mathbf{x} \cdot \mathbf{x}') = (\Phi(x) \cdot \Phi(x')). \tag{5}
\]

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping $\Phi$ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still choose to first apply a nonlinear map $\Phi$ to change the representation into one that is more suitable for a given problem and learning algorithm.

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute the means of the two classes in feature space,
\[
\mathbf{c}_1 = \frac{1}{m_1} \sum_{\{i : y_i = +1\}} \mathbf{x}_i, \tag{6}
\]
\[
\mathbf{c}_2 = \frac{1}{m_2} \sum_{\{i : y_i = -1\}} \mathbf{x}_i, \tag{7}
\]
where $m_1$ and $m_2$ are the number of examples with positive and negative labels, respectively. We then assign a new point $\mathbf{x}$ to the class whose mean is closer to it. This geometrical construction can be formulated in terms of dot products. Half-way in between $\mathbf{c}_1$ and $\mathbf{c}_2$ lies the point $\mathbf{c} := (\mathbf{c}_1 + \mathbf{c}_2)/2$. We compute the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{c}$ and $\mathbf{x}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} := \mathbf{c}_1 - \mathbf{c}_2$ connecting the class means, in other words
\[
y = \operatorname{sgn}\left((\mathbf{x} - \mathbf{c}) \cdot \mathbf{w}\right)
= \operatorname{sgn}\left((\mathbf{x} - (\mathbf{c}_1 + \mathbf{c}_2)/2) \cdot (\mathbf{c}_1 - \mathbf{c}_2)\right)
= \operatorname{sgn}\left((\mathbf{x} \cdot \mathbf{c}_1) - (\mathbf{x} \cdot \mathbf{c}_2) + b\right). \tag{8}
\]
Here, we have defined the offset
\[
b := \frac{1}{2}\left(\|\mathbf{c}_2\|^2 - \|\mathbf{c}_1\|^2\right). \tag{9}
\]

It will prove instructive to rewrite this expression in terms of the patterns $x_i$ in the input domain $\mathcal{X}$. To this end, note that we do not have a dot product in $\mathcal{X}$; all we have is the similarity measure $k$ (cf. (5)). Therefore, we need to rewrite everything in terms of the kernel $k$ evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function
\[
y = \operatorname{sgn}\left(\frac{1}{m_1} \sum_{\{i : y_i = +1\}} (\mathbf{x} \cdot \mathbf{x}_i) - \frac{1}{m_2} \sum_{\{i : y_i = -1\}} (\mathbf{x} \cdot \mathbf{x}_i) + b\right)
= \operatorname{sgn}\left(\frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i) - \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i) + b\right). \tag{10}
\]
Similarly, the offset becomes
\[
b := \frac{1}{2}\left(\frac{1}{m_2^2} \sum_{\{(i,j) : y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_1^2} \sum_{\{(i,j) : y_i = y_j = +1\}} k(x_i, x_j)\right). \tag{11}
\]
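As a concrete illustration, here is a minimal NumPy sketch of the decision function (10) together with the offset (11); the Gaussian similarity measure and all function names are merely illustrative choices.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), one possible similarity measure
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def mean_classifier(X_train, y_train, x_new, kernel=rbf_kernel):
    """Assign x_new to the class whose feature-space mean is closer, cf. (10), (11).

    X_train: (m, d) array of patterns; y_train: array of +1/-1 labels.
    """
    pos, neg = X_train[y_train == +1], X_train[y_train == -1]
    # kernel expansions centered on the training examples, cf. (10)
    s_pos = np.mean([kernel(x_new, xi) for xi in pos])
    s_neg = np.mean([kernel(x_new, xi) for xi in neg])
    # offset (11): half the difference of the average within-class similarities
    b = 0.5 * (np.mean([kernel(xi, xj) for xi in neg for xj in neg])
               - np.mean([kernel(xi, xj) for xi in pos for xj in pos]))
    return np.sign(s_pos - s_neg + b)
```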

Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence $b = 0$), and that $k$ can be viewed as a density, i.e. it is positive and has integral 1,
\[
\int_{\mathcal{X}} k(x, x')\, dx = 1 \quad \text{for all } x' \in \mathcal{X}. \tag{12}
\]
In order to state this assumption, we have to require that we can define an integral on $\mathcal{X}$.

If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,
\[
p_1(x) := \frac{1}{m_1} \sum_{\{i : y_i = +1\}} k(x, x_i), \tag{13}
\]
\[
p_2(x) := \frac{1}{m_2} \sum_{\{i : y_i = -1\}} k(x, x_i). \tag{14}
\]
Given some point $x$, the label is then simply computed by checking which of the two, $p_1(x)$ or $p_2(x)$, is larger, which directly leads to (10). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain, it is represented by a kernel expansion. It is example-based in the sense that the kernels are centered on the training examples, i.e. one of the two arguments of the kernels is always a training example. The main point where the more sophisticated techniques to be discussed later will deviate from (10) is in the selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function. Namely, it will no longer be the case that all training examples appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform. In the feature space representation, this statement corresponds to saying that we will study all normal vectors $\mathbf{w}$ of decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (10)). The hyperplane will then only depend on a subset of training examples, called support vectors.

2 Learning Pattern Recognition from Examples

With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting [27, 28], following the introduction of [19].

In two-class pattern recognition, we seek to estimate a function
\[
f : \mathcal{X} \to \{\pm 1\} \tag{15}
\]
based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution $P(x, y)$. Our goal is to learn a function that will correctly classify unseen examples $(x, y)$, i.e. we want $f(x) = y$ for examples $(x, y)$ that were also generated from $P(x, y)$.

If we put no restriction on the class of functions that we choose our estimate $f$ from, however, even a function which does well on the training data, e.g. by satisfying $f(x_i) = y_i$ for all $i = 1, \dots, m$, need not generalize well to unseen examples. To see this, note that for each function $f$ and any test set $(\bar{x}_1, \bar{y}_1), \dots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathbb{R}^N \times \{\pm 1\}$ satisfying $\{\bar{x}_1, \dots, \bar{x}_{\bar{m}}\} \cap \{x_1, \dots, x_m\} = \emptyset$, there exists another function $f^*$ such that $f^*(x_i) = f(x_i)$ for all $i = 1, \dots, m$, yet $f^*(\bar{x}_i) \neq f(\bar{x}_i)$ for all $i = 1, \dots, \bar{m}$. As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only minimizing the training error (or empirical risk),
\[
R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left| f(x_i) - y_i \right|, \tag{16}
\]
does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution $P(x, y)$,
\[
R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y). \tag{17}
\]

Statistical learning theory [31, 27, 28, 29], or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that $f$ is chosen from to one which has a capacity that is suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [27]. The best-known capacity concept of VC theory is the VC dimension, defined as the largest number $h$ of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if $h < m$ is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least $1 - \eta$, the bound
\[
R(\alpha) \leq R_{\mathrm{emp}}(\alpha) + \phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) \tag{18}
\]
holds, where the confidence term $\phi$ is defined as
\[
\phi\!\left(\frac{h}{m}, \frac{\log(\eta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\eta/4)}{m}}. \tag{19}
\]
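As a small illustration, the confidence term (19) and the right-hand side of (18) can be evaluated numerically; the values of $h$, $m$ and $\eta$ used below are arbitrary.

```python
import math

def vc_confidence(h, m, eta):
    """Confidence term (19), assuming h < m and 0 < eta < 1."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

def risk_bound(r_emp, h, m, eta):
    """Right-hand side of the VC bound (18); it is void whenever it exceeds 1."""
    return r_emp + vc_confidence(h, m, eta)

# e.g. zero training error, VC dimension 100, 10000 examples, confidence 95%
print(risk_bound(0.0, h=100, m=10000, eta=0.05))
```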

Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the Growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory [28]. Alternative capacity concepts that can be used to formulate bounds include the fat shattering dimension [2].

The bound (18) deserves some further explanatory remarks. Suppose we wanted to learn a "dependency" where $P(x, y) = P(x) \cdot P(y)$, i.e. where the pattern $x$ contains no information about the label $y$, with uniform $P(y)$. Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other). However, in order to reproduce the random labellings, this machine will necessarily require a large VC dimension $h$. Thus, the confidence term (19), increasing monotonically with $h$, will be large, and the bound (18) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (18) can hold independent of assumptions about the underlying distribution $P(x, y)$: it always holds (provided that $h < m$), but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (18), the function space must be restricted such that the capacity (e.g. VC dimension) is small enough in relation to the available amount of data.

3 Hyperplane Classifiers

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced previously). As described in the previous section, to design learning algorithms, one needs to come up with a class of functions whose capacity can be computed.

[32] and [30] considered the class of hyperplanes
\[
(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbb{R}^N, \; b \in \mathbb{R}, \tag{20}
\]
corresponding to decision functions
\[
f(\mathbf{x}) = \operatorname{sgn}\left((\mathbf{w} \cdot \mathbf{x}) + b\right), \tag{21}
\]
and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing $f$ from empirical data. It is based on two facts. First, among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes,
\[
\max_{\mathbf{w}, b} \; \min \left\{ \|\mathbf{x} - \mathbf{x}_i\| : \mathbf{x} \in \mathbb{R}^N, \; (\mathbf{w} \cdot \mathbf{x}) + b = 0, \; i = 1, \dots, m \right\}. \tag{22}
\]
Second, the capacity decreases with increasing margin.

To construct this Optimal Hyperplane (cf. Figure 1), one solves the following optimization problem:
\[
\text{minimize} \quad \tau(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2 \tag{23}
\]
\[
\text{subject to} \quad y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq 1, \quad i = 1, \dots, m. \tag{24}
\]
This constrained optimization problem is dealt with by introducing Lagrange multipliers $\alpha_i \geq 0$ and a Lagrangian
\[
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i \left((\mathbf{x}_i \cdot \mathbf{w}) + b\right) - 1 \right). \tag{25}
\]

The Lagrangian $L$ has to be minimized with respect to the primal variables $\mathbf{w}$ and $b$ and maximized with respect to the dual variables $\alpha_i$ (i.e. a saddle point has to be found). Let us try to get some intuition for this. If a constraint (24) is violated, then $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 < 0$, in which case $L$ can be increased by increasing the corresponding $\alpha_i$. At the same time, $\mathbf{w}$ and $b$ will have to change such that $L$ decreases. To prevent $\alpha_i (y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1)$ from becoming arbitrarily large, the change in $\mathbf{w}$ and $b$ will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities, i.e. for which $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) - 1 > 0$, the corresponding $\alpha_i$ must be 0: this is the value of $\alpha_i$ that maximizes $L$. The latter is the statement of the Karush-Kuhn-Tucker complementarity conditions of optimization theory [6].

The condition that at the saddle point, the derivatives of $L$ with respect to the primal variables must vanish,
\[
\frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \tag{26}
\]

Figure 1: A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector $\mathbf{w}$ and a threshold $b$ such that $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) > 0$ ($i = 1, \dots, m$). Rescaling $\mathbf{w}$ and $b$ such that the points closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, we obtain a canonical form $(\mathbf{w}, b)$ of the hyperplane, satisfying $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals $2/\|\mathbf{w}\|$. This can be seen by considering two points $\mathbf{x}_1, \mathbf{x}_2$ on opposite sides of the margin, i.e. $(\mathbf{w} \cdot \mathbf{x}_1) + b = 1$, $(\mathbf{w} \cdot \mathbf{x}_2) + b = -1$, and projecting them onto the hyperplane normal vector $\mathbf{w}/\|\mathbf{w}\|$.

leads to
\[
\sum_{i=1}^{m} \alpha_i y_i = 0 \tag{27}
\]
and
\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i. \tag{28}
\]
The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose $\alpha_i$ is non-zero, called Support Vectors. By the Karush-Kuhn-Tucker complementarity conditions
\[
\alpha_i \left[ y_i \left((\mathbf{x}_i \cdot \mathbf{w}) + b\right) - 1 \right] = 0, \quad i = 1, \dots, m, \tag{29}
\]
the Support Vectors lie on the margin (cf. Figure 1). All remaining examples of the training set are irrelevant: their constraint (24) does not play a role in the optimization, and they do not appear in the expansion (28). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 1) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (27) and (28) into $L$, one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem [e.g. 6]: find multipliers $\alpha_i$ which
\[
\text{maximize} \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{30}
\]
\[
\text{subject to} \quad \alpha_i \geq 0, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{31}
\]
The hyperplane decision function can thus be written as
\[
f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i (\mathbf{x} \cdot \mathbf{x}_i) + b \right), \tag{32}
\]
where $b$ is computed using (29).
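To make the construction concrete, here is a minimal sketch that solves the dual (30)-(31) numerically for a small separable data set and recovers $\mathbf{w}$ and $b$ via (28) and (29). It relies on scipy's general-purpose constrained optimizer rather than a dedicated quadratic programming code, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve the Wolfe dual (30)-(31) for a linearly separable two-class set."""
    m = len(y)
    K = X @ X.T                                  # dot products (x_i . x_j)
    Q = (y[:, None] * y[None, :]) * K

    def neg_W(a):                                # minimize -W(alpha), cf. (30)
        return -(a.sum() - 0.5 * a @ Q @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
    bounds = [(0, None)] * m                           # alpha_i >= 0
    res = minimize(neg_W, np.zeros(m), bounds=bounds, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X                          # expansion (28)
    sv = np.argmax(alpha)                        # index of a support vector
    b = y[sv] - w @ X[sv]                        # from the KKT condition (29)
    return w, b, alpha
```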

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics. Also there, often only a subset of the constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant; the walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [9]: If we assume that each support vector $\mathbf{x}_i$ exerts a perpendicular force of size $\alpha_i$ and sign $y_i$ on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability. The constraint (27) states that the forces on the sheet sum to zero; and (28) implies that the torques also sum to zero, via $\sum_i \mathbf{x}_i \times y_i \alpha_i\, \mathbf{w}/\|\mathbf{w}\| = \mathbf{w} \times \mathbf{w}/\|\mathbf{w}\| = 0$.

There are theoretical arguments supporting the good generalization performance of the optimal hyperplane [31, 27, 35, 4]. In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming problem.

4 Support Vector Classifiers

We now have all the tools to describe support vector machines [28, 19, 26]. Everything in the last section was formulated in a dot product space. We think of this space as the feature space $F$ described in Section 1. To express the formulas in terms of the input patterns living in $\mathcal{X}$, we thus need to employ (5), which expresses the dot product of bold face feature vectors $\mathbf{x}, \mathbf{x}'$ in terms of the kernel $k$ evaluated on input patterns $x, x'$,
\[
k(x, x') = (\mathbf{x} \cdot \mathbf{x}'). \tag{33}
\]

Figure 2: The idea of SV machines: map the training data into a higher-dimensional feature space via $\Phi$, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

This can be done since all feature vectors only occurred in dot products. The weight vector (cf. (28)) then becomes an expansion in feature space, and will thus typically no longer correspond to the image of a single vector from input space. We thus obtain decision functions of the more general form (cf. (32))
\[
f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i \left(\Phi(x) \cdot \Phi(x_i)\right) + b \right)
= \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i\, k(x, x_i) + b \right), \tag{34}
\]
and the following quadratic program (cf. (30)):
\[
\text{maximize} \quad W(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) \tag{35}
\]
\[
\text{subject to} \quad \alpha_i \geq 0, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{36}
\]

In practice, a separating hyperplane may not exist, e.g. if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (24), one introduces slack variables [10, 28, 22]
\[
\xi_i \geq 0, \quad i = 1, \dots, m, \tag{37}
\]
in order to relax the constraints to
\[
y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq 1 - \xi_i, \quad i = 1, \dots, m. \tag{38}
\]
A classifier which generalizes well is then found by controlling both the classifier capacity (via $\|\mathbf{w}\|$) and the sum of the slacks $\sum_i \xi_i$. The latter is done as it can be shown to provide an upper bound on the number of training errors, which leads to a convex optimization problem.

Figure 3: Example of a Support Vector classifier found by using a radial basis function kernel $k(x, x') = \exp(-\|x - x'\|^2)$. Both coordinate axes range from $-1$ to $+1$. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (24). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the argument $\sum_{i=1}^{m} y_i \alpha_i\, k(x, x_i) + b$ of the decision function (34).

One possible realization of a soft margin classifier is minimizing the objective function
\[
\tau(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \tag{39}
\]
subject to the constraints (37) and (38), for some value of the constant $C > 0$ determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors, $\boldsymbol{\xi} = (\xi_1, \dots, \xi_m)$. Incorporating kernels, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (35), subject to the constraints
\[
0 \leq \alpha_i \leq C, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{40}
\]
The only difference from the separable case is the upper bound $C$ on the Lagrange multipliers $\alpha_i$. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (34). The threshold $b$ can be computed by exploiting the fact that for all SVs $x_i$ with $\alpha_i < C$, the slack variable $\xi_i$ is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence
\[
\sum_{j=1}^{m} y_j \alpha_j\, k(x_i, x_j) + b = y_i. \tag{41}
\]

Figure 4: In SV regression, a tube with radius $\varepsilon$ is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables $\xi$) is determined by minimizing (46).

Another possible realization of a soft margin variant of the optimal hyperplane uses the $\nu$-parametrization [22]. In it, the parameter $C$ is replaced by a parameter $\nu \in [0, 1]$ which can be shown to lower and upper bound the number of examples that will be SVs and that will come to lie on the wrong side of the hyperplane, respectively. It uses a primal objective function with the error term $\frac{1}{\nu m} \sum_i \xi_i - \rho$, and separation constraints
\[
y_i \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \geq \rho - \xi_i, \quad i = 1, \dots, m. \tag{42}
\]
The margin parameter $\rho$ is a variable of the optimization problem. The dual can be shown to consist of maximizing the quadratic part of (35), subject to $0 \leq \alpha_i \leq 1/(\nu m)$, $\sum_i \alpha_i y_i = 0$ and the additional constraint $\sum_i \alpha_i = 1$.
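Both the $C$-parametrized soft margin machine and the $\nu$-parametrized variant are implemented in common software; the following is a brief sketch assuming scikit-learn, with arbitrary toy data and parameter values.

```python
import numpy as np
from sklearn.svm import SVC, NuSVC

# toy two-class data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])

# C-parametrized soft margin SV classifier with an RBF kernel, cf. (35), (40)
clf_c = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# nu-parametrized variant: nu bounds the fractions of margin errors and of SVs
clf_nu = NuSVC(kernel="rbf", nu=0.1, gamma=0.5).fit(X, y)

print(len(clf_c.support_), len(clf_nu.support_))   # numbers of support vectors
print(clf_c.predict([[0.0, 0.0]]))
```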

5 Support Vector Regression

The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation [28], an analogue of the margin is constructed in the space of the target values $y$ (note that in regression, we have $y \in \mathbb{R}$) by using Vapnik's $\varepsilon$-insensitive loss function (Figure 4),
\[
|y - f(x)|_\varepsilon := \max\{0, \; |y - f(x)| - \varepsilon\}. \tag{43}
\]

To estimate a linear function
\[
f(x) = (\mathbf{w} \cdot \mathbf{x}) + b \tag{44}
\]
with precision $\varepsilon$, one minimizes
\[
\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon. \tag{45}
\]

Written as a constrained optimization problem, this reads:
\[
\text{minimize} \quad \tau(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}^*) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \tag{46}
\]
\[
\text{subject to} \quad \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) - y_i \leq \varepsilon + \xi_i, \tag{47}
\]
\[
\phantom{\text{subject to}} \quad y_i - \left((\mathbf{w} \cdot \mathbf{x}_i) + b\right) \leq \varepsilon + \xi_i^*, \tag{48}
\]
\[
\phantom{\text{subject to}} \quad \xi_i, \xi_i^* \geq 0 \tag{49}
\]
for all $i = 1, \dots, m$. Note that according to (47) and (48), any error smaller than $\varepsilon$ does not require a nonzero $\xi_i$ or $\xi_i^*$, and hence does not enter the objective function (46).

Generalization to kernel-based regression estimation is carried out in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for $C > 0$ and $\varepsilon \geq 0$ chosen a priori,
\[
\text{maximize} \quad W(\boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = -\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, k(x_i, x_j) \tag{50}
\]
\[
\text{subject to} \quad 0 \leq \alpha_i, \alpha_i^* \leq C, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) = 0. \tag{51}
\]

The regression estimate takes the form
\[
f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i)\, k(x_i, x) + b, \tag{52}
\]
where $b$ is computed using the fact that (47) becomes an equality with $\xi_i = 0$ if $0 < \alpha_i < C$, and (48) becomes an equality with $\xi_i^* = 0$ if $0 < \alpha_i^* < C$.
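For illustration, the $\varepsilon$-insensitive regression of (46)-(52) can be run with scikit-learn's SVR; the toy data and parameter choices below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

# toy 1-D regression data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# epsilon-insensitive SV regression with an RBF kernel, cf. (46)-(52)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
svr.fit(X, y)

# only points outside (or on) the epsilon-tube become support vectors
print("number of SVs:", len(svr.support_))
print("prediction at x=0:", svr.predict([[0.0]]))
```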

Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector $(\mathbf{w}, \boldsymbol{\xi})$ (cf. (46)). There are multiple degrees of freedom for constructing it, including some freedom how to penalize, or regularize, different parts of the vector, and some freedom how to use the kernel trick. For instance, more general loss functions can be used for $\boldsymbol{\xi}$, leading to problems that can still be solved efficiently [24]. Moreover, norms other than the 2-norm $\|\cdot\|$ can be used to regularize the solution. Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of $\mathbf{w}$ [17].

Figure 5: Architecture of SV machines. The input $x$ and the Support Vectors $x_i$ are nonlinearly mapped (by $\Phi$) into a feature space $F$, where dot products are computed. By the use of the kernel $k$, these two layers are in practice computed in one single step. The results are linearly combined by weights $\upsilon_i$, found by solving a quadratic program (in pattern recognition, $\upsilon_i = y_i \alpha_i$; in regression estimation, $\upsilon_i = \alpha_i^* - \alpha_i$). The linear combination is fed into the function $\sigma$ (in pattern recognition, $\sigma(x) = \operatorname{sgn}(x + b)$; in regression estimation, $\sigma(x) = x + b$).

Finally, the algorithm can be modified such that $\varepsilon$ need not be specified a priori. Instead, one specifies an upper bound $0 \leq \nu \leq 1$ on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs), and the corresponding $\varepsilon$ is computed automatically. This is achieved by using as primal objective function
\[
\frac{1}{2} \|\mathbf{w}\|^2 + C \left( \nu m \varepsilon + \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon \right) \tag{53}
\]
instead of (45), and treating $\varepsilon \geq 0$ as a parameter that we minimize over [22].

6 Further Developments

Having described the basics of SV machines, we now summarize some empirical findings and theoretical developments which were to follow.

By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines, they empirically lead to very similar classification accuracies and SV sets [18]. In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e. the type of classifier) used.

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with the best available classifiers on both OCR and object recognition tasks [7, 9, 17].

Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (35), the quadratic part contained at least all SVs: the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be $m \times m$, where $m$ is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables $\xi_i$ will become nonzero, and all the corresponding examples will become SVs. For this case, a decomposition algorithm was proposed [14], which is based on the observation that not only can we leave out the non-SV examples (i.e. the $x_i$ with $\alpha_i = 0$) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e. $\alpha_i = C$). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding sub-problems. SMO [15, 25, 20] explores an extreme case, where the sub-problems are chosen so small that one can solve them analytically. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on the optimization problem, see [19].

On the theoretical side, the least understood part of the SV algorithm initially was the precise role of the kernel, and how a certain kernel choice would influence the generalization ability. In that respect, the connection to regularization theory provided some insight. For kernel-based function expansions, one can show that given a regularization operator $P$ mapping the functions of the learning machine into some dot product space, the problem of minimizing the regularized risk
\[
R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2} \|Pf\|^2 \tag{54}
\]
(with a regularization parameter $\lambda \geq 0$) can be written as a constrained optimization problem. For particular choices of the loss function, it further reduces to a SV type quadratic programming problem. The latter thus is not specific to SV machines, but is common to a much wider class of approaches. What gets lost in the general case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specific feature of SV machines is due to the fact that the type of regularization and the class of functions that the estimate is chosen from are intimately related [11, 23]: the SV algorithm is equivalent to minimizing the regularized risk on the set of functions
\[
f(x) = \sum_i \alpha_i\, k(x_i, x) + b, \tag{55}
\]
provided that $k$ and $P$ are interrelated by
\[
k(x_i, x_j) = \left( (Pk)(x_i, \cdot) \cdot (Pk)(x_j, \cdot) \right). \tag{56}
\]
To this end, $k$ is chosen as a Green's function of $P^*P$, for in that case, the right hand side of (56) equals
\[
\left( k(x_i, \cdot) \cdot (P^*Pk)(x_j, \cdot) \right) = \left( k(x_i, \cdot) \cdot \delta_{x_j}(\cdot) \right) = k(x_i, x_j). \tag{57}
\]
For instance, a Gaussian RBF kernel thus corresponds to regularization with a functional containing a specific differential operator.

In SV machines, the kernel thus plays a dual role: firstly, it determines the class of functions (55) that the solution is taken from; secondly, via (56), the kernel determines the type of regularization that is used.

We conclude this section by noticing that the kernel method for computing dot products in feature spaces is not restricted to SV machines. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis [21], and a number of developments have followed this example.
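As an illustration of this remark, here is a brief sketch applying kernel principal component analysis, as implemented in scikit-learn, to a toy data set; the kernel and its parameter are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# toy data: two concentric rings, which have no useful linear structure
rng = np.random.RandomState(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.hstack([np.ones(100), 3 * np.ones(100)]) + 0.1 * rng.randn(200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# PCA carried out implicitly in the feature space of an RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0)
Z = kpca.fit_transform(X)
print(Z.shape)   # (200, 2): nonlinear features extracted via the kernel trick
```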

7 Kernels

We now take a closer look at the issue of the similarity measure, or kernel, $k$. In this section, we think of $\mathcal{X}$ as a subset of the vector space $\mathbb{R}^N$, $N \in \mathbb{N}$, endowed with the canonical dot product (3).

7.1 Product Features

Suppose we are given patterns $x \in \mathbb{R}^N$ where most information is contained in the $d$-th order products (monomials) of entries $[x]_j$ of $x$,
\[
[x]_{j_1} \cdot \ldots \cdot [x]_{j_d}, \tag{58}
\]
where $j_1, \dots, j_d \in \{1, \dots, N\}$. In that case, we might prefer to extract these product features, and work in the feature space $F$ of all products of $d$ entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels.

For instance, in $\mathbb{R}^2$, we can collect all monomial feature extractors of degree 2 in the nonlinear map
\[
\Phi : \mathbb{R}^2 \to F = \mathbb{R}^3, \tag{59}
\]
\[
([x]_1, [x]_2) \mapsto ([x]_1^2, \; [x]_2^2, \; [x]_1 [x]_2). \tag{60}
\]

This approach works fine for small toy examples, but it fails for realistically sized problems: for $N$-dimensional input patterns, there exist
\[
N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!} \tag{61}
\]
different monomials (58), comprising a feature space $F$ of dimensionality $N_F$. For instance, already $16 \times 16$ pixel input images and a monomial degree $d = 5$ yield a dimensionality of $10^{10}$.

In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space $\mathbb{R}^N$. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.

The following section describes how dot products in polynomial feature spaces can be computed efficiently.

7.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form $(\Phi(x) \cdot \Phi(x'))$, we employ kernel representations of the form
\[
k(x, x') = (\Phi(x) \cdot \Phi(x')), \tag{62}
\]
which allow us to compute the value of the dot product in $F$ without having to carry out the map $\Phi$. This method was used by [8] to extend the Generalized Portrait hyperplane classifier of [31] to nonlinear Support Vector machines. In [1], $F$ is termed the linearization space, and used in the context of the potential function classification method to express the dot product between elements of $F$ in terms of elements of the input space.

What does $k$ look like for the case of polynomial features? We start by giving an example [28] for $N = d = 2$. For the map
\[
C_2 : ([x]_1, [x]_2) \mapsto ([x]_1^2, \; [x]_2^2, \; [x]_1 [x]_2, \; [x]_2 [x]_1), \tag{63}
\]
dot products in $F$ take the form
\[
(C_2(x) \cdot C_2(x')) = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2 [x]_1 [x]_2 [x']_1 [x']_2 = (x \cdot x')^2, \tag{64}
\]

i.e. the desired kernel $k$ is simply the square of the dot product in input space. The same works for arbitrary $N, d \in \mathbb{N}$ [8]: as a straightforward generalization of a result proved in the context of polynomial approximation [16, Lemma 2.1], we have:

Proposition 1 Define $C_d$ to map $x \in \mathbb{R}^N$ to the vector $C_d(x)$ whose entries are all possible $d$-th degree ordered products of the entries of $x$. Then the corresponding kernel computing the dot product of vectors mapped by $C_d$ is
\[
k(x, x') = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d. \tag{65}
\]

Proof. We directly compute
\[
(C_d(x) \cdot C_d(x')) = \sum_{j_1, \dots, j_d = 1}^{N} [x]_{j_1} \cdot \ldots \cdot [x]_{j_d} \cdot [x']_{j_1} \cdot \ldots \cdot [x']_{j_d} \tag{66}
\]
\[
= \left( \sum_{j=1}^{N} [x]_j \cdot [x']_j \right)^{\!d} = (x \cdot x')^d. \tag{67}
\]

Instead of ordered products, we can use unordered ones to obtain a map $\Phi_d$ which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in $C_d$ by scaling the respective entries of $\Phi_d$ with the square roots of their numbers of occurrence. Then, by this definition of $\Phi_d$, and (65),
\[
(\Phi_d(x) \cdot \Phi_d(x')) = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d. \tag{68}
\]
For instance, if $n$ of the $j_i$ in (58) are equal, and the remaining ones are different, then the coefficient in the corresponding component of $\Phi_d$ is $\sqrt{(d - n + 1)!}$ [for the general case, cf. 23]. For $\Phi_2$, this simply means that [28]
\[
\Phi_2(x) = ([x]_1^2, \; [x]_2^2, \; \sqrt{2}\,[x]_1 [x]_2). \tag{69}
\]

If $x$ represents an image with the entries being pixel values, we can use the kernel $(x \cdot x')^d$ to work in the space spanned by products of any $d$ pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern $\Phi_d(x)$. Using kernels of the form (65), we take into account higher-order statistics without the combinatorial explosion (cf. (61)) of time and memory complexity which goes along already with moderately high $N$ and $d$.

To conclude this section, note that it is possible to modify (65) such that it maps into the space of all monomials up to degree $d$, defining [28]
\[
k(x, x') = \left( (x \cdot x') + 1 \right)^d. \tag{70}
\]
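A small numerical check of (68) and (69) for $N = d = 2$: the explicit map $\Phi_2$ and the polynomial kernel (65) yield the same dot product, without ever forming the feature vectors when the kernel is used.

```python
import numpy as np

def phi2(x):
    # unordered degree-2 monomial map (69) for x in R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

lhs = phi2(x) @ phi2(xp)       # explicit feature-space dot product
rhs = (x @ xp) ** 2            # kernel evaluation (65) with d = 2
print(np.allclose(lhs, rhs))   # True: the kernel avoids the explicit map
```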

8 Representing Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the kernel. Given some kernel function, can we construct a feature space such that the kernel computes the dot product in that feature space? This question has been brought to the attention of the community by [1, 8, 28]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [5]; indeed, a large part of the material in the present section is based on that work.

There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data. The results in the current section, in contrast, hold for data drawn from domains which need no additional structure other than them being nonempty sets $\mathcal{X}$. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available [17, 12, 34].

We start with some basic definitions and results.

Definition 2 (Gram matrix) Given a kernel $k$ and patterns $x_1, \dots, x_m \in \mathcal{X}$, the $m \times m$ matrix
\[
K_{ij} := k(x_i, x_j) \tag{71}
\]
is called the Gram matrix (or kernel matrix) of $k$ with respect to $x_1, \dots, x_m$.

Definition 3 (Positive matrix) An $m \times m$ matrix $K_{ij}$ satisfying
\[
\sum_{i,j} c_i \bar{c}_j K_{ij} \geq 0 \tag{72}
\]
for all $c_i \in \mathbb{C}$ is called positive. (The bar in $\bar{c}_j$ denotes complex conjugation.)

Definition 4 (Positive definite kernel) Let $\mathcal{X}$ be a nonempty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ which for all $m \in \mathbb{N}$, $x_i \in \mathcal{X}$ gives rise to a positive Gram matrix is called a positive definite kernel. Often, we shall refer to it simply as a kernel.

The term kernel stems from the first use of this type of function in the study of integral operators. A function $k$ which gives rise to an operator $T$ via
\[
(Tf)(x) = \int_{\mathcal{X}} k(x, x')\, f(x')\, dx' \tag{73}
\]
is called the kernel of $T$. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (72) only occurs if $c_1 = \ldots = c_m = 0$. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive. In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel.

The definitions for positive definite kernels and positive matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated.
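To make Definitions 2 and 3 concrete, here is a short sketch that builds the Gram matrix (71) for a set of points and tests positivity (72) via the eigenvalues of the real, symmetric matrix; the kernel and the tolerance are arbitrary choices.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Gram matrix (71) of a kernel with respect to the points xs."""
    return np.array([[kernel(xi, xj) for xj in xs] for xi in xs])

def is_positive(K, tol=1e-10):
    """Check (72) for a real symmetric Gram matrix via its eigenvalues."""
    return np.all(np.linalg.eigvalsh(K) >= -tol)

rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
xs = np.random.RandomState(0).randn(20, 3)
print(is_positive(gram_matrix(rbf, xs)))   # True for a positive definite kernel
```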

Positive definiteness implies positivity on the diagonal,
\[
k(x_1, x_1) \geq 0 \quad \text{for all } x_1 \in \mathcal{X} \tag{74}
\]
(use $m = 1$ in (72)), and symmetry, i.e.
\[
k(x_i, x_j) = \overline{k(x_j, x_i)}. \tag{75}
\]
Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is analogous, i.e. $K_{ij} = \overline{K_{ji}}$.

Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in $\mathbb{C} \setminus \mathbb{R}$. However, it is not sufficient to require that (72) hold for real coefficients $c_i$. If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric,
\[
k(x_i, x_j) = k(x_j, x_i). \tag{76}
\]
It can be shown that whenever $k$ is a complex-valued positive definite kernel, its real part is a real-valued positive definite kernel.

Kernels can be regarded as generalized dot products. Indeed, any dot product can be shown to be a kernel; however, linearity does not carry over from dot products to general kernels. Another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:

Proposition 5 If $k$ is a positive definite kernel, and $x_1, x_2 \in \mathcal{X}$, then
\[
|k(x_1, x_2)|^2 \leq k(x_1, x_1) \cdot k(x_2, x_2). \tag{77}
\]

Proof. For the sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The $2 \times 2$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$ is positive. Hence both its eigenvalues are nonnegative, and so is their product, $K$'s determinant, i.e.
\[
0 \leq K_{11} K_{22} - K_{12} K_{21} = K_{11} K_{22} - K_{12} \overline{K_{12}} = K_{11} K_{22} - |K_{12}|^2. \tag{78}
\]
Substituting $k(x_i, x_j)$ for $K_{ij}$, we get the desired inequality.

We are now in a position to construct the feature space associated with a kernel $k$.

We define a map from $\mathcal{X}$ into the space of functions mapping $\mathcal{X}$ into $\mathbb{C}$, denoted as $\mathbb{C}^{\mathcal{X}}$, via
\[
\Phi : \mathcal{X} \to \mathbb{C}^{\mathcal{X}}, \qquad x \mapsto k(\cdot, x). \tag{79}
\]
Here, $\Phi(x) = k(\cdot, x)$ denotes the function that assigns the value $k(x', x)$ to $x' \in \mathcal{X}$.

We have thus turned each pattern into a function on the domain $\mathcal{X}$. In a sense, a pattern is now represented by its similarity to all other points in the input domain $\mathcal{X}$. This seems a very rich representation, but it will turn out that the kernel allows the computation of the dot product in that representation.

We shall now construct a dot product space containing the images of the input patterns under $\Phi$. To this end, we first need to endow it with the linear structure of a vector space. This is done by forming linear combinations of the form
\[
f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i). \tag{80}
\]
Here, $m \in \mathbb{N}$, $\alpha_i \in \mathbb{C}$ and $x_i \in \mathcal{X}$ are arbitrary.

Next, we define a dot product between $f$ and another function
\[
g(\cdot) = \sum_{j=1}^{m'} \beta_j\, k(\cdot, x'_j) \tag{81}
\]
(where $m' \in \mathbb{N}$, $\beta_j \in \mathbb{C}$ and $x'_j \in \mathcal{X}$) as
\[
\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \bar{\alpha}_i \beta_j\, k(x_i, x'_j). \tag{82}
\]

To see that this is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that
\[
\langle f, g \rangle = \sum_{j=1}^{m'} \beta_j\, \overline{f(x'_j)}, \tag{83}
\]
using $k(x'_j, x_i) = \overline{k(x_i, x'_j)}$. The latter, however, does not depend on the particular expansion of $f$. Similarly, for $g$, note that
\[
\langle f, g \rangle = \sum_{i=1}^{m} \bar{\alpha}_i\, g(x_i). \tag{84}
\]
The last two equations also show that $\langle \cdot, \cdot \rangle$ is antilinear in the first argument and linear in the second one. It is symmetric, as $\langle f, g \rangle = \overline{\langle g, f \rangle}$. Moreover, given functions $f_1, \dots, f_n$ and coefficients $\gamma_1, \dots, \gamma_n \in \mathbb{C}$, we have
\[
\sum_{i,j=1}^{n} \bar{\gamma}_i \gamma_j \langle f_i, f_j \rangle = \left\langle \sum_i \gamma_i f_i, \; \sum_j \gamma_j f_j \right\rangle \geq 0, \tag{85}
\]
hence $\langle \cdot, \cdot \rangle$ is actually a positive definite kernel on our function space.

For the last step in proving that it even is a dot product, we will use the following interesting property of $\Phi$, which follows directly from the definition: for all functions (80), we have
\[
\langle k(\cdot, x), f \rangle = f(x), \tag{86}
\]
i.e. $k$ is the representer of evaluation. In particular,
\[
\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'). \tag{87}
\]
By virtue of these properties, positive kernels $k$ are also called reproducing kernels [3, 5, 33, 17].

By (86) and Proposition 5, we have
\[
|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle. \tag{88}
\]
Therefore, $\langle f, f \rangle = 0$ directly implies $f = 0$, which is the last property that was left to prove in order to establish that $\langle \cdot, \cdot \rangle$ is a dot product.

One can complete the space of functions (80) in the norm corresponding to the dot product, i.e. add the limit points of sequences that are convergent in that norm, and thus obtain a Hilbert space $H$, usually called a reproducing kernel Hilbert space. (A Hilbert space is defined as a complete dot product space; completeness means that all sequences in $H$ which are convergent in the norm corresponding to the dot product actually have their limits in $H$, too.)

The case of real-valued kernels is included in the above; in that case, $H$ can be chosen as a real Hilbert space.

9 Examples of Kernels

Besides (65), [8] and [28] suggest the usage of Gaussian radial basis function kernels [1],
\[
k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\,\sigma^2} \right), \tag{89}
\]
and sigmoid kernels,
\[
k(x, x') = \tanh\left( \kappa\,(x \cdot x') + \Theta \right). \tag{90}
\]
Note that all these kernels have the convenient property of unitary invariance, i.e. $k(x, x') = k(Ux, Ux')$ if $U^\top = U^{-1}$ (if we consider complex numbers, then $U^*$ instead of $U^\top$ has to be used).

The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies $k(x, x) = 1$ for all $x \in \mathcal{X}$, each mapped example has unit length, $\|\Phi(x)\| = 1$. In addition, as $k(x, x') > 0$ for all $x, x' \in \mathcal{X}$, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence
\[
\cos\left( \angle(\Phi(x), \Phi(x')) \right) = (\Phi(x) \cdot \Phi(x')) = k(x, x') > 0, \tag{91}
\]
which amounts to saying that the enclosed angle between any two mapped examples is smaller than $\pi/2$.
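A brief numerical illustration of these two properties: for the Gaussian kernel (89), the Gram matrix has unit diagonal and strictly positive entries, i.e. all mapped points have unit length and enclose pairwise angles below $\pi/2$.

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    # Gaussian RBF kernel (89)
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

xs = np.random.RandomState(2).randn(10, 4)
K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])

print(np.allclose(np.diag(K), 1.0))   # unit length of all mapped points
print(np.all(K > 0))                  # pairwise angles below pi/2: same orthant
```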

The examples given so far apply to the case of vectorial data. Let us at least give one example where $\mathcal{X}$ is not a vector space.

Example 6 (Similarity of probabilistic events) If $\mathcal{A}$ is a $\sigma$-algebra, and $P$ a probability measure on $\mathcal{A}$, then
\[
k(A, B) = P(A \cap B) - P(A)\,P(B) \tag{92}
\]
is a positive definite kernel.

Further examples include kernels for string matching, as proposed by [34, 12].

10 Representing Dissimilarities in Linear Spaces

We now move on to a larger class of kernels. It is interesting in several regards. First, it will turn out that some kernel algorithms work with this larger class of kernels, rather than only with positive definite kernels. Second, their relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.

The following definition differs from Definition 3 only in the additional constraint on the sum of the $c_i$.

Definition 7 (Conditionally positive matrix) A symmetric $m \times m$ matrix $K_{ij}$ ($m \geq 2$) satisfying
\[
\sum_{i,j=1}^{m} c_i \bar{c}_j K_{ij} \geq 0 \tag{93}
\]
for all $c_i \in \mathbb{C}$ with
\[
\sum_{i=1}^{m} c_i = 0 \tag{94}
\]
is called conditionally positive.

Definition 8 (Conditionally positive definite kernel) Let $\mathcal{X}$ be a nonempty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ which for all $m \geq 2$, $x_i \in \mathcal{X}$ gives rise to a conditionally positive Gram matrix is called a conditionally positive definite kernel.

The definitions for the real-valued case look exactly the same. Note that symmetry is required, also in the complex case. Due to the additional constraint on the coefficients $c_i$, it does not follow automatically anymore.

It is trivially true that whenever $k$ is positive definite, it is also conditionally positive definite. However, the latter is strictly weaker: if $k$ is conditionally positive definite, and $b \in \mathbb{C}$, then $k + b$ is also conditionally positive definite. To see this, simply apply the definition to get, for $\sum_i c_i = 0$,
\[
\sum_{i,j} c_i \bar{c}_j \left( k(x_i, x_j) + b \right) = \sum_{i,j} c_i \bar{c}_j\, k(x_i, x_j) + b \left| \sum_i c_i \right|^2 = \sum_{i,j} c_i \bar{c}_j\, k(x_i, x_j) \geq 0. \tag{95}
\]

A standard example of a conditionally positive definite kernel which is not positive definite is
\[
k(x, x') = -\|x - x'\|^2, \tag{96}
\]
where $x, x' \in \mathcal{X}$, and $\mathcal{X}$ is a dot product space. To see this, simply compute, for some pattern set $x_1, \dots, x_m$,
\[
\sum_{i,j} c_i c_j\, k(x_i, x_j) = -\sum_{i,j} c_i c_j \|x_i - x_j\|^2 \tag{97}
\]
\[
= -\sum_{i,j} c_i c_j \left( \|x_i\|^2 + \|x_j\|^2 - 2 (x_i \cdot x_j) \right)
= -\sum_i c_i \sum_j c_j \|x_j\|^2 - \sum_j c_j \sum_i c_i \|x_i\|^2 + 2 \sum_{i,j} c_i c_j (x_i \cdot x_j)
= 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) \geq 0, \tag{98}
\]
where the last line follows from (94) and the fact that $k(x, x') = (x \cdot x')$ is a positive definite kernel. Note that without (94), (97) can also be negative (e.g., put $c_1 = \ldots = c_m = 1$), hence the kernel is not a positive definite one.
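The computation (97)-(98) can be checked numerically; the sketch below shows that the quadratic form is nonnegative for coefficients satisfying (94), and can become negative without that constraint.

```python
import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(8, 3)
K = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # kernel (96)

c = rng.randn(8)
c -= c.mean()                 # enforce the constraint (94): coefficients sum to zero
print(c @ K @ c >= -1e-10)    # True: conditionally positive, cf. (97)-(98)

c_bad = np.ones(8)            # without (94) the form can go negative
print(c_bad @ K @ c_bad)      # a negative number
```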

Without proof, we add that in fact,
\[
k(x, x') = -\|x - x'\|^\beta \tag{99}
\]
is conditionally positive definite for $0 < \beta \leq 2$.

Let us consider the kernel (96), which can be considered the canonical conditionally positive kernel on a dot product space, and see how it is related to the dot product. Clearly, the distance induced by the norm is invariant under translations, i.e.
\[
\|x - x'\| = \|(x - x_0) - (x' - x_0)\| \tag{100}
\]
for all $x, x', x_0 \in \mathcal{X}$. In other words, even complete knowledge of $\|x - x'\|$ for all $x, x' \in \mathcal{X}$ would not completely determine the underlying dot product, the reason being that the dot product is not invariant under translations. Therefore, one needs to fix an origin $x_0$ when going from the distance measure to the dot product. To this end, we need to write the dot product of $x - x_0$ and $x' - x_0$ in terms of distances:
\[
\left( (x - x_0) \cdot (x' - x_0) \right) = (x \cdot x') + \|x_0\|^2 - (x \cdot x_0) - (x' \cdot x_0)
= \frac{1}{2} \left( -\|x - x'\|^2 + \|x - x_0\|^2 + \|x' - x_0\|^2 \right). \tag{101}
\]

By construction, this will always result in a positive definite kernel: the dot product is a positive definite kernel, and we have only translated the inputs. We have thus established the connection between (96) and a class of positive definite kernels corresponding to the dot product in different coordinate systems, related to each other by translations. In fact, a similar connection holds for a wide class of kernels:

Proposition 9 Let $x_0 \in \mathcal{X}$, and let $k$ be a symmetric kernel on $\mathcal{X} \times \mathcal{X}$, satisfying $k(x_0, x_0) \leq 0$. Then
\[
\tilde{k}(x, x') := k(x, x') - k(x, x_0) - k(x', x_0) \tag{102}
\]
is positive definite if and only if $k$ is conditionally positive definite.

This result can be generalized to $k(x_0, x_0) > 0$. In this case, we simply need to add $k(x_0, x_0)$ on the right hand side of (102). This is necessary, for otherwise, we would have $\tilde{k}(x_0, x_0) < 0$, contradicting (74). Without proof, we state that it is also sufficient.

Using this result, one can prove another interesting connection between the two classes of kernels:

Proposition 10 A kernel $k$ is conditionally positive definite if and only if $\exp(tk)$ is positive definite for all $t > 0$.

Positive definite kernels of the form $\exp(tk)$ ($t > 0$) have the interesting property that their $n$-th root ($n \in \mathbb{N}$) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into $\mathbb{R}_0^+$ is a conditionally positive definite kernel.

Conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the support vector machine: maximization of the margin of separation between two classes of data is independent of the origin's position. This can be seen from the dual optimization problem (36): the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$ projects out the same subspace as (94) in the definition of conditionally positive matrices [17, 23].

We have seen that positive definite kernels and conditionally positive definite kernels are closely related to each other. The former can be represented as dot products in Hilbert spaces. The latter, it turns out, essentially correspond to distance measures associated with norms in Hilbert spaces:

Proposition 11 Let $k$ be a real-valued conditionally positive definite kernel on $\mathcal{X}$, satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$. Then there exists a Hilbert space $H$ of real-valued functions on $\mathcal{X}$, and a mapping $\Phi : \mathcal{X} \to H$, such that
\[
k(x, x') = -\|\Phi(x) - \Phi(x')\|^2. \tag{103}
\]
There exist generalizations to the case where $k(x, x) \neq 0$ and where $k$ maps into $\mathbb{C}$. In these cases, the representation looks slightly more complicated.

The significance of this proposition is that using conditionally positive definite kernels, we can thus generalize all algorithms based on distances to corresponding algorithms operating in feature spaces. This is an analogue of the kernel trick for distances rather than dot products, i.e. dissimilarities rather than similarities.
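As a small illustration of this kernel trick for distances: for a positive definite kernel, squared feature-space distances can be evaluated as $\|\Phi(x) - \Phi(x')\|^2 = k(x, x) + k(x', x') - 2\,k(x, x')$, without ever computing $\Phi$ explicitly; the kernel below is an arbitrary choice.

```python
import numpy as np

def feature_space_distance(kernel, x, xp):
    """Squared distance ||Phi(x) - Phi(x')||^2 expressed through the kernel only."""
    return kernel(x, x) + kernel(xp, xp) - 2 * kernel(x, xp)

# with the Gaussian kernel (89), feature-space squared distances lie in [0, 2)
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2) / 2)
x, xp = np.array([0.0, 0.0]), np.array([3.0, 1.0])
print(feature_space_distance(rbf, x, xp))
```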

Acknowledgements. Thanks to A. Smola and R. Williamson for discussions, and to C. Watkins for pointing out, in his NIPS'99 SVM workshop talk, that distances and dot products differ in the way they deal with the origin.

References

[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[4] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54, Cambridge, MA, 1999. MIT Press.

[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.

[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

[7] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.

[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.

[9] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.

[10] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.

[12] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.

[13] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.

[14] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276-285, New York, 1997. IEEE.

[15] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.

[16] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.

[17] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin.

[18] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.

[19] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.

[20] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999.

[21] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[22] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.

[23] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.

[24] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.

[25] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.

[26] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

[27] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. English translation: Springer Verlag, New York, 1982.

[28] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[29] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[30] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[31] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.

[32] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[33] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.

[34] C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, 2000. MIT Press.

[35] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.