A Framework for Structural Risk Minimisation

John Shawe-Taylor, Computer Science Dept, Royal Holloway, Egham, TW20 0EX, UK. [email protected]
Peter L. Bartlett, Systems Engineering Dept, Australian National Univ, Canberra 0200, Australia. [email protected]
Robert C. Williamson, Engineering Dept, Australian National Univ, Canberra 0200, Australia. [email protected]
Martin Anthony, Mathematics Dept, London School of Economics, Houghton Street, London WC2A 2AE, UK. [email protected]

Abstract

The paper introduces a framework for studying structural risk minimisation. The model views structural risk minimisation in a PAC context. It then considers the more general case when the hierarchy of classes is chosen in response to the data. This theoretically explains the impressive performance of the maximal margin hyperplane algorithm of Vapnik. It may also provide a general technique for exploiting serendipitous simplicity in observed data to obtain better prediction accuracy from small training sets.

1 Introduction

The standard PAC model of learning considers a fixed hypothesis class H together with a required accuracy ε and confidence 1 − δ. The theory characterises when a target function from H can be learned from examples in terms of the Vapnik-Chervonenkis dimension, a measure of the flexibility of the class H, and specifies sample sizes required to deliver the required accuracy with the allowed confidence. In many cases of practical interest the precise class containing the target function to be learned may not be known in advance. The learner may only be given a hierarchy of classes

  H_1 ⊆ H_2 ⊆ ⋯ ⊆ H_d ⊆ ⋯

and be told that the target will lie in one of the sets H_d. Linial, Mansour and Rivest [5] studied learning in such a framework by allowing the learner to seek a consistent hypothesis in each subclass H_d in turn, drawing enough extra examples at each stage to ensure the correct level of accuracy and confidence should a consistent hypothesis be found.

This paper addresses two shortcomings of the above approach. The first is the requirement to draw extra examples when seeking in a richer class. It may be unrealistic to assume that examples can be obtained cheaply, and at the same time it would be foolish not to use as many examples as are available from the start. Hence, we suppose that a fixed number of examples is allowed and that the aim of the learner is to bound the expected generalisation error with high confidence. The second drawback of the Linial et al. approach is that it is not clear how it can be adapted to handle the case when errors are allowed on the training set. In this situation there is a need to trade off the number of errors against the complexity of the class, since taking a class which is too complex can result in a worse generalisation error (with a fixed number of examples) than allowing some extra errors in a more restricted class.

The model we consider will allow a precise bound on the error arising in different classes and hence a reliable way of applying the structural risk minimisation principle introduced by Vapnik [8, 10]. Indeed, the results reported in Sections 2 and 3 of this paper are implicit in the cited references, though we feel justified in presenting them in this framework as we believe they deserve greater attention and development. (We also make explicit some of the assumptions inherent in the presentations in [8, 10].) In Sections 4 and 5 we address a shortcoming of the SRM method which Vapnik [8] highlights: according to the SRM principle the structure has to be defined a priori, before the training data appear. A heuristic using maximal separation hyperplanes proposed by Vapnik and co-workers [3] violates this principle in that the hierarchy defined depends on the data. We introduce a framework which allows this dependency on the data and yet can still place rigorous bounds on the generalisation error. As an example, we show that the maximal margin hyperplane approach proposed by Vapnik falls into the framework, thus placing the approach on a firm theoretical foundation and solving a long standing open problem in statistical induction.

The approach can be interpreted as a way of encoding our bias, or prior assumptions, and possibly taking advantage of them if they happen to be correct. In the case of the fixed hierarchy, we expect the target (or a close approximation to it) to be in a class H_d with small d. In the maximal separation case, we expect the target to be consistent with some classifying hyperplane that has a large separation from the examples. This corresponds to a collusion between the probability distribution and the target concept, which would be impossible to exploit in the standard PAC distribution-independent framework. If these assumptions happen to be correct for the training data, we can be confident we have an accurate hypothesis from a small data set (at the expense of some small penalty if they are incorrect).

2 Basic Example — No Training Errors

As an initial example we consider a hierarchy of classes

  H_1 ⊆ H_2 ⊆ ⋯ ⊆ H_d ⊆ ⋯,

where we will assume VCdim(H_d) = d for the rest of this section. Such a hierarchy of classes is termed a decomposable concept class by Linial et al. [5]. We will assume that a fixed number m of labelled examples are given as a vector z = (x, t(x)) to the learner, where x = (x_1, …, x_m) and t(x) = (t(x_1), …, t(x_m)), and that the target function t lies in one of the subclasses H_d. The learner uses an algorithm to find a value of d for which H_d contains an hypothesis h that is consistent with the sample z. What we require is a function ε(m, d, δ) which will give the learner an upper bound on the generalisation error of h with confidence 1 − δ. The following theorem gives a suitable function. We use Er_z(h) := |{i : h(x_i) ≠ t(x_i)}| to denote the number of errors that h makes on z, and er_P(h) := P{x : h(x) ≠ t(x)} to denote the expected error when the x_1, …, x_m are drawn independently according to P.

Theorem 1  Let H_i be as above, and let p_d be any set of positive numbers satisfying Σ_{d=1}^∞ p_d = 1. With probability 1 − δ over m independent identically distributed examples, for any d for which a learner finds a consistent hypothesis h in H_d, the generalisation error of h is bounded by

  ε(m, d, δ) = (4/m) ( d ln(2em/d) + ln(1/p_d) + ln(4/δ) ).

Proof : The proof uses the standard bound on generalisation error for each of the classes but divides the confidence between the classes, giving proportion p_d to class H_d. Hence, we show that

  Pr{ z : ∃d, ∃h ∈ H_d, Er_z(h) = 0 and er_P(h) ≥ ε(m, d, δ) } ≤ δ

by showing that for all d

  Pr{ z : ∃h ∈ H_d, Er_z(h) = 0, er_P(h) ≥ ε(m, d, δ) } ≤ δ p_d.

The probability on the left of the inequality is, however, bounded with the usual analysis via Sauer's lemma by

  4 Π_{H_d}(2m) exp(−εm/4),

where Π_{H_d}(m) is the growth function: the maximum number of dichotomies implementable on m points by H_d. Substituting the given value of ε(m, d, δ) gives the required bound of δ p_d.  □

er and Erz 1 er p

The role of the numbers p_d may seem a little counterintuitive, as we appear to be able to bias our estimate by adjusting these parameters. The numbers must, however, be specified in advance and represent some apportionment of our confidence to the different points where failure might occur. In this sense they should be one of the arguments of the function ε(m, d, δ). We have deliberately omitted this dependence as they have a different status in the learning framework. (Vapnik [7] implicitly assumes p_i = 1/(i(i+1)), i ≥ 1.) The corresponding term

  (4/m) ln(1/p_d)

represents the overestimate of ε(m, d, δ) arising from our prior uncertainty about which class to use. If we have any information about which classes are more likely to arise we can use it to improve our bias. For example, if we know that class H_d will occur then we choose p_d = 1 and recover the standard PAC model. If on the other hand the probabilities of the function falling in the different classes are known to be q_d, the expected overestimate in ε(m, d, δ), or loss, will be

  L(q) = (4/m) Σ_{d≥1} q_d ln(1/p_d).

Hence, the difference between the loss of p and that obtained by using the values q as bias terms is

  L(p) − L(q) = (4/m) KL(q‖p),

where KL(q‖p) ≥ 0 is the Kullback-Leibler divergence between the two distributions. Hence, we have the following corollary.

Corollary 2  If we know the probabilities q_d of the target falling into the classes H_d, then we minimise the expected overestimate of the generalisation error by choosing p_d = q_d. In this case the expected amount of the overestimate is (4/m) H(q), where H(q) is the entropy of the distribution q.

3 Structural Risk Minimisation

Using the framework established in the previous section we now wish to consider the possibility of errors on the training sample.

We will make use of the following result of Vapnik in a slightly improved version of Anthony and Shawe-Taylor [2]. Note also that the result is expressed in terms of the quantity Er_z(h), which denotes the number of errors of the hypothesis h on the sample z, rather than the usual proportion of errors.

Theorem 3 ([2])  Let 0 < ε ≤ 1 and 0 < η ≤ 1. Suppose H is an hypothesis space of functions from an input space X to {0, 1}, and let P be any probability measure on S = X × {0, 1}. Then the probability (with respect to P^m) that for z ∈ S^m, there is some h ∈ H such that

  er_P(h) > ε and Er_z(h) ≤ (1 − η) ε m

is at most

  4 Π_H(2m) exp(−η² ε m / 4).

Our aim will be to use a double stratification of δ: as before by class (via p_d), and also by the number of errors on the sample (via q_dk). The generalisation error will be given as a function of the size of the sample m, the index of the class d, the number of errors on the sample k, and the confidence δ. In what follows we will often write Er_x(h) (rather than Er_z(h)) when the target t is obvious from the context.

Theorem 4  Let H_i, i = 1, 2, …, be defined as in Section 2 and let p_d, q_dk be any sets of positive numbers satisfying

  Σ_{d=1}^∞ p_d = 1

and Σ_{k=0}^m q_dk = 1 for all d. Then with probability 1 − δ over m independent identically distributed examples x, if the learner finds an hypothesis h in H_d with Er_x(h) = k, then the generalisation error of h is bounded by

  ε(m, d, k, δ) = (1/m) ( 2k + 4 ln(1/q_dk) + 4 ( d ln(2em/d) + ln(4/(p_d δ)) ) ).

Proof : Again we bound the required probability of failure

  Pr{ z : ∃d, k, ∃h ∈ H_d, Er_z(h) ≤ k and er_P(h) ≥ ε(m, d, k, δ) } ≤ δ

by showing that for all d and k

  Pr{ z : ∃h ∈ H_d, Er_z(h) ≤ k, er_P(h) ≥ ε(m, d, k, δ) } ≤ δ p_d q_dk.

We will apply Theorem 3 once for each value of k and d. We must therefore ensure that only one value of η_dk is used in each case. An appropriate value is

  η_dk = 1 − k/(ε(m, d, k, δ) m).

This ensures that if er_P(h) ≥ ε(m, d, k, δ) and Er_z(h) ≤ k, then

  Er_z(h) ≤ k = (1 − η_dk) ε(m, d, k, δ) m ≤ (1 − η_dk) er_P(h) m,

as required for an application of the theorem. Substitution of the appropriate parameters into the expression for the probability and then simplifying gives the required result, by ignoring one of the negative terms arising from squaring η_dk.  □

Note that the term ignored at the end of the proof could be used to improve the error bound, particularly for large k, hence reducing the required size of the q_dk in these cases. As we will see below, we may be content to choose q_dk small for these values, as we do not expect them to be useful in practice in any case; for small k the term ignored is only marginally less than 1.

The question of how to choose the prior q_dk for different k will again affect the resulting trade-off between complexity and accuracy. In view of our expectation that the penalty term for choosing a large class is probably an overestimate, it seems reasonable to give a correspondingly large penalty for large numbers of errors. One possibility is an exponentially decreasing prior distribution such as

  q_dk = 2^{−(k+1)},

though the rate of decrease could also be varied between the classes. Assuming the above choice, an incremental search for the optimal value of d would stop when the reduction in the number of classification errors in the next class was less than

  0.84 ln(2em/d).

Hence, for d = m we would need to reduce the number of errors by on average about 1.42 per increment of VC class. (Note that the tradeoff between errors on the sample and generalisation error is also discussed in [4].)
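The incremental search just described is straightforward to express in code. The sketch below (an illustration only) uses the reconstructed form of ε(m, d, k, δ) from Theorem 4 with the prior q_dk = 2^{−(k+1)} and the stopping threshold 0.84 ln(2em/d); the callable training_errors, returning the number of errors of the best hypothesis found in H_d, is hypothetical.

```python
import math

def srm_error_bound(m, d, k, delta, p_d):
    """Reconstructed bound of Theorem 4 with the exponential prior
    q_dk = 2**-(k+1) over the number of training errors k."""
    log_inv_q = (k + 1) * math.log(2.0)
    return (1.0 / m) * (2 * k
                        + 4 * log_inv_q
                        + 4 * d * math.log(2 * math.e * m / d)
                        + 4 * math.log(4.0 / (p_d * delta)))

def incremental_search(m, delta, training_errors, d_max):
    """Increase d while the drop in training errors justifies the extra
    complexity penalty; 0.84*ln(2em/d) is the break-even threshold
    derived in the text for the prior above."""
    prior = lambda d: 1.0 / (d * (d + 1))
    prev_k = training_errors(1)
    best_d, best = 1, srm_error_bound(m, 1, prev_k, delta, prior(1))
    for d in range(2, d_max + 1):
        k = training_errors(d)
        if prev_k - k < 0.84 * math.log(2 * math.e * m / d):
            break  # stop: too few errors removed to pay for the richer class
        bound = srm_error_bound(m, d, k, delta, prior(d))
        if bound < best:
            best_d, best = d, bound
        prev_k = k
    return best_d, best

if __name__ == "__main__":
    # Hypothetical error profile: training errors shrink as d grows.
    profile = {d: max(0, 200 // d - 2) for d in range(1, 51)}
    print(incremental_search(10000, 0.01, lambda d: profile[d], 50))
```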

4 A General Framework for Decomposing Classes

The standard PAC analysis gives bounds on generalisation error that are uniform over the hypothesis class. Decomposing the hypothesis class, as described in Section 2, allows us to bias our generalisation error bounds in favour of certain target functions and distributions: those for which some hypothesis low in the hierarchy is an accurate approximation. In this section, we introduce a more general framework which subsumes both the standard PAC model and the framework described in Section 2. This can be viewed as one way to allow the decomposition of the hypothesis class to be chosen based on the sample. This allows us to bias our generalisation error bounds in favour of more general classes of target functions and distributions, which might correspond to more realistic assumptions about practical learning problems.

It seems that in order to allow the decomposition of the hypothesis class to depend on the sample, we need to make better use of the information provided by the sample. Both the standard PAC analysis and structural risk minimisation with a fixed decomposition of the hypothesis class effectively discard the training examples, and only make use of the function Er_z defined on the hypothesis class that is induced by the training examples. The additional information we exploit in the case of sample-based decompositions of the hypothesis class is encapsulated in a luckiness function.

The main idea is to fix in advance some assumption about the target function and distribution, and to encode this assumption in a real-valued function defined on the space of training samples and hypotheses. The value of the function indicates the extent to which the assumption is satisfied for that sample and hypothesis. We call this mapping a luckiness function, since it reflects how fortunate we are that our assumption is satisfied. That is, we make use of a function

  L : X^m × H → R⁺

which measures the luckiness of a particular hypothesis with respect to the training examples. Sometimes it is convenient to express this relationship in an inverted way, as an unluckiness function,

  U : X^m × H → R⁺.

It turns out that only the ordering that the luckiness or unluckiness functions impose on hypotheses is important. Recall that h(x) = (h(x_1), …, h(x_m)) and let H_{|x} = {h_{|x} : h ∈ H}. We define the level of a function h relative to L and x by the function

  ℓ_x(h) = |{ g(x) : g ∈ H, L(x, g) ≥ L(x, h) }|,

or

  ℓ_x(h) = |{ g(x) : g ∈ H, U(x, g) ≤ U(x, h) }|.

Whether ℓ_x(h) is defined in terms of L or U is a matter of convenience; the quantity ℓ_x(h) itself plays the central role in what follows.

4.1 Examples

Example 5  Consider the hierarchy of classes introduced in Section 2 and define

  U(x, h) = min{ d : h ∈ H_d }.

Then it follows from Sauer's lemma that for any x we can bound ℓ_x(h) by

  ℓ_x(h) ≤ (em/d)^d,

where d = U(x, h). Notice also that for any y ∈ X^m,

  ℓ_xy(h) ≤ (2em/d)^d,

where xy = (x_1, …, x_m, y_1, …, y_m) and d = U(x, h). The last observation is something that will prove useful later, when we investigate how we can use luckiness on a sample to infer luckiness on a subsequent sample.

For the case of linear threshold functions in Euclidean space, Boser et al. [3] suggest that choosing the maximal margin hyperplane (i.e. the hyperplane which maximises the minimal distance of the points, assuming a correct classification can be made) will improve the generalisation of the resulting classifier. They give evidence to indicate that the generalisation performance is frequently significantly better than that predicted by the VC dimension of the full class of linear threshold functions. The following theorem gives circumstantial evidence to indicate why this might occur.

Consider a hyperplane defined by (w, θ), where w is a weight vector and θ a threshold value. Let X_0 be a subset of the Euclidean space that does not have a limit point on the hyperplane, so that

  min_{x ∈ X_0} |⟨w, x⟩ − θ| > 0.

We say that the hyperplane is in canonical form with respect to X_0 if

  min_{x ∈ X_0} |⟨w, x⟩ − θ| = 1.

Let ‖·‖ denote the Euclidean norm.

Theorem 6 (Vapnik [8])  Suppose X_0 is a subset of the input space contained in a ball of radius R about some point. Consider the set of hyperplanes in canonical form with respect to X_0 that satisfy ‖w‖ ≤ A, and let F be the class of corresponding linear threshold functions,

  f(x) = sgn(⟨w, x⟩ − θ).

Then the restriction of F to the points in X_0 has VC dimension bounded by

  min{ R²A², n } + 1.

We can analyse the performance of the maximal margin classifier in terms of a certain sample-based decomposition of the class of linear threshold functions. This decomposition favours target functions and distributions for which it is likely that, for a large set of random examples, some linear threshold function will correctly classify the set such that most examples will lie far from the separating hyperplane. In Section 5, we apply the results of this section to show that Vapnik's approach is indeed well founded, and in particular that the generalisation error of the maximal margin classifier is considerably smaller when this assumption is satisfied. We use the following unluckiness function.

Definition 7  If h is a linear threshold function with separating hyperplane defined by (w, θ), and (w, θ) is in canonical form with respect to an m-sample x, then define

  U(x, h) = min{ 256 (R² + 1)² A², 2(n + 1) },

where R = max_{1≤i≤m} ‖x_i‖ and A = ‖w‖.

By Theorem 6 the unluckiness function is a loose upper bound on the VC dimension of the set of dichotomies realisable with hyperplanes with at least as large a separation. Hence,

  ℓ_x(h) ≤ (em/d)^d,

where d = U(x, h).
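As a rough illustration (not part of the formal development), the unluckiness of Definition 7 can be computed directly from a sample and a separating hyperplane. The sketch below takes the reconstructed form of U given above at face value, treating the constant 256 as indicative only; the canonicalisation step rescales (w, θ) so that the closest sample point is at functional distance 1.

```python
import numpy as np

def canonicalise(w, theta, X):
    """Rescale (w, theta) so that min_i |<w, x_i> - theta| = 1, i.e. put
    the hyperplane in canonical form with respect to the sample X."""
    margins = np.abs(X @ w - theta)
    scale = margins.min()
    if scale == 0:
        raise ValueError("a sample point lies on the hyperplane")
    return w / scale, theta / scale

def margin_unluckiness(w, theta, X):
    """Unluckiness in the spirit of Definition 7:
    U(x, h) = min{256 (R^2 + 1)^2 A^2, 2(n + 1)},
    with R = max_i ||x_i|| and A = ||w|| for the canonical hyperplane."""
    n = X.shape[1]
    w_c, _ = canonicalise(w, theta, X)
    R = np.linalg.norm(X, axis=1).max()
    A = np.linalg.norm(w_c)  # A is the inverse of the geometric margin
    return min(256 * (R**2 + 1)**2 * A**2, 2 * (n + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    w, theta = rng.normal(size=50), 0.0
    print(margin_unluckiness(w, theta, X))
```

A large observed margin gives a small A and hence a small U, independently of the input dimension n.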

4.2 Probable Smoothness

Definition 8  An η-subsequence of a vector x is a vector x̃ obtained from x by deleting a fraction of at most η of its coordinates. We will also write x̃ ⊆_η x.

A luckiness function L(x, h) defined on a function class H is probably smooth with respect to functions φ(m, L, δ) and η(m, L, δ) if, for all targets t in H and for every distribution P,

  P^{2m}{ xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_η xy : ℓ_{x̃ỹ}(h) > φ(m, L(x, h), δ) } ≤ δ,

where η = η(m, L(x, h), δ).

The definition for probably smooth unluckiness is identical except that L's are replaced by U's. The intuition behind this rather arcane definition is that it captures when the luckiness can be estimated from the first half of the sample with high confidence. In particular, we need to ensure that few dichotomies are luckier than h on the double sample. That is, for a probably smooth luckiness function, if an hypothesis h has luckiness L on the first m points, we know that, with high confidence, for most (at least a proportion 1 − η(m, L, δ)) of the points in a double sample, the growth function for the class of functions that are at least as lucky as h is small (no more than φ(m, L, δ)).

Theorem 9  Suppose p_d, d = 1, 2, …, are positive numbers satisfying Σ_{i=1}^∞ p_i = 1, L is a luckiness function for a function class H that is probably smooth with respect to functions φ and η, and 0 < δ < 1/2. For any N ∈ N, any target function t ∈ H and any distribution P, with probability 1 − δ over m independent examples x chosen according to P, if for any i ≤ N a learner finds an hypothesis h in H with Er_x(h) = 0 and φ(m, L(x, h), p_i δ/4) ≤ 2^i, then the generalisation error of h satisfies er_P(h) ≤ ε(m, i, δ), where

  ε(m, i, δ) = (2/m) ( i + 1 + log₂(4/(p_i δ)) ) + 4 η(m, L(x, h), p_i δ/4) ( log₂(2m) + 1 ).

Proof : Using Chernoff bounds,

  P^m{ x : ∃h ∈ H, ∃i ≤ N, Er_x(h) = 0, φ(m, L(x, h), p_i δ/4) ≤ 2^i, er_P(h) ≥ ε(m, i, δ) }
    ≤ 2 P^{2m}{ xy : ∃h ∈ H, ∃i ≤ N, Er_x(h) = 0, φ(m, L(x, h), p_i δ/4) ≤ 2^i, Er_y(h) ≥ ε(m, i, δ) m/2 − 1 },

provided ε(m, i, δ) m ≥ 8, which follows from the definition of ε(m, i, δ) and the fact that δ < 1/2. Hence it suffices to show that P^{2m}(J_i) ≤ p_i δ/2 for each i ≤ N, where J_i is the event

  { xy : ∃h ∈ H, Er_x(h) = 0, φ(m, L(x, h), p_i δ/4) ≤ 2^i, Er_y(h) ≥ ε(m, i, δ) m/2 − 1 }.

Let S be the event

  { xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_{η_i} xy : ℓ_{x̃ỹ}(h) > 2^i },

with η_i = η(m, L(x, h), p_i δ/4). Since L is probably smooth, P^{2m}(S) ≤ p_i δ/4. It follows that

  P^{2m}(J_i) = P^{2m}(J_i ∩ S) + P^{2m}(J_i ∩ S̄) ≤ P^{2m}(S) + P^{2m}(J_i ∩ S̄).

It suffices then to show that P^{2m}(J_i ∩ S̄) ≤ p_i δ/4. But J_i ∩ S̄ is a subset of

  R = { xy : ∃h ∈ H, Er_x(h) = 0, ∃ x̃ỹ ⊆_{η_i} xy : ℓ_{x̃ỹ}(h) ≤ 2^i and Er_ỹ(h) ≥ ε(m, i, δ) m/2 − 1 − (|y| − |ỹ|) },

where |ỹ| denotes the length of the sequence ỹ.

Now, if we consider the uniform distribution U on the group of permutations on {1, 2, …, 2m} that swap elements i and m + i, we have

  P^{2m}(R) ≤ sup_{xy ∈ X^{2m}} U{ σ : σ(xy) ∈ R },

where σ(z) = (z_{σ(1)}, …, z_{σ(2m)}) for z ∈ X^{2m}. Fix xy ∈ X^{2m}. For a subsequence x̃ỹ ⊆ xy, we let σ(x̃ỹ) denote the corresponding subsequence of the permuted version of xy (and similarly for σ(x̃) and σ(ỹ)). Then

  U{ σ : σ(xy) ∈ R } ≤ Σ_{x̃ỹ ⊆_{η_i} xy} U{ σ : ∃h ∈ H, ℓ_{x̃ỹ}(h) ≤ 2^i, Er_{σ(x̃)}(h) = 0, Er_{σ(ỹ)}(h) ≥ ε(m, i, δ) m/2 − 1 − (|y| − |ỹ|) }.

For a fixed subsequence x̃ỹ ⊆_{η_i} xy, define the event inside the last sum as A. We can partition the group of permutations into a number of equivalence classes, so that, for all i, within each class all permutations map x̃ỹ to a fixed value unless x̃ỹ contains both x_i and y_i. Clearly, all equivalence classes have equal probability, so we have

  Pr(A) = Σ_C Pr(A | C) Pr(C) ≤ sup_C Pr(A | C),

where the sum and supremum are over equivalence classes C. But within an equivalence class, σ(x̃ỹ) is a permutation of x̃ỹ, so we can write

  Pr(A | C) ≤ sup_C Σ_h Pr{ Er_{σ(x̃)}(h) = 0 and Er_{σ(ỹ)}(h) ≥ ε(m, i, δ) m/2 − 1 − (|y| − |ỹ|) | C },   (1)

where the sum is over the subset of H_{|x̃ỹ} for which ℓ_{x̃ỹ}(h) ≤ 2^i. Clearly, this subset contains at most 2^i functions, and the probability in (1) is no more than

  2^{−(ε(m, i, δ) m/2 − 1 − 2 η_i m)}.

Combining these results, we have

  P^{2m}(J_i ∩ S̄) ≤ (2m)^{2 η_i m} 2^i 2^{−(ε(m, i, δ) m/2 − 1 − 2 η_i m)},

and this is no more than p_i δ/4 if

  ε(m, i, δ) ≥ (2/m) ( i + 1 + log₂(4/(p_i δ)) ) + 4 η_i ( log₂(2m) + 1 ).

The theorem follows.  □
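As an illustration (not part of the formal development), the bound of Theorem 9 is easy to evaluate once the level exponent i and the value of η(m, L(x, h), p_i δ/4) are known. The sketch below takes the reconstructed form of ε(m, i, δ) above at face value and instantiates it for Example 5, where η = 0.

```python
import math

def luckiness_bound(m, i, delta, p_i, eta):
    """Reconstructed bound of Theorem 9 for a consistent hypothesis whose
    luckiness level satisfies phi(m, L, p_i*delta/4) <= 2**i; eta is the
    value of the deletion-fraction function at the same arguments."""
    main = (2.0 / m) * (i + 1 + math.log2(4.0 / (p_i * delta)))
    slack = 4.0 * eta * (math.log2(2 * m) + 1)
    return main + slack

if __name__ == "__main__":
    m, delta = 20000, 0.01
    # Example 5: eta = 0 and i ~ d*log2(2em/d) for a class of VC dimension d.
    d = 20
    i = math.ceil(d * math.log2(2 * math.e * m / d))
    print(round(luckiness_bound(m, i, delta, 1.0 / (d * (d + 1)), eta=0.0), 4))
```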

5 Examples of Probable Smoothness

In this section, we consider three examples of luckiness functions and show that they are probably smooth. The first example (Example 5) is the simplest; in this case luckiness depends only on the hypothesis h and is independent of the examples x. In the second example, luckiness depends only on the examples, and is independent of the hypothesis. The third example allows us to predict the generalisation performance of the maximal margin classifier. In this case, luckiness clearly depends on both the examples and the hypothesis.

If we consider Example 5, the unluckiness function is clearly probably smooth if we choose

  φ(m, U, δ) = (2em/U)^U

and η(m, U, δ) = 0 for all m, U and δ. The bound on generalisation error that we obtain from Theorem 9 is almost identical to that given in Theorem 1.

The second example we consider involves examples lying on hyperplanes.

Definition 10  Define the unluckiness function U(x, h) for a linear threshold function h as U(x, h) = dim span(x), the dimension of the vector space spanned by the vectors x.

Proposition 11  Let H be the class of linear threshold functions defined on Rⁿ. The unluckiness function of Definition 10 is probably smooth with respect to

  φ(m, U, δ) = (2em/U)^U

and

  η(m, U, δ) = (4/m) ( U ln(2em/U) + 2 ln(U + 1) + ln(4/δ) ).

Proof : The recognition of a k-dimensional subspace is a learning problem for the indicator functions H_k of the subspaces. These have VC dimension k. Hence, applying the hierarchical approach of Theorem 1, taking p_k = 1/(k(k+1)), we obtain the given error bound for the number of examples in the second half of the sequence lying outside the subspace. Hence, with probability 1 − δ there will be an η-subsequence of points all lying in the given subspace. For this sequence the growth function is bounded by φ(m, U, δ).  □

The above example will be useful if we have a distribution which is highly concentrated on the subspace, with only a small probability of points lying outside it. We conjecture that it is possible to relax the assumption that the probability distribution is concentrated exactly on the subspace, to take advantage of a situation where it is concentrated around the subspace and the classifications are compatible with a perpendicular projection onto the space. This will also make use of both the data and the classification to decide the luckiness.

We turn for the rest of the paper to the example of the maximal margin hyperplanes and show that the unluckiness function defined in Definition 7 is probably smooth for appropriate (useful) choices of φ and η. We begin with some preliminary definitions and results.

Definition 12  Let F be a set of real valued functions. We say that a set of points X is γ-shattered by F if there are real values r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F satisfying

  f_b(x) ≥ r_x + γ if b_x = 1, and f_b(x) ≤ r_x − γ otherwise.

The fat shattering dimension Fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or to infinity otherwise.

Definition 13  Let F be a set of real valued functions. We say that a set of points X is level γ-shattered by F if it can be γ-shattered when choosing the constant r_x = r across all inputs x ∈ X. The level fat shattering dimension LFat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest level γ-shattered set, if this is finite, or to infinity otherwise.

Note that the second dimension is a scale sensitive version of a dimension introduced by Vapnik [7]. The scale sensitive version was first introduced by Alon et al. [1].

Lemma 14  Let F be the set of linear functions with unit weight vectors, restricted to points in a ball of radius R:

  F = { x ↦ ⟨w, x⟩ : ‖w‖ = 1 }.

Then the level fat shattering function can be bounded by

  LFat_F(γ) ≤ min{ R²/γ², n } + 1.

Proof : If a set of points X is to be level γ-shattered there must be a value r such that each dichotomy b can be realised with a weight vector w_b such that

  ⟨w_b, x⟩ ≥ r + γ if b_x = 1, and ⟨w_b, x⟩ ≤ r − γ otherwise.

Let d_γ = min_{x ∈ X} |⟨w_b, x⟩ − r| ≥ γ. Consider the hyperplane defined by (w_b/d_γ, r/d_γ). It is in canonical form with respect to the points X, satisfies ‖w_b/d_γ‖ = 1/d_γ ≤ 1/γ, and realises the given dichotomy. Hence, the set of points X can be shattered by a subset of canonical hyperplanes satisfying ‖w‖ ≤ 1/γ. The result follows from Theorem 6.  □

Corollary 15  Let F be the set of linear functions with unit weight vectors, restricted to points in a ball of radius R about the origin. The fat shattering function of F can be bounded by

  Fat_F(γ) ≤ min{ 4R²/γ², n + 1 } + 1.

Proof : Suppose m points lying in a ball of radius R about the origin are γ-shattered with corresponding output values r_x. Clearly, |r_x| ≤ R. We extend the points into one extra dimension, adding the value r_x for the point x. That is, x = (x_1, …, x_n) becomes (x_1, …, x_n, r_x). Since |r_x| ≤ R, the norm of the new vectors is bounded by √2 R. By extending each of the weight vectors realising the fat shattering of the points with the coordinate −1, we see that the points are level γ-shattered at level 0. The norm of the extended weight vectors is √2, so to obtain unit weight vectors we must divide them by this amount. If we multiply the m vectors by the same amount, the effect is to leave the outputs unchanged. Hence, we have generated a set of m points in (n+1)-dimensional space lying in a ball of radius 2R, which are level γ-shattered by linear functions with unit weight vectors. By the lemma, we have

  m ≤ min{ 4R²/γ², n + 1 } + 1.  □

Before we can quote the next lemma, we need another definition.

Definition 16  Let (X, d) be a (pseudo-)metric space, let A be a subset of X and ε > 0. A set B ⊆ A is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) ≤ ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞).

The idea is that B should be finite and allow finite arguments to be used while covering the full space A. In our case we will use the l_∞ distance over a finite sample x = (x_1, …, x_m) for the pseudo-metric in the space of functions,

  d_x(f, g) = max_i |f(x_i) − g(x_i)|.

We write N(ε, F, x) for the ε-covering number of F with respect to the pseudo-metric d_x.

We now quote a lemma from Alon et al. [1] which we will need in the proof of the next proposition.

Lemma 17 (Alon et al. [1])  Let F be a class of functions from X to [0, 1] and P a distribution over X. Choose 0 < ε < 1 and let d = Fat_F(ε/4). Then

  E( N(ε, F, x) ) ≤ 2 ( 4m/ε² )^{ d log₂(2em/(dε)) },

where the expectation E is taken w.r.t. a sample x ∈ X^m drawn according to P.

Corollary 18  Let F be a class of functions from X to [a, b] and P a distribution over X. Choose 0 < ε < 1 and let d = Fat_F(ε/4). Then

  E( N(ε, F, x) ) ≤ 2 ( 4m(b−a)²/ε² )^{ d log₂(2em(b−a)/(dε)) },

where the expectation E is over samples x ∈ X^m drawn according to P.

Proof : We first scale all the functions in F by the affine transformation mapping the interval [a, b] to [0, 1], to create the set of functions F̄. Clearly, Fat_F̄(γ) = Fat_F((b−a)γ), while

  E( N(ε, F, x) ) = E( N(ε/(b−a), F̄, x) ).

The result follows.  □

Proposition 19  Suppose F is a set of functions that map from X to [a, b] with finite fat-shattering dimension. Then for any distribution P on X,

  P^{2m}{ xy : ∃f ∈ F, |{ i : |f(y_i)| < (1/2) min_{1≤j≤m} |f(x_j)| }| > m ε(m, f) } ≤ δ,

where, writing γ = (1/4) min_{1≤j≤m} |f(x_j)| and k = Fat_F(γ/2),

  ε(m, f) = (2/m) ( k log₂( 8em(b−a)/(kγ) ) log₂( 32m(b−a)²/γ² ) + log₂(4/δ) ).

Proof : The required probability is no more than

  Σ_{i ∈ N} P^{2m}{ xy : ∃f ∈ F, γ_i ≤ (1/4) min_j |f(x_j)| < 2γ_i, |{ i' : |f(y_{i'})| < (1/2) min_j |f(x_j)| }| > m ε(m, f) },

where γ_i = (b − a) 2^{−i}. It suffices to show that the i'th term in this sum is no more than δ 2^{−i}.

Fix i and let γ = γ_i and k_i = Fat_F(γ_i/2). Using the standard permutation argument, we may fix a sequence xy and bound the probability, under the uniform distribution on swapping permutations, that the permuted sequence satisfies the condition stated. Consider a minimal γ_i/2-cover B of F in the standard pseudo-metric with respect to the points xy. Let

  r = min{ |f(x_j)| : x_j ∈ x }.

If we now pick any f ∈ F, there exists f̂ ∈ B with |f̂(x_j) − f(x_j)| ≤ γ_i/2 for all x_j ∈ xy. Hence, for x_j ∈ x, |f̂(x_j)| ≥ r − γ_i/2, and there are at least m ε(m, f) points y_j which satisfy |f̂(y_j)| ≤ r/2 + γ_i/2 < r − γ_i/2. By the permutation argument, at most a proportion 2^{−m ε(m, f)/2} of the sequences obtained by swapping corresponding points satisfy these conditions. Hence, by Corollary 18 we may bound the probability of the event occurring by

  2 E( N(γ_i/2, F, xy) ) 2^{−m ε(m, f)/2} ≤ 4 ( 32m(b−a)²/γ_i² )^{ k_i log₂(8em(b−a)/(k_i γ_i)) } 2^{−m ε(m, f)/2}.

Now, this is no more than δ 2^{−i} if

  ε(m, f) ≥ (2/m) ( k_i log₂(8em(b−a)/(k_i γ_i)) log₂(32m(b−a)²/γ_i²) + log₂(4 · 2^i/δ) ),

and so we choose ε(m, f) as in the proposition.  □

We are now in a position to state the result concerning maximal margin hyperplanes.

Proposition 20  The unluckiness function of Definition 7 is probably smooth with

  φ(m, U, δ) = (2em/U)^U

and

  η(m, U, δ) = (1/m) ( 8 ln(2em) + 8 ln(12/δ) + 2U log₂(8em/√U) log₂(32mU) ).

Proof : Notice first that, for a set X_0 of points and a hyperplane (w, θ) in canonical form with respect to X_0, the distance from the plane to the closest point in X_0 is

  min_{x ∈ X_0} |⟨w, x⟩ − θ| / ‖w‖ = 1/‖w‖.

We can therefore interchange the quantity A = ‖w‖ in Definition 7 with the inverse of this minimum distance.

We will show that

  P^{2m}{ xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_η xy : max_i ‖ỹ_i‖ > max_i ‖x_i‖ or min_i |⟨w_h, ỹ_i⟩ − θ_h| < (1/2) min_i |⟨w_h, x_i⟩ − θ_h| } ≤ δ,   (2)

where w_h and θ_h are such that h(x) = sgn(⟨w_h, x⟩ − θ_h) and ‖w_h‖ = 1. It follows from this that

  P^{2m}{ xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_η xy : ℓ_{x̃ỹ}(h) > φ(m, U(x, h), δ) } ≤ δ,

since on any η-subsequence that stays inside the ball of radius R = max_i ‖x_i‖ and retains at least half of the separation observed on x, the dichotomies realisable with at least this separation form a class of VC dimension at most U(x, h) by Theorem 6, so their number is at most φ(m, U(x, h), δ) = (2em/U)^U by Sauer's lemma.

Now, define the events

  E = { xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_η xy : max_i ‖ỹ_i‖ > max_i ‖x_i‖ or min_i |⟨w_h, ỹ_i⟩ − θ_h| < (1/2) min_i |⟨w_h, x_i⟩ − θ_h| }

and

  S = { xy : ∀ x̃ỹ ⊆_{η₁} xy : max_i ‖ỹ_i‖ > max_i ‖x_i‖ },

for some η₁ ≤ η. Then

  Pr(E) ≤ Pr(E | S̄) + Pr(S).

The second term is the probability that an excessive number of points in the second half sample lie outside the minimal ball containing the points of the first half sample. This is the probability of failure in learning a class with VC dimension 1, and hence it can be bounded by δ/3 if

  η₁ = (1/m) ( ln(2em) + ln(12/δ) ).

A similar argument allows us to neglect linear threshold functions defined by (w, θ) with ‖w‖ = 1 and |θ| > R, where R = max_i ‖x_i‖ (at the expense of another δ/3 and another proportion η₁ of the second sample).

To estimate the first probability, we can restrict our attention to the subsequence ỹ of m(1 − 2η₁) points which lie in a ball of radius R. Hence, the conditional probability is no more than

  P^{2m}{ xy : ∃h ∈ H, Er_x(h) = 0, ∀ x̃ỹ ⊆_{η₂} xy : min_i |⟨w_h, ỹ_i⟩ − θ_h| < (1/2) min_i |⟨w_h, x_i⟩ − θ_h| },   (3)

where η₂ = 2(η − η₁)/(1 − 2η₁).

Consider the class

  F_R = { x ↦ ⟨w, x⟩ − θ : ‖w‖ = 1, |θ| ≤ R }.

By augmenting the input vector with an extra (constant) component, we can transform F_R into a linear class. Corollary 15 shows that

  Fat_{F_R}(γ) ≤ min{ 4(R² + 1)²/γ², n + 2 } + 1.

Notice that functions in F_R map from points in a radius-R ball about the origin to [−2R, 2R]. Let d_h = min_i |⟨w_h, x_i⟩ − θ_h|. If we substitute

  γ := d_h/4,  k := Fat_{F_R}(γ/2) ≤ U,  b − a := 4R,  δ := δ/3

in Proposition 19, we have that the probability (3) is less than δ/3 if η₂(1 − 2η₁) is at least

  (2/m) ( U log₂(128emR/(U d_h)) log₂(8192 m R²/d_h²) + log₂(12/δ) ),

and so, since 256 R²/d_h² ≤ U implies R/d_h ≤ √U/16, it will suffice if we set η(m, U, δ) equal to

  (1/m) ( 8 ln(2em) + 8 ln(12/δ) + 2U log₂(8em/√U) log₂(32mU) ).  □

Combining the results of Theorem 9 and Proposition 20 gives the following corollary.

Corollary 21  Suppose p_d, d = 1, 2, …, are positive numbers satisfying Σ_{i=1}^∞ p_i = 1. Suppose 0 < δ < 1/2, t ∈ H, and P is a probability distribution on X. Then with probability 1 − δ over m independent examples x chosen according to P, if a learner finds an hypothesis h that satisfies Er_x(h) = 0, then the generalisation error of h is no more than

  (2/m) ( U log₂(2em/U) + 1 + log₂(4/(p_i δ)) )
    + (4/m) ( 8 ln(2em) + 8 ln(48/(p_i δ)) + 2U log₂(8em/√U) log₂(32mU) ) ( log₂(2m) + 1 )
  = O( (1/m) ( U log(em/U) log(mU) log m + log(1/(p_i δ)) log m ) ),

where U = U(x, h) for the unluckiness function of Definition 7, and i = ⌈ U log₂(2em/U) ⌉.

It seems likely that the bound in Corollary 21 can be improved. Nevertheless, the improvement over the standard PAC bounds can already be considerable, because the difference between the dimension of the space and the VC dimension of the set of hypotheses with similar margins (that is, the value of U) can be very large. Vapnik [8] cites an example where the space is 10^13 dimensional while the effective VC dimension is approximately 10^3.
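A rough numerical comparison, in the spirit of that example but with smaller hypothetical values of m, n and U, makes the point concrete: the leading complexity term of the margin-based bound depends on U, while the corresponding term of the standard VC bound for linear threshold functions depends on the input dimension n.

```python
import math

def margin_term(m, U):
    """Leading term of the margin-based bound (Corollary 21, as
    reconstructed above): roughly (2/m) * U * log2(2em/U)."""
    return (2.0 / m) * U * math.log2(2 * math.e * m / U)

def dimension_term(m, n):
    """Corresponding term of the standard VC bound for linear threshold
    functions in n dimensions (VC dimension n + 1)."""
    d = n + 1
    return (2.0 / m) * d * math.log2(2 * math.e * m / d)

if __name__ == "__main__":
    m = 10**7
    # Huge input dimension, but a large observed margin keeps U small.
    print("dimension-based:", round(dimension_term(m, n=10**6), 4))
    print("margin-based:   ", round(margin_term(m, U=10**3), 4))
```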

Acknowledgements

This work was supported by the Australian Research Council.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi and D. Haussler, "Scale-sensitive Dimensions, Uniform Convergence, and Learnability," in Proceedings of the Conference on Foundations of Computer Science (FOCS), 1993. Also to appear in Journal of the ACM.
[2] Martin Anthony and John Shawe-Taylor, "A Result of Vapnik with Applications," Discrete Applied Mathematics, 47, 207-217, (1993).
[3] B. Boser, I. Guyon and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," pages 144-152 in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, (1992).
[4] Corinna Cortes and Vladimir Vapnik, "Support-Vector Networks," Machine Learning, 20, 273-297, (1995).
[5] Nathan Linial, Yishay Mansour and Ronald L. Rivest, "Results on Learnability and the Vapnik-Chervonenkis Dimension," Information and Computation, 90, 33-49, (1991).
[6] D. Pollard, Convergence of Stochastic Processes, Springer, New York, 1984.
[7] Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[8] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[9] V. N. Vapnik and A. Ja. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to their Probabilities," Theory of Probability and its Applications, 16, 264-280, (1971).
[10] V. N. Vapnik and A. Ja. Chervonenkis, "Ordered Risk Minimization (I and II)," Automation and Remote Control, 34, 1226-1235 and 1403-1412, (1974).