An Experimental and Theoretical Comparison of Model Selection Methods

Michael Kearns (AT&T Bell Laboratories, Murray Hill, New Jersey), Yishay Mansour*† (Tel Aviv University, Tel Aviv, Israel), Andrew Y. Ng* (Carnegie Mellon University, Pittsburgh, Pennsylvania), Dana Ron*‡ (Hebrew University, Jerusalem, Israel)

1 Introduction

In the model selection problem, we must balance the complexity of a hypothesis with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. Instances of the model selection problem include choosing the best number of hidden nodes in a neural network, determining the right amount of pruning to be performed on a decision tree, and choosing the degree of a polynomial fit to a set of points. In each of these cases, the goal is not to minimize the error on the training data, but to minimize the resulting generalization error.

Many model selection algorithms have been proposed in the literature of several different research communities, too many to productively survey here. (A more detailed history of the problem will be given in the full paper.) Perhaps surprisingly, despite the many proposed solutions for model selection and the diverse methods of analysis, direct comparisons between the different proposals (either experimental or theoretical) are rare.

The goal of this paper is to provide such a comparison, and more importantly, to describe the general conclusions to which it has led. Relying on evidence that is divided between controlled experimental results and related formal analysis, we compare three well-known model selection algorithms. We attempt to identify their relative and absolute strengths and weaknesses, and we provide some general methods for analyzing the behavior and performance of model selection algorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model selection algorithm (perhaps based in part on some known properties of the model selection problem being confronted).

The summary of the paper follows. In Section 2, we provide a formalization of the model selection problem. In this formalization, we isolate the problem of choosing the appropriate complexity for a hypothesis or model. We also introduce the specific model selection problem that will be the basis for our experimental results, and describe an initial experiment demonstrating that the problem is nontrivial. In Section 3, we introduce the three model selection algorithms we examine in the experiments: Vapnik's Guaranteed Risk Minimization (GRM) [11], an instantiation of Rissanen's Minimum Description Length Principle (MDL) [7], and Cross Validation (CV).

Section 4 describes our controlled experimental comparison of the three algorithms. Using artificially generated data from a known target function allows us to plot complete learning curves for the three algorithms over a wide range of sample sizes, and to directly compare the resulting generalization error to the hypothesis complexity selected by each algorithm. It also allows us to investigate the effects of varying other natural parameters of the problem, such as the amount of noise in the data. These experiments support the following assertions: the behavior of the algorithms examined can be complex and incomparable, even on simple problems, and there are fundamental difficulties in identifying a "best" algorithm; there is a strong connection between hypothesis complexity and generalization error; and it may be impossible to uniformly improve the performance of the algorithms by slight modifications (such as introducing constant multipliers on the complexity penalty terms).

In Sections 5, 6 and 7 we turn our efforts to formal results providing explanation and support for the experimental findings. We begin in Section 5 by upper bounding the error of any model selection algorithm falling into a wide class (called penalty-based algorithms) that includes both GRM and MDL (but not cross validation). The form of this bound highlights the competing desires for powerful hypotheses and controlled complexity. In Section 6, we upper bound the additional error suffered by cross validation compared to any other model selection algorithm. The quality of this bound depends on the extent to which the function classes have learning curves obeying a classical power law. Finally, in Section 7, we give an impossibility result demonstrating a fundamental handicap suffered by the entire class of penalty-based algorithms that does not afflict cross validation. In Section 8, we weigh the evidence and find that it provides concrete arguments favoring the use of cross validation (or at least cause for caution in using any penalty-based algorithm).

* This research was done while Y. Mansour, A. Ng and D. Ron were visiting AT&T Bell Laboratories.
† Supported in part by The Israel Science Foundation, administered by The Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology.
‡ Supported by the Eshkol Fellowship, sponsored by the Israeli Ministry of Science.

2 Definitions

Throughout the paper we assume that a fixed boolean target function $f$ is used to label inputs drawn randomly according to a fixed distribution $D$. For any boolean function $h$, we define the generalization error $\epsilon(h) = \epsilon_{f,D}(h) = \Pr_{x \in D}[h(x) \neq f(x)]$. We use $S$ to denote the random variable $S = \langle x_1, b_1 \rangle, \ldots, \langle x_m, b_m \rangle$, where $m$ is the sample size, each $x_i$ is drawn randomly and independently according to $D$, and $b_i = f(x_i) \oplus c_i$, where the noise bit $c_i \in \{0,1\}$ is 1 with probability $\eta$; we call $\eta \in [0, 1/2)$ the noise rate. In the case that $\eta \neq 0$, we will sometimes wish to discuss the generalization error of $h$ with respect to the noisy examples, so we define $\epsilon^{\eta}(h) = \Pr_{x \in D, c}[h(x) \neq f(x) \oplus c]$, where $c$ is the noise bit. Note that $\epsilon(h)$ and $\epsilon^{\eta}(h)$ are related by the equality $\epsilon^{\eta}(h) = (1-\eta)\epsilon(h) + \eta(1 - \epsilon(h)) = (1 - 2\eta)\epsilon(h) + \eta$. For simplicity, we will use the expression "with high probability" to mean with probability $1 - \delta$ over the draw of $S$, at a cost of a factor of $\log(1/\delta)$ in the bounds; thus, our bounds all contain "hidden" logarithmic factors, but our handling of confidence is entirely standard and will be spelled out in the full paper.

We assume a nested sequence of hypothesis classes (or models) $F_1 \subseteq \cdots \subseteq F_d \subseteq \cdots$. The target function $f$ may or may not be contained in any of these classes, so we define $h_d = \mathrm{argmin}_{h \in F_d}\{\epsilon(h)\}$ and $\epsilon_{opt}(d) = \epsilon(h_d)$ (similarly, $\epsilon^{\eta}_{opt}(d) = \epsilon^{\eta}(h_d)$). Thus, $h_d$ is the best approximation to $f$ (with respect to $D$) in the class $F_d$, and $\epsilon_{opt}(d)$ measures the quality of this approximation. Note that $\epsilon_{opt}(d)$ is a non-increasing function of $d$ since the hypothesis function classes are nested; larger values of $d$ can only improve the potential approximative power of the hypothesis class. Of course, the difficulty is to realize this potential on the basis of a small sample.

With this notation, the model selection problem can be stated informally: on the basis of a random sample $S$ of a fixed size $m$, the goal is to choose a hypothesis complexity $\tilde{d}$, and a hypothesis $\tilde{h} \in F_{\tilde{d}}$, such that the resulting generalization error $\epsilon(\tilde{h})$ is minimized. In many treatments of model selection, including ours, it is explicitly or implicitly assumed that the model selection algorithm has control only over the choice of the complexity $\tilde{d}$, but not over the choice of the final hypothesis $\tilde{h} \in F_{\tilde{d}}$. It is assumed that there is a fixed algorithm that chooses a set of candidate hypotheses, one from each hypothesis class. Given this set of candidate hypotheses, the model selection algorithm then chooses one of the candidates as the final hypothesis.

To make these ideas more precise, we define the training error $\hat{\epsilon}(h) = \hat{\epsilon}_S(h) = |\{\langle x_i, b_i \rangle \in S : h(x_i) \neq b_i\}|/m$, and the version space $VS(d) = VS_S(d) = \{h \in F_d : \hat{\epsilon}(h) = \min_{h' \in F_d}\{\hat{\epsilon}(h')\}\}$. Note that $VS(d)$ may contain more than one function in $F_d$; several functions may minimize the training error. If we are lucky, we have in our possession a (possibly randomized) learning algorithm $L$ that takes as input any sample $S$ and any complexity value $d$, and outputs a member $\tilde{h}_d$ of $VS(d)$ (using some unspecified criterion to break ties if $|VS(d)| > 1$). More generally, it may be the case that finding any function in $VS(d)$ is intractable, and that $L$ is simply a heuristic (such as backpropagation or ID3) that does the best job it can at finding $\tilde{h}_d \in F_d$ with small training error on input $S$ and $d$. In either case, we define $\tilde{h}_d = L(S, d)$ and $\hat{\epsilon}(d) = \hat{\epsilon}_{L,S}(d) = \hat{\epsilon}(\tilde{h}_d)$. Note that we expect $\hat{\epsilon}(d)$, like $\epsilon_{opt}(d)$, to be a non-increasing function of $d$; by going to a larger complexity, we can only reduce our training error.

We can now give a precise statement of the model selection problem. First of all, an instance of the model selection problem consists of a tuple $(\{F_d\}, f, D, L)$, where $\{F_d\}$ is the hypothesis function class sequence, $f$ is the target function, $D$ is the input distribution, and $L$ is the underlying learning algorithm. The model selection problem is then: Given the sample $S$, and the sequence of functions $\tilde{h}_1 = L(S,1), \ldots, \tilde{h}_d = L(S,d), \ldots$ determined by the learning algorithm $L$, select a complexity value $\tilde{d}$ such that $\tilde{h}_{\tilde{d}}$ minimizes the resulting generalization error. Thus, a model selection algorithm is given both the sample $S$ and the sequence of (increasingly complex) hypotheses derived by $L$ from $S$, and must choose one of these hypotheses.

The current formalization suffices to motivate a key definition and a discussion of the fundamental issues in model selection. We define $\epsilon(d) = \epsilon_{L,S}(d) = \epsilon(\tilde{h}_d)$. Thus, $\epsilon(d)$ is a random variable (determined by the random variable $S$) that gives the true generalization error of the function $\tilde{h}_d$ chosen by $L$ from the class $F_d$. Of course, $\epsilon(d)$ is not directly accessible to a model selection algorithm; it can only be estimated or guessed in various ways from the sample $S$. A simple but important observation is that no model selection algorithm can achieve generalization error less than $\min_d\{\epsilon(d)\}$. Thus the behavior of the function $\epsilon(d)$, especially the location and value of its minimum, is in some sense the essential quantity of interest in model selection.

The prevailing folk wisdom in several research communities posits that $\epsilon(d)$ will typically have a global minimum that is nontrivial, that is, at an "intermediate" value of $d$ away from the extremes $d = 0$ and $d = m$. As a demonstration of the validity of this view, and as an introduction to a particular model selection problem that we will examine in our experiments, we call the reader's attention to Figure 1. In this model selection problem (which we shall refer to as the intervals model selection problem), the input domain is simply the real line segment $[0,1]$, and the hypothesis class $F_d$ is simply the class of all boolean functions over $[0,1]$ in which we allow at most $d$ alternations of label; thus $F_d$ is the class of all binary step functions with at most $d/2$ steps. For the experiments, the underlying learning algorithm $L$ that we have implemented performs training error minimization. This is a rare case where efficient minimization is possible; we have developed an algorithm based on dynamic programming that runs in linear time, thus making experiments on large samples feasible. The sample $S$ was generated using the target function in $F_{100}$ that divides $[0,1]$ into 100 segments of equal width $1/100$ and alternating label. In Figure 1 we plot $\epsilon(d)$ (which we can calculate exactly, since we have chosen the target function) when $S$ consists of $m = 2000$ random examples (drawn from the uniform input distribution) corrupted by noise at the rate $\eta = 0.2$. For our current discussion it suffices to note that $\epsilon(d)$ does indeed experience a nontrivial minimum. Not surprisingly, this minimum occurs near (but not exactly at) the target complexity of 100.
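To make the intervals problem concrete, here is a minimal sketch of sample generation and training error minimization for it. The function names are our own, and the simple $O(md)$ dynamic program is a stand-in for illustration only; the linear-time algorithm mentioned above is not described in this version of the paper.

```python
import numpy as np

def make_sample(m, s=100, eta=0.2, seed=0):
    """Draw m uniform points on [0,1], label by the s-segment alternating
    target function, then flip each label independently with probability eta."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.random(m))
    f = np.floor(x * s).astype(int) % 2            # alternating label on s equal segments
    flip = rng.random(m) < eta
    b = np.where(flip, 1 - f, f)
    return x, b

def training_error_curve(b, d_max):
    """Return eps_hat(d) for d = 0..d_max: the minimum fraction of training
    mistakes over F_d (hypotheses with at most d label alternations).
    cost[j, v] = fewest mistakes on the points processed so far, given the
    hypothesis currently outputs v and has used j alternations."""
    m = len(b)
    INF = m + 1
    cost = np.full((d_max + 1, 2), INF, dtype=int)
    cost[0, :] = 0                                 # no points seen, no alternations used
    for bi in b:                                   # points must be in sorted-x order
        new = np.full_like(cost, INF)
        for j in range(d_max + 1):
            for v in (0, 1):
                stay = cost[j, v]                  # keep the current output value
                flip = cost[j - 1, 1 - v] if j > 0 else INF   # spend one alternation
                new[j, v] = min(stay, flip) + (bi != v)
        cost = new
    best_exact = cost.min(axis=1)                  # best using exactly j alternations
    return np.minimum.accumulate(best_exact) / m   # "at most d": non-increasing in d

x, b = make_sample(2000)
eps_hat = training_error_curve(b, d_max=300)       # eps_hat[d] is non-increasing in d
```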

3 Three Algorithms for Model Selection

The first two model selection algorithms we consider are members of a general class that we shall informally refer to as penalty-based algorithms (and shall formally define shortly). The common theme behind these algorithms is their attempt to construct an approximation to $\epsilon(d)$ solely on the basis of the training error $\hat{\epsilon}(d)$ and the complexity $d$, often by trying to "correct" $\hat{\epsilon}(d)$ by the amount that it underestimates $\epsilon(d)$ through the addition of a "complexity penalty" term.

In Vapnik's Guaranteed Risk Minimization (GRM) [11], $\tilde{d}$ is chosen according to the rule

$$\tilde{d} = \mathrm{argmin}_d\left\{ \hat{\epsilon}(d) + \frac{d}{m}\left(1 + \sqrt{1 + \frac{\hat{\epsilon}(d)\, m}{d}}\right) \right\} \quad (1)$$

where for convenience but without loss of generality we have assumed that $d$ is the Vapnik-Chervonenkis dimension [11, 12] of the class $F_d$; this assumption holds in the intervals model selection problem. The origin of this rule can be summarized as follows: it has been shown [11] (ignoring logarithmic factors) that for every $d$ and for every $h \in F_d$, $\sqrt{d/m}$ is an upper bound on $|\hat{\epsilon}(h) - \epsilon(h)|$, and hence $|\hat{\epsilon}(d) - \epsilon(d)| \leq \sqrt{d/m}$. Thus, by simply adding $\sqrt{d/m}$ to $\hat{\epsilon}(d)$, we ensure that the resulting sum upper bounds $\epsilon(d)$, and if we are optimistic we might further hope that the sum is in fact a close approximation to $\epsilon(d)$, and that its minimization is therefore tantamount to the minimization of $\epsilon(d)$. The actual rule given in Equation (1) is slightly more complex than this, and reflects a refined bound on $|\hat{\epsilon}(d) - \epsilon(d)|$ that varies from $d/m$ for $\hat{\epsilon}(d)$ close to 0 to $\sqrt{d/m}$ otherwise.

[Footnote 1: Vapnik's original GRM actually multiplies the second term inside the argmin$\{\cdot\}$ above by a logarithmic factor intended to guard against worst-case choices from $VS(d)$. Since this factor renders GRM uncompetitive on the ensuing experiments, we consider this modified and quite competitive rule whose spirit is the same.]

The next algorithm we consider, the Minimum Description Length Principle (MDL) [5, 6, 7, 1, 4], has rather different origins than GRM. MDL is actually a broad class of algorithms with a common information-theoretic motivation, each algorithm determined by the choice of a specific coding scheme for both functions and their training errors; this two-part code is then used to describe the training sample $S$. To illustrate the method, we give a coding scheme for the intervals model selection problem. [Footnote 2: Our goal here is simply to give one reasonable instantiation of MDL. Other coding schemes are obviously possible; however, several of our formal results will hold for essentially all MDL instantiations.] Let $h$ be a function with exactly $d$ alternations of label (thus, $h \in F_d$). To describe the behavior of $h$ on the sample $S = \{\langle x_i, b_i \rangle\}$, we can simply specify the $d$ inputs where $h$ switches value (that is, the indices $i$ such that $h(x_i) \neq h(x_{i+1})$). [Footnote 3: In the full paper we justify our use of the sample points to describe $h$; it is quite similar to representing $h$ using a grid of resolution $1/p(m)$ for some polynomial $p(\cdot)$.] This takes $\log \binom{m}{d}$ bits; dividing by $m$ to normalize, we obtain $(1/m)\log\binom{m}{d} \approx H(d/m)$ [2], where $H$ is the binary entropy function. Now given $h$, the labels in $S$ can be described simply by coding the mistakes of $h$ (that is, those indices $i$ where $h(x_i) \neq f(x_i)$), at a normalized cost of $H(\hat{\epsilon}(h))$. Technically, in the coding scheme just described we also need to specify the values of $d$ and $\hat{\epsilon}(h) \cdot m$, but the cost of these is negligible. Thus, the version of MDL that we shall examine for the intervals model selection problem dictates the following choice of $\tilde{d}$:

$$\tilde{d} = \mathrm{argmin}_d\left\{ H(\hat{\epsilon}(d)) + H(d/m) \right\}. \quad (2)$$

In the context of model selection, GRM and MDL can both be interpreted as attempts to model $\epsilon(d)$ by transforming $\hat{\epsilon}(d)$ and $d$. More formally, a model selection algorithm of the form

$$\tilde{d} = \mathrm{argmin}_d\left\{ G(\hat{\epsilon}(d), d/m) \right\} \quad (3)$$

shall be called a penalty-based algorithm. [Footnote 4: With appropriately modified assumptions, all of the formal results in the paper hold for the more general form $G(\hat{\epsilon}(d), d, m)$, where we decouple the dependence on $d$ and $m$. However, the simpler coupled form will suffice for our purposes.] Notice that an ideal penalty-based algorithm would obey $G(\hat{\epsilon}(d), d/m) \approx \epsilon(d)$ (or at least $G(\hat{\epsilon}(d), d/m)$ and $\epsilon(d)$ would be minimized by the same value of $d$).

The third model selection algorithm that we examine has a different spirit than the penalty-based algorithms. In cross validation (CV) [9, 10], we use only a fraction $1 - \gamma$ of the examples in $S$ to obtain the hypothesis sequence $\tilde{h}_1 \in F_1, \ldots, \tilde{h}_d \in F_d, \ldots$; that is, $\tilde{h}_d$ is now $L(S', d)$, where $S'$ consists of the first $(1-\gamma)m$ examples in $S$. Here $\gamma \in [0,1]$ is a parameter of the CV algorithm whose tuning we discuss briefly later. CV chooses $\tilde{d}$ according to the rule

$$\tilde{d} = \mathrm{argmin}_d\left\{ \hat{\epsilon}_{S''}(\tilde{h}_d) \right\} \quad (4)$$

where $\hat{\epsilon}_{S''}(\tilde{h}_d)$ is the error of $\tilde{h}_d$ on $S''$, the last $\gamma m$ examples of $S$ that were withheld in selecting $\tilde{h}_d$. Notice that for CV, we expect the quantity $\epsilon(\tilde{d}) = \epsilon(\tilde{h}_{\tilde{d}})$ to be (perhaps considerably) larger than in the case of GRM and MDL, because now $\tilde{h}_d$ was chosen on the basis of only $(1-\gamma)m$ examples rather than all $m$ examples. For this reason we wish to introduce the more general notation $\epsilon^{\gamma}(d) = \epsilon(\tilde{h}_d)$ to indicate the fraction $\gamma$ of the sample withheld from training. CV settles for $\epsilon^{\gamma}(d)$ instead of $\epsilon^{0}(d)$ in order to have an independent test set with which to directly estimate $\epsilon^{\gamma}(d)$.
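The three selection rules are easy to state in code. The sketch below is our own illustration, not the authors' implementation; it assumes a training error curve eps_hat indexed by complexity $d$ (as produced, for example, by the training_error_curve sketch above) and that $d$ equals the VC dimension, as in the intervals problem.

```python
import numpy as np

def binary_entropy(x):
    """H(x) in bits, with H(0) = H(1) = 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -(x * np.log2(x) + (1 - x) * np.log2(1 - x))

def grm_select(eps_hat, m):
    """Equation (1): argmin_d eps_hat(d) + (d/m)(1 + sqrt(1 + eps_hat(d) m / d))."""
    d = np.arange(1, len(eps_hat))                 # start at d = 1 to avoid d = 0 division
    penalty = eps_hat[d] + (d / m) * (1.0 + np.sqrt(1.0 + eps_hat[d] * m / d))
    return int(d[np.argmin(penalty)])

def mdl_select(eps_hat, m):
    """Equation (2): argmin_d H(eps_hat(d)) + H(d/m)."""
    d = np.arange(len(eps_hat))
    penalty = binary_entropy(eps_hat[d]) + binary_entropy(d / m)
    return int(np.argmin(penalty))

def cv_select(holdout_error):
    """Equation (4): holdout_error[d] is the error on the withheld gamma-fraction
    of the hypothesis trained on the first (1 - gamma) fraction of the sample."""
    return int(np.argmin(holdout_error))
```

For CV, one would rerun the learner on the first $(1-\gamma)m$ examples only, and score each resulting $\tilde{h}_d$ on the withheld $\gamma m$ examples to obtain holdout_error.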

4 A Controlled Experimental Comparison

Our results begin with a comparison of the performance and properties of the three model selection algorithms in a carefully controlled experimental setting, namely, the intervals model selection problem. Among the advantages of such controlled experiments, at least in comparison to empirical results on data of unknown origin, are our ability to exactly measure generalization error (since we know the target function and the distribution generating the data), and our ability to precisely study the effects of varying parameters of the data (such as noise rate, target function complexity, and sample size) on the performance of model selection algorithms. The experimental behavior we observe foreshadows a number of important themes that we shall revisit in our formal results.

We begin with Figure 2. To obtain this figure, a training sample was generated from the uniform input distribution and labeled according to an intervals function over $[0,1]$ consisting of 100 intervals of alternating label and equal width; the sample was corrupted with noise at rate $\eta = 0.2$. [Footnote 5: Similar results hold for a randomly chosen target function.] In Figure 2, we have plotted the true generalization errors (measured with respect to the noise-free source of examples) $\epsilon_{GRM}$, $\epsilon_{MDL}$ and $\epsilon_{CV}$ (using test fraction $\gamma = 0.1$ for CV) of the hypotheses selected from the sequence $\tilde{h}_1, \ldots, \tilde{h}_d, \ldots$ by each of the three algorithms as a function of sample size $m$, which ranged from 1 to 3000 examples. As described in Section 2, the hypotheses $\tilde{h}_d$ were obtained by minimizing the training error within each class $F_d$. Details of the code used to perform these experiments will be provided in the full paper.

Figure 2 demonstrates the subtlety involved in comparing the three algorithms: in particular, we see that none of the three algorithms outperforms the others for all sample sizes. Thus we can immediately dismiss the notion that one of the algorithms examined can be said to be optimal for this problem in any standard sense. Getting into the details, we see that there is an initial regime (for $m$ from 1 to slightly less than 1000) in which MDL is the lowest of the three errors, sometimes outperforming GRM by a considerable margin. Then there is a second regime (for $m$ about 1000 to about 2500) where an interesting reversal of relative performance occurs, since now GRM is the lowest error, considerably outperforming MDL, which has temporarily leveled off. In both of these first two regimes, CV remains the intermediate performer. In the third and final regime, $\epsilon_{MDL}$ decreases rapidly to match $\epsilon_{GRM}$ and the slightly larger $\epsilon_{CV}$, and the performance of all three algorithms remains quite similar for all larger sample sizes.

Insight into the causes of Figure 2 is given by Figure 3, where for the same runs used to obtain Figure 2, we instead plot the quantities $\tilde{d}_{GRM}$, $\tilde{d}_{MDL}$ and $\tilde{d}_{CV}$, the value of $\tilde{d}$ chosen by GRM, MDL and CV respectively (thus, the "correct" value, in the sense of simply having the same number of intervals as the target function, is 100). Here we see that for small sample sizes, corresponding to the first regime discussed for Figure 2 above, $\tilde{d}_{GRM}$ is slowly approaching 100 from below, reaching and remaining at the target value by about $m = 1500$. Although we have not shown it explicitly, GRM is incurring nonzero training error throughout the entire range of $m$. In comparison, for a long initial period (corresponding to the first two regimes of $m$), MDL is simply choosing the shortest hypothesis that incurs no training error (and thus encodes both "legitimate" intervals and noise), and consequently $\tilde{d}_{MDL}$ grows in an uncontrolled fashion. More precisely, it can be shown that during this period $\tilde{d}_{MDL}$ is obeying $\tilde{d}_{MDL} \approx d_0 = 2\eta(1-\eta)m + (1-2\eta)^2 s$, where $s$ is the number of (equally spaced) intervals in the target function and $\eta$ is the noise rate (so for the current experiment, $s = 100$ and $\eta = 0.2$). This "overcoding" behavior of MDL is actually preferable, in terms of generalization error, to the initial "undercoding" behavior of GRM, as verified by Figure 2. Once $\tilde{d}_{GRM}$ approaches 100, however, the overcoding of MDL is a relative liability, resulting in the second regime. Figure 3 clearly shows that the transition from the second to the third regime (where approximate parity is achieved) is the direct result of a dramatic correction to $\tilde{d}_{MDL}$ from $d_0$ (defined above) to the target value of 100. Finally, $\tilde{d}_{CV}$ makes a more rapid but noisier approach to 100 than $\tilde{d}_{GRM}$, and in fact also overshoots 100, but much less dramatically than $\tilde{d}_{MDL}$. This more rapid initial increase again results in superior generalization error compared to GRM for small $m$, but the inability of $\tilde{d}_{CV}$ to settle at 100 results in slightly higher error for larger $m$. In the full paper, we examine the same plots of generalization error and hypothesis complexity for different values of the noise rate; here it must suffice to say that for $\eta = 0$, all three algorithms have comparable performance for all sample sizes, and as $\eta$ increases so do the qualitative effects discussed here for the $\eta = 0.2$ case (for instance, the duration of the second regime, where MDL is vastly inferior, increases with the noise rate).
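As a quick numerical check of this expression (our own arithmetic, using the experimental values above), at $m = 2000$, $\eta = 0.2$ and $s = 100$ it predicts

$$d_0 = 2\eta(1-\eta)m + (1-2\eta)^2 s = 2(0.2)(0.8)(2000) + (0.6)^2(100) = 640 + 36 = 676,$$

which is consistent with the global minimum of the MDL total penalty near 650 seen in Figure 4.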

The behavior of $\tilde{d}_{GRM}$ and $\tilde{d}_{MDL}$ in Figure 3 can be traced to the form of the total penalty functions for the two algorithms. For instance, in Figures 4 and 5, we plot the total MDL penalty $H(\hat{\epsilon}(d)) + H(d/m)$ as a function of $d$ for the fixed sample sizes $m = 2000$ and $m = 4000$ respectively, again using noise rate $\eta = 0.20$. At $m = 2000$, we see that the total penalty has its global minimum at approximately 650, which is roughly the zero training error value $d_0$ discussed above (we are still in the MDL overcoding regime at this sample size; see Figures 2 and 3). However, by this sample size, a significant local minimum has developed near the target value of $d = 100$. At $m = 4000$, this local minimum at $d = 100$ has become the global minimum. The rapid transition of $\tilde{d}_{MDL}$ that marks the start of the final regime of generalization error is thus explained by the switching of the global total penalty minimum from $d_0$ to 100. In Figure 6, we plot the total GRM penalty, just for the sample size $m = 2000$. The behavior of the GRM penalty is much more controlled; for each sample size, the total penalty has a single-minimum bowl shape, with the minimum lying to the left of $d = 100$ for small sample sizes and gradually moving over $d = 100$ and sharpening there for large $m$. As Figure 6 shows, the minimum already lies at $d = 100$ by $m = 2000$, as confirmed by Figure 3.

A natural question to pose after examining Figures 2 and 3 is the following: is there a penalty-based algorithm that enjoys the best properties of both GRM and MDL? By this we would mean an algorithm that approaches the "correct" $d$ value (whatever it may be for the problem in hand) more rapidly than GRM, but does so without suffering the long, uncontrolled "overcoding" period of MDL. An obvious candidate for such an algorithm is simply a modified version of GRM or MDL, in which we reason (for example) that perhaps the GRM penalty for complexity is too large for this problem (resulting in the initial reluctance to code), and we thus multiply the complexity penalty term in the GRM rule (the second term inside the argmin$\{\cdot\}$ in Equation (1)) by a constant less than 1 (or analogously, multiply the MDL complexity penalty term by a constant greater than 1 to reduce overcoding). The results of an experiment on such a modified version of GRM are shown in Figures 7 and 8, where the original GRM performance is compared to a modified version in which the complexity penalty is multiplied by 0.5. Interestingly and perhaps unfortunately, we see that there is no free lunch: while the modified version does indeed code more rapidly and thus reduces the small-$m$ generalization error, this comes at the cost of a subsequent overcoding regime with a corresponding degradation in generalization error (and in fact a considerably slower return to $d = 100$ than MDL under the same conditions). [Footnote 6: Similar results are obtained in experiments in which every occurrence of $d$ in the GRM rule is replaced by an "effective dimension" $c_0 d$ for any constant $0 < c_0 < 1$.] The reverse phenomenon (reluctance to code) is experienced for MDL with an increased complexity penalty multiplier (details in the full paper).
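The multiplier experiment is simple to reproduce given the grm_select sketch above; the variant below (our own naming, following the description of Figures 7 and 8) scales only the complexity penalty term.

```python
import numpy as np

def grm_select_scaled(eps_hat, m, c=0.5):
    """GRM with the complexity penalty multiplied by c: c < 1 codes more
    aggressively (as in Figures 7 and 8), c > 1 codes more reluctantly."""
    d = np.arange(1, len(eps_hat))
    penalty = eps_hat[d] + c * (d / m) * (1.0 + np.sqrt(1.0 + eps_hat[d] * m / d))
    return int(d[np.argmin(penalty)])
```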

Let us summarize the key points demonstrated by these experiments. First, none of the three algorithms dominates the others for all sample sizes. Second, the two penalty-based algorithms seem to have a bias either towards or against coding that is overcome by the inherent properties of the data asymptotically, but that can have a large effect on generalization error for small to moderate sample sizes. Third, this bias cannot be overcome simply by adjusting the relative weight of error and complexity penalties, without reversing the bias of the resulting rule and suffering increased generalization error for some range of $m$. Fourth, while CV is not the best of the algorithms for any value of $m$, it does manage to fairly closely track the best penalty-based algorithm for each value of $m$, and considerably beats both GRM and MDL in their regimes of weakness. We now turn our attention to our formal results, where each of these key points will be developed further.

5 A Bound on Generalization Error for Penalty-Based Algorithms

We begin our formal results with a bound on the generalization error for penalty-based algorithms that enjoys three features. First, it is general: it applies to practically any penalty-based algorithm, and holds for any model selection problem (of course, there is a price to pay for such generality, as discussed below). Second, for certain algorithms and certain problems the bound can give rapid rates of convergence to small error. Third, the form of the bound is suggestive of some of the behavior seen in the experimental results. We state the bound for the special but natural case in which the underlying learning algorithm $L$ is training error minimization; in the full paper, we will present a straightforward analogue for more general $L$. Both this theorem and Theorem 2 in the following section are stated for the noise-free case; but again, straightforward generalizations to the noisy case will be included in the full paper.

Theorem 1. Let $(\{F_d\}, f, D, L)$ be an instance of the model selection problem in which $L$ performs training error minimization, and assume for convenience that $d$ is the VC dimension of $F_d$. Let $G : [0,1] \times \Re^{+} \to \Re$ be a function that is continuous and increasing in both its arguments, and let $\epsilon_G(m)$ denote the expected generalization error of the penalty-based model selection algorithm $\tilde{d} = \mathrm{argmin}_d\{G(\hat{\epsilon}(d), d/m)\}$ on a training sample of size $m$. Then

$$\epsilon_G(m) \leq R_G(m) + \sqrt{\tilde{d}/m} \quad (5)$$

where $R_G(m)$ approaches $\min_d\{\epsilon_{opt}(d)\}$ (which is the best generalization error achievable in any of the classes $F_d$) as $m \to \infty$. The rate of this approach will depend on properties of $G$.

[Footnote 7: We remind the reader that our bounds contain hidden logarithmic factors that we specify in the full paper.]

Proof: For any value of $d$, we have the inequality

$$G(\hat{\epsilon}(\tilde{d}), \tilde{d}/m) \leq G(\hat{\epsilon}(d), d/m) \quad (6)$$

because $\tilde{d}$ is chosen to minimize $G(\hat{\epsilon}(d), d/m)$. Using the uniform convergence bound $|\epsilon(h) - \hat{\epsilon}(h)| \leq \sqrt{d/m}$ for all $h \in F_d$ and the fact that $G(\cdot,\cdot)$ is increasing in its first argument, we can replace the occurrence of $\hat{\epsilon}(\tilde{d})$ on the left-hand side of Equation (6) by $\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m}$ to obtain a smaller quantity, and we can replace the occurrence of $\hat{\epsilon}(d)$ on the right-hand side by $\epsilon_{opt}(d) + \sqrt{d/m}$ to obtain a larger quantity. This gives

$$G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m},\, \tilde{d}/m) \leq G(\epsilon_{opt}(d) + \sqrt{d/m},\, d/m). \quad (7)$$

Now because $G(\cdot,\cdot)$ is an increasing function of its second argument, we can further weaken Equation (7) to obtain

$$G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m},\, 0) \leq G(\epsilon_{opt}(d) + \sqrt{d/m},\, d/m). \quad (8)$$

If we define $G_0(x) = G(x, 0)$, then since $G(\cdot,\cdot)$ is increasing in its first argument, $G_0^{-1}(\cdot)$ is well-defined, and we may write

$$\epsilon(\tilde{d}) \leq G_0^{-1}\left(G(\epsilon_{opt}(d) + \sqrt{d/m},\, d/m)\right) + \sqrt{\tilde{d}/m}. \quad (9)$$

Now fix any small value $\epsilon > 0$. For this $\epsilon$, let $d_0$ be the smallest value satisfying $\epsilon_{opt}(d_0) \leq \min_d\{\epsilon_{opt}(d)\} + \epsilon$; thus, $d_0$ is sufficient complexity to almost match the approximative power of arbitrarily large complexity. Examining the behavior of $G_0^{-1}(G(\epsilon_{opt}(d_0) + \sqrt{d_0/m},\, d_0/m))$ as $m \to \infty$, we see that the arguments approach the point $(\epsilon_{opt}(d_0), 0)$, and so $G_0^{-1}(G(\epsilon_{opt}(d_0) + \sqrt{d_0/m},\, d_0/m))$ approaches $G_0^{-1}(G(\epsilon_{opt}(d_0), 0)) = \epsilon_{opt}(d_0) \leq \min_d\{\epsilon_{opt}(d)\} + \epsilon$ by continuity of $G(\cdot,\cdot)$, as desired. By defining

$$R_G(m) = \min_d\left\{G_0^{-1}\left(G(\epsilon_{opt}(d) + \sqrt{d/m},\, d/m)\right)\right\} \quad (10)$$

we obtain the statement of the theorem.

Let us now discuss the form of the bound given in Theorem 1. The first term $R_G(m)$ approaches the optimal generalization error achievable within any of the classes $F_d$ in the limit of large $m$, and the second term directly penalizes large complexity. These terms may be thought of as competing. In order for $R_G(m)$ to approach $\min_d\{\epsilon_{opt}(d)\}$ rapidly and not just asymptotically (that is, in order to have a fast rate of convergence), $G(\cdot,\cdot)$ should not penalize complexity too strongly, which is obviously at odds with the optimization of the term $\sqrt{\tilde{d}/m}$. For example, consider $G(\hat{\epsilon}(d), d/m) = \hat{\epsilon}(d) + (d/m)^{\alpha}$ for some power $\alpha > 0$. Assuming $d \leq m$, this rule is conservative (large penalty for complexity) for small $\alpha$, and liberal (small penalty for complexity) for large $\alpha$. Thus, to make the term $\sqrt{\tilde{d}/m}$ small we would like $\alpha$ to be small, to prevent the choice of large $\tilde{d}$. However, $R_G(m) = \min_d\{\epsilon_{opt}(d) + \sqrt{d/m} + (d/m)^{\alpha}\}$, which increases as $\alpha$ decreases, thus encouraging large $\alpha$ (liberal coding).

Ideally, we might want $G(\cdot,\cdot)$ to balance the two terms of the bound, which implicitly involves finding an appropriately controlled but sufficiently rapid rate of increase in $\tilde{d}$. The tension between these two criteria in the bound echoes the same tension that was seen experimentally: for MDL, there was a long period of essentially uncontrolled growth of $\tilde{d}$ (linear in $m$), and this uncontrolled growth prevented any significant decay of generalization error (Figures 2 and 3). GRM had controlled growth of $\tilde{d}$, and thus would incur negligible error from our second term; but perhaps this growth was too controlled, as it results in the initially slow (small $m$) decrease in generalization error.

To examine these issues further, we now apply the bound of Theorem 1 to several penalty-based algorithms. In some cases the final form of the bound given in the theorem statement, while easy to interpret, is unnecessarily coarse, and better rates of convergence can be obtained by directly appealing to the proof of the theorem.

We begin with a simplified GRM variant (SGRM), defined by $G(\hat{\epsilon}(d), d/m) = \hat{\epsilon}(d) + \sqrt{d/m}$. For this algorithm, we observe that we can avoid weakening Equation (7) to Equation (8), because here $G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m},\, \tilde{d}/m) = \epsilon(\tilde{d})$. Thus the dependence on $\tilde{d}$ in the bound disappears entirely, resulting in

$$\epsilon_{SGRM}(m) \leq \min_d\left\{\epsilon_{opt}(d) + 2\sqrt{d/m}\right\}. \quad (11)$$

This is not so mysterious, since SGRM penalizes strongly for complexity (even more so than GRM). This bound expresses the generalization error as the minimum of the sum of the best possible error within each class $F_d$ and a penalty for complexity. Such a bound seems entirely reasonable, given that it is essentially the expected value of the empirical quantity we minimized to choose $\tilde{d}$ in the first place. Furthermore, if $\epsilon_{opt}(d) + \sqrt{d/m}$ approximates $\epsilon(d)$ well, then such a bound is about the best we could hope for. However, there is no reason in general to expect this to be the case. Bounds of this type were first given by Barron and Cover [1] in the context of density estimation.
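To get a feel for the rate this bound gives, here is a small calculation of our own, under the power-law approximation rate $\epsilon_{opt}(d) \leq c_0/d$ discussed in Section 6. Setting the derivative with respect to $d$ to zero in the right-hand side of Equation (11),

$$\min_d\left\{\frac{c_0}{d} + 2\sqrt{\frac{d}{m}}\right\}: \qquad -\frac{c_0}{d^2} + \frac{1}{\sqrt{dm}} = 0 \;\Longrightarrow\; d^{\ast} = c_0^{2/3}\, m^{1/3},$$

and substituting back gives $c_0/d^{\ast} + 2\sqrt{d^{\ast}/m} = 3\, c_0^{1/3}\, m^{-1/3}$. Under this assumption, then, the SGRM bound decays like $m^{-1/3}$.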

As an example of the application of Theorem 1 to MDL, we can derive the following bound on $\epsilon_{MDL}(m)$:

$$\epsilon_{MDL}(m) \leq \min_d\left\{H^{-1}\left(H(\epsilon_{opt}(d) + \sqrt{d/m}) + H(d/m)\right)\right\} + \sqrt{\tilde{d}_{MDL}/m} \quad (12)$$

$$\leq \min_d\left\{H(\epsilon_{opt}(d)) + 2H(\sqrt{d/m})\right\} + \sqrt{\tilde{d}_{MDL}/m} \quad (13)$$

where we have used $H^{-1}(y) \leq y$ and $H(x + y) \leq H(x) + H(y)$. Again, we emphasize that the bound given by Equation (13) is vacuous without a bound on $\tilde{d}_{MDL}$, which we know from the experiments can be of order $m$. However, by combining this bound with an analysis of the behavior of $\tilde{d}_{MDL}$ for the intervals problem, we can give an accurate theoretical explanation for the experimental findings for MDL (details in the full paper).

As a final example, we apply Theorem 1 to a variant of MDL in which the penalty for coding is increased over the original, namely $G(\hat{\epsilon}(d), d/m) = H(\hat{\epsilon}(d)) + (1/\lambda^2)H(d/m)$, where $\lambda$ is a parameter that may depend on $d$ and $m$. Assuming that we never choose $\tilde{d}$ whose total penalty is larger than 1 (which holds if we simply add the "fair coin hypothesis" to $F_1$), we have that $H(\tilde{d}/m) \leq \lambda^2$. Since $H(x) \geq x$ for all $x$, it follows that $\tilde{d}/m \leq \lambda^2$. If $\lambda$ is some decreasing function of $m$ (say, $m^{-\beta}$ for some $0 < \beta < 1$), then the bound on $\epsilon_G(m)$ given by Theorem 1 decreases at a reasonable rate.

6 A Bound on the Additional Error of CV

In this section we state a general theorem bounding the additional generalization error suffered by cross validation compared to any polynomial complexity model selection algorithm $M$. By this we mean that given a sample of size $m$, algorithm $M$ will never choose a value of $\tilde{d}$ larger than $m^k$ for some fixed exponent $k > 1$. We emphasize that this is a mild condition that is met in practically every realistic model selection problem: although there are many documented circumstances in which we may wish to choose a model whose complexity is on the order of the sample size, we do not imagine wanting to choose, for instance, a neural network with a number of nodes exponential in the sample size. In any case, more general but more complicated assumptions may be substituted for the notion of polynomial complexity, and we discuss these in the full paper.

Theorem 2. Let $M$ be any polynomial complexity model selection algorithm, and let $(\{F_d\}, f, D, L)$ be any instance of model selection. Let $\epsilon_M(m)$ and $\epsilon_{CV}(m)$ denote the expected generalization error of the hypotheses chosen by $M$ and CV respectively. Then

$$\epsilon_{CV}(m) \leq \epsilon_M((1-\gamma)m) + O\left(\sqrt{\log(m)/(\gamma m)}\right). \quad (14)$$

In other words, the generalization error of CV on $m$ examples is at most the generalization error of $M$ on $(1-\gamma)m$ examples, plus the "test penalty term" $O(\sqrt{\log(m)/(\gamma m)})$.

Proof Sketch: Let $S = (S', S'')$ be a random sample of $m$ examples, where $|S'| = (1-\gamma)m$ and $|S''| = \gamma m$. Let $d_{max} = ((1-\gamma)m)^k$ be the polynomial bound on the complexity selected by $M$, and let $\tilde{h}_1 \in F_1, \ldots, \tilde{h}_{d_{max}} \in F_{d_{max}}$ be determined by $\tilde{h}_d = L(S', d)$. By definition of CV, $\tilde{d}$ is chosen according to $\tilde{d} = \mathrm{argmin}_d\{\hat{\epsilon}_{S''}(\tilde{h}_d)\}$. By standard uniform convergence arguments we have that $|\hat{\epsilon}_{S''}(\tilde{h}_d) - \epsilon(\tilde{h}_d)| = O(\sqrt{\log(d_{max})/(\gamma m)})$ for all $d \leq d_{max}$ with high probability over the draw of $S''$. Therefore with high probability

$$\epsilon_{CV} = \min_d\{\epsilon(\tilde{h}_d)\} + O\left(\sqrt{\log(m)/(\gamma m)}\right). \quad (15)$$

But as we have previously observed, the generalization error of any model selection algorithm (including $M$) on input $S'$ is lower bounded by $\min_d\{\epsilon(\tilde{h}_d)\}$, and our claim directly follows.

Note that the bound of Theorem 2 does not claim $\epsilon_{CV}(m) \leq \epsilon_M(m)$ for all $M$ (which would mean that cross validation is an optimal model selection algorithm). The bound given is weaker than this ideal in two important ways. First, and perhaps most importantly, $\epsilon_M((1-\gamma)m)$ may be considerably larger than $\epsilon_M(m)$. This could either be due to properties of the underlying learning algorithm $L$, or due to inherent phase transitions (sudden decreases) in the optimal information-theoretic learning curve [8, 3]; thus, in an extreme case, it could be that the generalization error that can be achieved within some class $F_d$ by training on $m$ examples is close to 0, but that the optimal generalization error that can be achieved in $F_d$ by training on a slightly smaller sample is near $1/2$. This is intuitively the worst case for cross validation (when the small fraction of the sample saved for testing was critically needed for training in order to achieve nontrivial performance) and is reflected in the first term of our bound. Obviously the risk of "missing" phase transitions can be minimized by decreasing the test fraction $\gamma$, but only at the expense of increasing the test penalty term, which is the second way in which our bound falls short of the ideal. However, unlike the potentially unbounded difference $\epsilon_M((1-\gamma)m) - \epsilon_M(m)$, our bound on the test penalty can be decreased without any problem-specific knowledge by simply increasing the test fraction $\gamma$.

Despite these two competing sources of additional CV error, the bound has some strengths that are worth discussing. First of all, the bound holds for any model selection problem instance $(\{F_d\}, f, D, L)$. We believe that giving similarly general bounds for any penalty-based algorithm would be extremely difficult, if not impossible. The reason for this belief arises from the diversity of learning curve behavior documented by the statistical mechanics approach [8, 3], among other sources. In the same way that there is no universal learning curve behavior, there is no universal behavior for the relationship between the functions $\hat{\epsilon}(d)$ and $\epsilon(d)$; the relationship between these quantities may depend critically on the target function and the input distribution (this point is made more formally in Section 7). CV is sensitive to this dependence by virtue of its target function-dependent and distribution-dependent estimate of $\epsilon(d)$. In contrast, by their very nature, penalty-based algorithms propose a universal penalty to be assigned to the observation of error $\hat{\epsilon}(h)$ for a hypothesis $h$ of complexity $d$.

A more technical feature of Theorem 2 is that it can be combined with bounds derived for penalty-based algorithms using Theorem 1 to suggest how the parameter $\gamma$ should be tuned. For example, letting $M$ be the SGRM algorithm described in Section 5, and combining Equation (11) with Theorem 2 yields

$$\epsilon_{CV}(m) \leq \epsilon_{SGRM}((1-\gamma)m) + O\left(\sqrt{\log(m)/(\gamma m)}\right) \quad (16)$$

$$\leq \min_d\left\{\epsilon_{opt}(d) + 2\sqrt{d/((1-\gamma)m)}\right\} + O\left(\sqrt{\log(m)/(\gamma m)}\right). \quad (17)$$

If we knew the form of $\epsilon_{opt}(d)$ (or even had bounds on it), then in principle we could minimize the bound of Equation (17) as a function of $\gamma$ to derive a recommended training/test split. Such a program is feasible for many specific problems (such as the intervals problem), or by investigating general but plausible bounds on the approximation rate $\epsilon_{opt}(d)$, such as $\epsilon_{opt}(d) \leq c_0/d$ for some constant $c_0 > 0$. We pursue this line of inquiry in some detail in the full paper. For now, we simply note that Equation (17) tells us that in cases for which the power law decay of generalization error within each $F_d$ holds approximately, the performance of CV will be competitive with GRM or any other algorithm. This makes perfect sense in light of the preceding analysis of the two sources for additional CV error: in problems with power law learning curve behavior, we have a power law bound on $\epsilon_M((1-\gamma)m) - \epsilon_M(m)$, and thus CV "tracks" any other algorithm closely in terms of generalization error. This is exactly the behavior observed in the experiments described in Section 4, for which the power law is known to hold approximately.
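As an illustration of this program (our own sketch, not the derivation deferred to the full paper), one can minimize the right-hand side of Equation (17) numerically under the assumed power law $\epsilon_{opt}(d) = c_0/d$, treating the hidden constant in the $O(\cdot)$ term as a parameter:

```python
import numpy as np

def eq17_bound(gamma, m, c0=1.0, const=1.0):
    """Right-hand side of Equation (17) under eps_opt(d) = c0/d, with the
    O(.) test penalty taken as const * sqrt(log(m) / (gamma * m))."""
    d = np.arange(1, m)
    train_term = np.min(c0 / d + 2.0 * np.sqrt(d / ((1.0 - gamma) * m)))
    test_term = const * np.sqrt(np.log(m) / (gamma * m))
    return train_term + test_term

m = 2000
gammas = np.linspace(0.01, 0.5, 50)
best_gamma = gammas[np.argmin([eq17_bound(g, m) for g in gammas])]
```

Since the hidden constant affects the exact optimum, the recommended split produced this way should be read qualitatively rather than literally.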

7 Limitations on Penalty-Based Algorithms

Recall that our experimental findings suggested that it may sometimes be fair to think of penalty-based algorithms as being either conservative or liberal in the amount of coding they are willing to allow in their hypothesis, and that bias in either direction can result in suboptimal generalization that is not easily overcome by tinkering with the form of the rule. In this section we treat this intuition more formally, by giving a theorem demonstrating some fundamental limitations on the diversity of problems that can be effectively handled by any fixed penalty-based algorithm. Briefly, we show that there are (at least) two very different forms that the relationship between $\hat{\epsilon}(d)$ and $\epsilon(d)$ can assume, and that any penalty-based algorithm can perform well on only one of these. Furthermore, for the problems we choose, CV can in fact succeed on both. Thus we are doing more than simply demonstrating that no model selection algorithm can succeed universally for all target functions, a statement that is intuitively obvious. We are in fact identifying a weakness that is special to penalty-based algorithms. However, as we have discussed previously, the use of CV is not without pitfalls of its own. We therefore conclude the paper in Section 8 with a summary of the different risks involved with each type of algorithm, and a discussion of our belief that in the absence of detailed problem-specific knowledge, our overall analysis favors the use of CV.

Theorem 3. For any sample size $m$, there are model selection problem instances $(\{F^1_d\}, f_1, D_1, L)$ and $(\{F^2_d\}, f_2, D_2, L)$ (where $L$ performs empirical error minimization in both instances) and a constant $\lambda$ independent of $m$ such that for any penalty-based model selection algorithm $G$, either

$$\epsilon^1_G(m) \geq \min_d\{\epsilon^1(d)\} + \lambda \quad \text{or} \quad \epsilon^2_G(m) \geq \min_d\{\epsilon^2(d)\} + \lambda.$$

Here $\epsilon^i(d)$ is the function $\epsilon(d)$ for instance $i \in \{1,2\}$, and $\epsilon^i_G(m)$ is the expected generalization error of algorithm $G$ for instance $i$. Thus, on at least one of the two model selection problems, the generalization error of $G$ is lower bounded away from the optimal value $\min_d\{\epsilon^i(d)\}$ by a constant independent of $m$.

Proof Sketch: For notational convenience, in the proof we use $\hat{\epsilon}_i(d)$ and $\epsilon_i(d)$ ($i \in \{1,2\}$) to refer to the expected values of these functions. We start with a rough description of the properties of the two problems (see Figure 9): in Problem 1, the "right" choice of $d$ is 0, any additional coding directly results in larger generalization error, and the training error $\hat{\epsilon}_1(d)$ decays gradually with $d$. In Problem 2, a large amount of coding is required to achieve nontrivial generalization error, and the training error remains large as $d$ increases until $d = m/2$, where the training error drops rapidly.

More precisely, we will arrange things so that the first model selection problem (Problem 1) has the following properties: (1) the function $\hat{\epsilon}_1(d)$ lies between two linear functions with $y$-intercepts $\eta_1$ and $(1+\lambda_1)\eta_1$ and common $x$-intercept $2\eta_1(1-\eta_1)m = m/2$; and (2) $\epsilon_1(d)$ is minimized at $d = 0$, and furthermore, for any constant $c$ we have $\epsilon_1(cm) \geq c/2$. We will next arrange that the second model selection problem (Problem 2) will obey: (1) the function $\hat{\epsilon}_2(d) = a_1$ for $0 \leq d < 2\eta_1(1-\eta_1)m = m/2$, where $a_1 < \eta_1$; and (2) the function $\epsilon_2(d)$ is lower bounded by $a_1$ for $0 \leq d < m/2$, but $\epsilon_2(m/2) \approx 0$. In Figure 9 we illustrate the conditions on $\hat{\epsilon}(d)$ for the two problems, and also include hypothetical instances of $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$ that are consistent with these conditions (and are furthermore representative of the "true" behavior of the $\hat{\epsilon}$ functions actually obtained for the two problems we define momentarily).

We can now give the underlying logic of the proof using the hypothetical $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$. Let $\tilde{d}_1$ denote the complexity chosen by $G$ for Problem 1, and let $\tilde{d}_2$ be defined similarly. First consider the behavior of $G$ on Problem 2. In this problem we know by our assumptions on $\epsilon_2(d)$ that if $G$ fails to choose $\tilde{d}_2 \geq m/2$, then $\epsilon^2_G \geq a_1$, already giving a constant lower bound on $\epsilon^2_G$ for this problem. This is the easier case; thus let us assume that $\tilde{d}_2 \geq m/2$, and consider the behavior of $G$ on Problem 1. Referring to Figure 9, we see that for $0 \leq d \leq D_0$, $\hat{\epsilon}_1(d) \geq \hat{\epsilon}_2(d)$, and thus

For $0 \leq d \leq D_0$: $\quad G(\hat{\epsilon}_1(d), d/m) \geq G(\hat{\epsilon}_2(d), d/m) \quad (18)$

(because penalty-based algorithms assign greater penalties for greater training error or greater complexity). Since we have assumed that $\tilde{d}_2 \geq m/2$, we know that

For $d < m/2$: $\quad G(\hat{\epsilon}_2(d), d/m) \geq G(\hat{\epsilon}_2(\tilde{d}_2), \tilde{d}_2/m) \quad (19)$

and in particular, this inequality holds for $0 \leq d \leq D_0$. On the other hand, by our choice of $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$, $\hat{\epsilon}_1(\tilde{d}_2) = \hat{\epsilon}_2(\tilde{d}_2) = 0$. Therefore

$$G(\hat{\epsilon}_1(\tilde{d}_2), \tilde{d}_2/m) = G(\hat{\epsilon}_2(\tilde{d}_2), \tilde{d}_2/m). \quad (20)$$

Combining the two inequalities above (Equation 18 and Equation 19) with Equation 20, we have that

For $0 \leq d \leq D_0$: $\quad G(\hat{\epsilon}_1(d), d/m) \geq G(\hat{\epsilon}_1(\tilde{d}_2), \tilde{d}_2/m) \quad (21)$

from which it directly follows that in Problem 1, $G$ cannot choose $\tilde{d}_1 \leq D_0$. By the second condition on Problem 1 above, this implies that $\epsilon^1_G \geq \epsilon_1(D_0)$; if we arrange that $D_0 = c'm$ for some constant $c'$, then we have a constant lower bound on $\epsilon^1_G$ for Problem 1.

Due to space limitations, we defer the precise descriptions of Problems 1 and 2 to the full paper. However, in Problem 1 the classes $F_d$ are essentially those for the intervals model selection problem, and in Problem 2 the $F_d$ are based on parity functions.

We note that although Theorem 3 was designed to create two model selection problems with the most disparate behavior possible, the proof technique can be used to give lower bounds on the generalization error of penalty-based algorithms under more general settings. In the full paper we will also argue that for the two problems considered, the generalization error of CV is in fact close to $\min_d\{\epsilon^i(d)\}$ (that is, within a small additive term that decreases rapidly with $m$) for both problems. Finally, we remark that Theorem 3 can be strengthened to hold for a single model selection problem (that is, a single function class sequence and distribution), with only the target function changing to obtain the two different behaviors. This rules out the salvation of the penalty-based algorithms via problem-specific parameters to be tuned, such as "effective dimension".

8 Conclusions

Based on both our experimental and theoretical results, we offer the following conclusions:

- Model selection algorithms that attempt to reconstruct the curve $\epsilon(d)$ solely by examining the curve $\hat{\epsilon}(d)$ often have a tendency to overcode or undercode in their hypothesis for small sample sizes, which is exactly the sample size regime in which model selection is an issue. Such tendencies are not easily eliminated without suffering the reverse tendency.

- There exist model selection problems in which a hypothesis whose complexity is close to the sample size should be chosen, and in which a hypothesis whose complexity is close to 0 should be chosen, but that generate $\hat{\epsilon}(d)$ curves with insufficient information to distinguish which is the case. The penalty-based algorithms cannot succeed in both cases, whereas CV can.

- The error of CV can be bounded in terms of the error of any other algorithm. The only cases in which the CV error may be dramatically worse are those in which phase transitions occur in the underlying learning curves at a sample size larger than that held out for training by CV.

Thus we see that both types of algorithms considered have their own Achilles' heel. For penalty-based algorithms, it is an inability to distinguish two types of problems that call for drastically different hypothesis complexities. For CV, it is phase transitions that unluckily fall between $(1-\gamma)m$ examples and $m$ examples. On balance, we feel that the evidence we have gathered favors use of CV in most common circumstances. Perhaps the best way of stating our position is as follows: given the general upper bound on CV error we have obtained, and the limited applicability of any fixed penalty-based rule demonstrated by Theorem 3 and the experimental results, the burden of proof lies with the practitioner who favors a penalty-based algorithm over CV. In other words, such a practitioner should have concrete evidence (experimental or theoretical) that their algorithm will outperform CV on the problem of interest. Such evidence must arise from detailed problem-specific knowledge, since we have demonstrated here the diversity of behavior that is possible in natural model selection problems.

Acknowledgements

We give warm thanks to Yoav Freund and Ronitt Rubinfeld for their collaboration on various portions of the work presented here, and for their insightful comments. Thanks to Sebastian Seung and Vladimir Vapnik for interesting and helpful conversations.

References

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034-1054, 1991.
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 76-87, 1994.
[4] J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80(3):227-248, 1989.
[5] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[6] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14(3):1080-1100, 1986.
[7] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[8] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056-6091, 1992.
[9] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111-147, 1974.
[10] M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.
[11] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[12] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

Figure 1: Experimental plots of the functions $\epsilon(d)$ (lower curve with local minimum), $\epsilon^{\eta}(d)$ (upper curve with local minimum) and $\hat{\epsilon}(d)$ (monotonically decreasing curve) versus complexity $d$ for a target function of 100 alternating intervals, sample size 2000 and noise rate $\eta = 0.2$. Each data point represents an average over 10 trials. The flattening of $\epsilon(d)$ and $\epsilon^{\eta}(d)$ occurs at the point where the noisy sample can be realized with no training error.

Figure 2: Experimental plots of generalization errors $\epsilon_{MDL}(m)$ (most rapid initial decrease), $\epsilon_{CV}(m)$ (intermediate initial decrease) and $\epsilon_{GRM}(m)$ (least rapid initial decrease) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 3: Experimental plots of hypothesis lengths $\tilde{d}_{MDL}(m)$ (most rapid initial increase), $\tilde{d}_{CV}(m)$ (intermediate initial increase) and $\tilde{d}_{GRM}(m)$ (least rapid initial increase) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 4: MDL total penalty $H(\hat{\epsilon}(d)) + H(d/m)$ versus complexity $d$ for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. There is a local minimum at approximately $d = 100$, and the global minimum at the point of consistency with the noisy sample.

Figure 5: MDL total penalty $H(\hat{\epsilon}(d)) + H(d/m)$ versus complexity $d$ for a single run on 4000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. The global minimum has now switched from the point of consistency to the target value of 100.

Figure 6: GRM total penalty $\hat{\epsilon}(d) + (d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$ versus complexity $d$ for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$.

Figure 7: Experimental plots of generalization error $\epsilon_{GRM}(m)$ using complexity penalty multipliers 1.0 (slow initial decrease) and 0.5 (rapid initial decrease) on the complexity penalty term $(d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$, versus sample size $m$ on a target of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 8: Experimental plots of hypothesis length $\tilde{d}_{GRM}(m)$ using complexity penalty multipliers 1.0 (slow initial increase) and 0.5 (rapid initial increase) on the complexity penalty term $(d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$, versus sample size $m$ on a target of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 9: Figure illustrating the proof of Theorem 3. The dark lines indicate typical behavior for the two training error curves $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$, and the dashed lines indicate the provable bounds on $\hat{\epsilon}_1(d)$.