An Experimental and Theoretical Comparison of Model Selection Methods

Michael Kearns, AT&T Bell Laboratories, Murray Hill, New Jersey
Yishay Mansour, Tel Aviv University, Tel Aviv, Israel
Andrew Y. Ng, Carnegie Mellon University, Pittsburgh, Pennsylvania
Dana Ron, Hebrew University, Jerusalem, Israel
1 Introduction

In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. Instances of the model selection problem include choosing the best number of hidden nodes in a neural network, determining the right amount of pruning to be performed on a decision tree, and choosing the degree of a polynomial fit to a set of points. In each of these cases, the goal is not to minimize the error on the training data, but to minimize the resulting generalization error.

Many model selection algorithms have been proposed in the literature of several different research communities, too many to productively survey here. (A more detailed history of the problem will be given in the full paper.) Perhaps surprisingly, despite the many proposed solutions for model selection and the diverse methods of analysis, direct comparisons between the different proposals (either experimental or theoretical) are rare.

The goal of this paper is to provide such a comparison, and more importantly, to describe the general conclusions to which it has led. Relying on evidence that is divided between controlled experimental results and related formal analysis, we compare three well-known model selection algorithms. We attempt to identify their relative and absolute strengths and weaknesses, and we provide some general methods for analyzing the behavior and performance of model selection algorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model selection algorithm (perhaps based in part on some known properties of the model selection problem being confronted).

The summary of the paper follows. In Section 2, we provide a formalization of the model selection problem. In this formalization, we isolate the problem of choosing the appropriate complexity for a hypothesis or model. We also introduce the specific model selection problem that will be the basis for our experimental results, and describe an initial experiment demonstrating that the problem is nontrivial. In Section 3, we introduce the three model selection algorithms we examine in the experiments: Vapnik's Guaranteed Risk Minimization (GRM) [11], an instantiation of Rissanen's Minimum Description Length Principle (MDL) [7], and Cross Validation (CV).

Section 4 describes our controlled experimental comparison of the three algorithms. Using artificially generated data from a known target function allows us to plot complete learning curves for the three algorithms over a wide range of sample sizes, and to directly compare the resulting generalization error to the hypothesis complexity selected by each algorithm. It also allows us to investigate the effects of varying other natural parameters of the problem, such as the amount of noise in the data. These experiments support the following assertions: the behavior of the algorithms examined can be complex and incomparable, even on simple problems, and there are fundamental difficulties in identifying a "best" algorithm; there is a strong connection between hypothesis complexity and generalization error; and it may be impossible to uniformly improve the performance of the algorithms by slight modifications (such as introducing constant multipliers on the complexity penalty terms).

In Sections 5, 6 and 7 we turn our efforts to formal results providing explanation and support for the experimental findings. We begin in Section 5 by upper bounding the error of any model selection algorithm falling into a wide class (called penalty-based algorithms) that includes both GRM and MDL (but not cross validation). The form of this bound highlights the competing desires for powerful hypotheses and controlled complexity. In Section 6, we upper bound the additional error suffered by cross validation compared to any other model selection algorithm. The quality of this bound depends on the extent to which the function classes have learning curves obeying a classical power law. Finally, in Section 7, we give an impossibility result demonstrating a fundamental handicap suffered by the entire class of penalty-based algorithms that does not afflict cross validation. In Section 8, we weigh the evidence and find that it provides concrete arguments favoring the use of cross validation (or at least cause for caution in using any penalty-based algorithm).

Footnotes: This research was done while Y. Mansour, A. Ng and D. Ron were visiting AT&T Bell Laboratories. Y. Mansour was supported in part by The Israel Science Foundation, administered by The Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology. D. Ron was supported by the Eshkol Fellowship, sponsored by the Israeli Ministry of Science.
2 Definitions

Throughout the paper we assume that a fixed boolean target function $f$ is used to label inputs drawn randomly according to a fixed distribution $D$. For any boolean function $h$, we define the generalization error $\epsilon(h) = \epsilon_{f,D}(h) = \Pr_{x \in D}[h(x) \neq f(x)]$. We use $S$ to denote the random variable $S = \langle x_1, b_1 \rangle, \ldots, \langle x_m, b_m \rangle$, where $m$ is the sample size, each $x_i$ is drawn randomly and independently according to $D$, and $b_i = f(x_i) \oplus c_i$, where the noise bit $c_i \in \{0, 1\}$ is 1 with probability $\eta$; we call $\eta \in [0, 1/2)$ the noise rate. In the case that $\eta \neq 0$, we will sometimes wish to discuss the generalization error of $h$ with respect to the noisy examples, so we define $\epsilon^{\eta}(h) = \Pr_{x \in D, c}[h(x) \neq f(x) \oplus c]$, where $c$ is the noise bit. Note that $\epsilon(h)$ and $\epsilon^{\eta}(h)$ are related by the equality $\epsilon^{\eta}(h) = (1 - \eta)\epsilon(h) + \eta(1 - \epsilon(h)) = (1 - 2\eta)\epsilon(h) + \eta$. For simplicity, we will use the expression "with high probability" to mean with probability $1 - \delta$ over the draw of $S$, at a cost of a factor of $\log(1/\delta)$ in the bounds; thus, our bounds all contain "hidden" logarithmic factors, but our handling of confidence is entirely standard and will be spelled out in the full paper.
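The noise model and the identity $\epsilon^{\eta}(h) = (1 - 2\eta)\epsilon(h) + \eta$ above can be checked with a small simulation. In the sketch below, the threshold target and hypothesis, the noise rate, and the sample size are all illustrative choices (not taken from the paper); the code estimates $\epsilon(h)$ and $\epsilon^{\eta}(h)$ by sampling and compares them against the identity.

```python
import random

random.seed(0)

# Illustrative target f and hypothesis h: threshold functions on [0, 1],
# so under the uniform distribution D, epsilon(h) = |0.5 - 0.3| = 0.2.
f = lambda x: x >= 0.5
h = lambda x: x >= 0.3

eta = 0.2      # noise rate, must lie in [0, 1/2)
m = 200_000    # number of random draws

err = noisy_err = 0
for _ in range(m):
    x = random.random()                # x drawn from D = uniform on [0, 1]
    c = random.random() < eta          # noise bit c, 1 with probability eta
    err += h(x) != f(x)                # disagreement with the clean label
    noisy_err += h(x) != (f(x) ^ c)    # disagreement with the noisy label b

eps = err / m            # estimate of epsilon(h); true value is 0.2
eps_eta = noisy_err / m  # estimate of epsilon^eta(h); true value is 0.32

# The identity should hold up to sampling error:
print(abs(eps_eta - ((1 - 2 * eta) * eps + eta)))  # small (sampling error)
```

Note that for $\eta < 1/2$ the map $\epsilon \mapsto (1 - 2\eta)\epsilon + \eta$ is strictly increasing, so ordering hypotheses by error on the noisy examples agrees with ordering them by true generalization error.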
We assume a nested sequence of hypothesis classes (or models) $F_1 \subseteq \cdots \subseteq F_d \subseteq \cdots$. The target function $f$ may or may not be contained in any of these classes, so we define $h_d = \mathrm{argmin}_{h \in F_d}\{\epsilon(h)\}$ and $\epsilon_{opt}(d) = \epsilon(h_d)$ (similarly, $\epsilon^{\eta}_{opt}(d) = \epsilon^{\eta}(h_d)$). Thus, $h_d$ is the best approximation to $f$ (with respect to $D$) in the class $F_d$, and $\epsilon_{opt}(d)$ measures the quality of this approximation. Note that $\epsilon_{opt}(d)$ is a non-increasing function of $d$ since the hypothesis function classes are nested. Thus, larger values of $d$ can only improve the potential approximative power of the hypothesis class. Of course, the difficulty is to realize this potential on the basis of a small sample.

With this notation, the model selection problem can be stated informally: on the basis of a random sample $S$ of a fixed size $m$, the goal is to choose a hypothesis complexity $\tilde{d}$, and a hypothesis $\tilde{h} \in F_{\tilde{d}}$, such that the resulting generalization error $\epsilon(\tilde{h})$ is minimized. In many treatments of model selection, including ours, it is explicitly or implicitly assumed that the model selection algorithm has control only over the choice of the complexity $\tilde{d}$, but not over the choice of the final hypothesis $\tilde{h} \in F_{\tilde{d}}$. It is assumed that there is a fixed algorithm that chooses a set of candidate hypotheses, one from each hypothesis class. Given this set of candidate hypotheses, the model selection algorithm then chooses one of the candidates as the final hypothesis.

To make these ideas more precise, we define the training error $\hat{\epsilon}(h) = \hat{\epsilon}_S(h) = |\{\langle x_i, b_i \rangle \in S : h(x_i) \neq b_i\}|/m$, and the version space $VS(d) = VS_S(d) = \{h \in F_d : \hat{\epsilon}(h) = \min_{h' \in F_d}\{\hat{\epsilon}(h')\}\}$. Note that $VS(d)$ may contain more than one function in $F_d$; several functions may minimize the training error. If we are lucky, we have in our possession a (possibly randomized) learning algorithm $L$ that takes as input any sample $S$ and any complexity value $d$, and outputs a member $\tilde{h}_d$ of $VS(d)$ (using some unspecified criterion to break ties if $|VS(d)| > 1$). More generally, it may be the case that finding any function in $VS(d)$ is intractable, and that $L$ is simply a heuristic (such as backpropagation or ID3) that does the best job it can at finding $\tilde{h}_d \in F_d$ with small training error on input $S$ and $d$. In either case, we define $\tilde{h}_d = L(S, d)$ and $\hat{\epsilon}(d) = \hat{\epsilon}_{L,S}(d) = \hat{\epsilon}(\tilde{h}_d)$. Note that we expect $\hat{\epsilon}(d)$, like $\epsilon_{opt}(d)$, to be a non-increasing function of $d$: by going to a larger complexity, we can only reduce our training error.
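As a concrete toy illustration of the training error and version space definitions, consider the following sketch. The threshold target, the small finite set of candidate hypotheses standing in for a class $F_d$, and the sample size are all hypothetical, chosen only to make the example self-contained.

```python
import random

random.seed(1)

# Hypothetical instance: a threshold target, and a small finite set F of
# threshold hypotheses standing in for one class F_d.
target = lambda x: x >= 0.5
F = [lambda x, t=t: x >= t for t in (0.1, 0.3, 0.5, 0.7)]

# Draw a noise-free sample S = <x_1, b_1>, ..., <x_m, b_m>, with D uniform.
m = 1000
S = [(x, target(x)) for x in (random.random() for _ in range(m))]

def train_err(h, S):
    """eps_hat(h) = |{<x_i, b_i> in S : h(x_i) != b_i}| / m."""
    return sum(h(x) != b for x, b in S) / len(S)

errs = [train_err(h, S) for h in F]
best = min(errs)
# The version space: every hypothesis in F attaining the minimum training
# error. Here only the t = 0.5 threshold fits the sample perfectly.
VS = [h for h, e in zip(F, errs) if e == best]
print(best, len(VS))  # prints: 0.0 1
```

A learning algorithm $L$ in the sense above would output some member of this version space, breaking ties by an unspecified criterion whenever it contains more than one function.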
We can now give a precise statement of the model selection problem. First of all, an instance of the model selection problem consists of a tuple $(\{F_d\}, f, D, L)$, where $\{F_d\}$ is the hypothesis function class sequence, $f$ is the target function, $D$ is the input distribution, and $L$ is the underlying learning algorithm. The model selection problem is then: Given the sample $S$, and the sequence of functions $\tilde{h}_1 = L(S, 1), \ldots, \tilde{h}_d = L(S, d), \ldots$ determined by the learning algorithm $L$, select a complexity value $\tilde{d}$ such that $\tilde{h}_{\tilde{d}}$ minimizes the resulting generalization error. Thus, a model selection algorithm is given both the sample $S$ and the sequence of (increasingly complex) hypotheses derived by $L$ from $S$, and must choose one of these hypotheses.

The current formalization suffices to motivate a key definition and a discussion of the fundamental issues in model selection. We define $\epsilon(d) = \epsilon_{L,S}(d) = \epsilon(\tilde{h}_d)$. Thus, $\epsilon(d)$ is a random variable (determined by the random variable $S$) that gives the true generalization error of the function $\tilde{h}_d$ chosen by $L$ from the class $F_d$. Of course, $\epsilon(d)$ is not directly accessible to a model selection algorithm; it can only be estimated or guessed in various ways from the sample $S$. A simple but important observation is that no model selection algorithm can achieve generalization error less than $\min_d\{\epsilon(d)\}$. Thus the behavior of the function $\epsilon(d)$, especially the location and value of its minimum, is in some sense the essential quantity of interest in model selection.

The prevailing folk wisdom in several research communities posits that $\epsilon(d)$ will typically have a global minimum that is nontrivial, that is, at an "intermediate" value of $d$ away from the extremes $d = 0$ and $d = m$. As a demonstration of the validity of this view, and as an introduction to a particular model selection problem that we will examine in our experiments, we call the reader's attention to Figure 1. In this model selection problem (which we shall refer to as the intervals model selection problem), the input domain is simply the real line segment $[0, 1]$, and the hypothesis class $F_d$ is simply the class of all boolean functions over $[0, 1]$ in which we allow at most $d$ alternations of label; thus $F_d$ is the class of all binary step functions with at most $d/2$ steps. For the experiments, the underlying learning algorithm $L$ that we have implemented performs training error minimization. This is a rare case where efficient minimization is possible; we have developed an algorithm based on dynamic programming that runs in linear time, thus making experiments on large samples feasible. The sample $S$ was generated using the target function in $F_{100}$ that divides $[0, 1]$ into 100 segments of equal