
To appear in Proceedings of the Thirteenth International Conference

Bias Plus Variance Decomposition for Zero-One Loss Functions

Ron Kohavi
Data Mining and Visualization
Silicon Graphics, Inc.
N. Shoreline Blvd.
Mountain View, CA
ronnyk@sgi.com

David H. Wolpert
The Santa Fe Institute
Hyde Park Rd.
Santa Fe, NM
dhw@santafe.edu

Abstract

We present a bias-variance decomposition of expected misclassification rate, the most commonly used loss function in supervised classification learning. The bias-variance decomposition for quadratic loss functions is well known and serves as an important tool for analyzing learning algorithms, yet no decomposition was offered for the more commonly used zero-one (misclassification) loss functions until the recent work of Kong & Dietterich and Breiman. Their decomposition suffers from some major shortcomings, though (e.g., potentially negative variance), which our decomposition avoids. We show that, in practice, the naive frequency-based estimation of the decomposition terms is by itself biased, and show how to correct for this bias. We illustrate the decomposition on various algorithms and datasets from the UCI repository.

Introduction

The bias plus variance decomposition (Geman, Bienenstock & Doursat) is a powerful tool from sampling theory statistics for analyzing supervised learning scenarios that have quadratic loss functions. As conventionally formulated, it breaks the expected cost, given a fixed target and training set size, into the sum of three non-negative quantities:

Intrinsic "target noise": This quantity is a lower bound on the expected cost of any learning algorithm. It is the expected cost of the Bayes-optimal classifier.

Squared bias: This quantity measures how closely the learning algorithm's average guess (over all possible training sets of the given training set size) matches the target.

Variance: This quantity measures how much the learning algorithm's guess bounces around for the different training sets of the given size.

In addition to the intuitive insight the bias and variance decomposition provides, it has several other useful attributes. Chief among these is the fact that there is often a bias-variance tradeoff. Often, as one modifies some aspect of the learning algorithm, it will have opposite effects on the bias and the variance. For example, usually as one increases the number of degrees of freedom in the algorithm, the bias shrinks but the variance increases. The optimal number of degrees of freedom (as far as expected loss is concerned) is the number that optimizes this tradeoff between bias and variance.

For classification, the quadratic loss function is often inappropriate because the class labels are not numeric. In practice, an overwhelming majority of researchers in the Machine Learning community instead use expected misclassification rate, which is equivalent to the zero-one loss (Kong & Dietterich; Dietterich & Kong). Kong & Dietterich recently proposed a bias-variance decomposition for zero-one loss functions, but their proposal suffers from some major problems, such as the possibility of negative variance and only allowing the values zero or one as the bias for a given test point.

In this paper we provide an alternative zero-one loss decomposition that does not suffer from these problems and that obeys the desiderata that bias and variance should obey, as discussed in Wolpert (submitted). This paper is expository, being primarily concerned with bringing the zero-one loss bias-variance decomposition to the attention of the Machine Learning community.

After presenting our decomposition, we describe a set of experiments that illustrate the effects of bias and variance for some common induction algorithms. We also discuss a practical problem with estimating the quantities in the decomposition using the naive approach of frequency counts: the frequency-count estimators are biased in a way that depends on the training set size. We show how to correct the estimators so that they are unbiased.

Definitions

We use the following notation.

The underlying spaces. Let X and Y be the input and output spaces, respectively, with cardinalities |X| and |Y| and elements x and y, respectively. To maintain consistency with planned extensions of this paper, we assume that both X and Y are countable; however, this assumption is not needed for this paper, provided all sums are replaced by integrals. In classification problems, Y is usually a small finite set.

The target f is a conditional probability distribution, P(Y_F = y_F | x), where Y_F is a Y-valued random variable. Unless explicitly stated otherwise, we assume that the target is fixed. As an example, if the target is a noise-free function from X to Y, then for any fixed x we have P(Y_F = y_F | x) = 1 for one value of y_F and 0 for all others.

The hypothesis h, generated by a learning algorithm, is a similar distribution, P(Y_H = y_H | x), where Y_H is a Y-valued random variable. As an example, if the hypothesis is a single-valued function from X to Y, as it is for many classifiers (e.g., decision trees, nearest-neighbors), then P(Y_H = y_H | x) = 1 for one value of y_H and 0 for all others.

We will drop the explicitly delineated random variables from the probabilities when the context is clear. For example, P(y_H) will be used instead of P(Y_H = y_H).

The training set d is a set of m pairs of (x, y) values. We do not make any assumptions about the distribution of pairs. In particular, our mathematical results do not require them to be generated in an i.i.d. (independently and identically distributed) manner, as is commonly assumed.

To assign a penalty to a pair of values y_F and y_H, we use a loss function L: Y x Y -> R. In this paper we consider the zero-one loss function, defined as

\[ L(y_F, y_H) = 1 - \delta(y_F, y_H), \]

where \delta(y_F, y_H) = 1 if y_F = y_H and 0 otherwise.

The cost C is a real-valued random variable defined as the loss over the random variables Y_F and Y_H. So the expected cost is

\[ E(C) = \sum_{y_H, y_F} L(y_H, y_F)\, P(y_H, y_F). \]

For zero-one loss, the cost is usually referred to as the misclassification rate, and is derived as follows:

\[ E(C) = \sum_{y_H, y_F} \bigl[1 - \delta(y_H, y_F)\bigr]\, P(y_H, y_F) = 1 - \sum_{y \in Y} P(Y_H = Y_F = y). \qquad (1) \]

(Footnote: The notation used here is a simplified version of the extended Bayesian formalism (EBF) described in Wolpert. In particular, the results of this paper do not depend on how the X values in the test set are determined, so there is no need to define a random variable for those X values, as is done in the full EBF.)

Bias Plus Variance for Zero-One Loss

We now show how to decompose the expected cost into its components, and then provide geometric views of this decomposition, in particular relating it to quadratic loss in Euclidean spaces.

The Decomposition

We present the general result involving the expected zero-one loss E(C), where the implicit conditioning event is arbitrary. Then we specialize to the standard conditioning used in conventional sampling theory statistics: a single test point, target, and training set size.

Proposition 1. Y_F and Y_H are conditionally independent given f and a test point x.

Proof. P(y_F, y_H | f, x) = P(y_F | y_H, f, x) P(y_H | f, x) = P(y_F | f, x) P(y_H | f, x). The last equality is true because, by definition, y_F depends only on the target f and the test point x.

From Equation (1), and assuming (as Proposition 1 guarantees for the conditioning events we will use) that Y_H and Y_F are independent given the conditioning event,

\[ E(C) = 1 - \sum_{y \in Y} P(Y_H = Y_F = y) = 1 - \sum_{y \in Y} P(Y_H = y)\, P(Y_F = y). \]

Rearranging the terms, we have

\[ E(C) = \frac{1}{2} \sum_{y \in Y} \bigl[P(Y_F = y) - P(Y_H = y)\bigr]^2
 + \frac{1}{2} \Bigl[1 - \sum_{y \in Y} P(Y_H = y)^2\Bigr]
 + \frac{1}{2} \Bigl[1 - \sum_{y \in Y} P(Y_F = y)^2\Bigr], \qquad (2) \]

where the three successive terms are the squared bias, the variance, and the intrinsic noise.

In this paper we are interested in E(C | f, m), the expected cost where the target is fixed and one averages over training sets of size m. One way to evaluate this quantity is to write it as \sum_x P(x) E(C | f, m, x) and then use Equation (2) to get E(C | f, m, x). By Proposition 1, y_H and y_F are independent when one conditions on f and x; hence the covariance term vanishes. So

\[ E(C | f, m) = \sum_x P(x) \bigl[\sigma^2_x + \mathrm{bias}^2_x + \mathrm{variance}_x\bigr], \qquad (3) \]

where

\[ \mathrm{bias}^2_x \equiv \frac{1}{2} \sum_{y \in Y} \bigl[P(Y_F = y \mid x) - P(Y_H = y \mid x)\bigr]^2, \]
\[ \mathrm{variance}_x \equiv \frac{1}{2} \Bigl[1 - \sum_{y \in Y} P(Y_H = y \mid x)^2\Bigr], \]
\[ \sigma^2_x \equiv \frac{1}{2} \Bigl[1 - \sum_{y \in Y} P(Y_F = y \mid x)^2\Bigr]. \]

To simplify the exposition, the f and m in the conditioning events are still implicit, even though x needs to be explicit.

To better understand these formulas, note that P(Y_F = y | x) is the probability, after any noise is taken into account, that the fixed target takes on the value y at point x. To understand the quantity P(Y_H = y | x), one must write it in full as

\[ P(Y_H = y \mid f, m, x) = \sum_d P(d \mid f, m, x)\, P(Y_H = y \mid d, f, m, x) = \sum_d P(d \mid f, m)\, P(Y_H = y \mid d, x). \qquad (4) \]

In this expression, P(d | f, m) is the probability of generating training set d from the target f, and P(Y_H = y | d, x) is the probability that the learning algorithm makes guess y for point x in response to the training set d. So P(Y_H = y | x) is the average, over training sets generated from f, of the Y value guessed by the learning algorithm for point x.

Note that while the quadratic loss decomposition involves quadratic terms and the log loss decomposition involves logarithmic terms (Wolpert, submitted), our zero-one loss decomposition does not involve Kronecker delta terms, but rather involves quadratic terms.
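To make these formulas concrete, here is a small Python sketch (our own illustration, not code from the paper; NumPy is assumed) that computes bias²_x, variance_x, and σ²_x from the two conditional distributions at a test point x and checks that they sum to the expected zero-one cost there:

    import numpy as np

    def zero_one_decomposition(p_f, p_h):
        """Per-point terms of the zero-one loss decomposition.

        p_f : array of P(Y_F = y | x) over the classes y (the target distribution).
        p_h : array of P(Y_H = y | x) over the classes y (the learner's
              training-set-averaged guess distribution).
        Returns (noise, bias2, variance); their sum equals the expected
        zero-one cost 1 - sum_y P(Y_F = y | x) P(Y_H = y | x).
        """
        p_f, p_h = np.asarray(p_f, float), np.asarray(p_h, float)
        noise = 0.5 * (1.0 - np.sum(p_f ** 2))      # sigma^2_x
        bias2 = 0.5 * np.sum((p_f - p_h) ** 2)      # bias^2_x
        variance = 0.5 * (1.0 - np.sum(p_h ** 2))   # variance_x
        return noise, bias2, variance

    # Example: a three-class problem at a single test point x.
    p_f = [0.7, 0.2, 0.1]          # target distribution at x
    p_h = [0.5, 0.4, 0.1]          # average guess of the learner at x
    noise, bias2, variance = zero_one_decomposition(p_f, p_h)
    expected_cost = 1.0 - np.dot(p_f, p_h)
    assert np.isclose(noise + bias2 + variance, expected_cost)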

Our definitions of bias, variance, and noise obey some appropriate desiderata, including:

- The bias term measures the squared difference between the target's average output and the algorithm's average output. It is a real-valued non-negative quantity, and it equals zero only if P(Y_F = y | x) = P(Y_H = y | x) for all x and y. These properties are shared by bias for quadratic loss.

- The variance term measures the variability over Y_H of P(Y_H | x). It is a real-valued non-negative quantity, and it equals zero for an algorithm that always makes the same guess regardless of the training set (e.g., the Bayes-optimal classifier). As the algorithm becomes more sensitive to changes in the training set, the variance increases. Moreover, given a distribution over training sets, the variance only measures the sensitivity of the learning algorithm to changes in the training set and is independent of the underlying target. This property is shared by variance for quadratic loss.

- The noise measures the variance of the target, in that the definitions of variance and noise are identical except for the interchange of Y_F and Y_H. In addition, the noise is independent of the learning algorithm. This property is shared by noise for quadratic loss.

In contrast to our definition of bias, the definitions of bias for a fixed target and a given instance suggested in Kong & Dietterich, Dietterich & Kong, and Breiman are only two-valued; for binary classification they cannot quantify subtler levels of mismatch between a learning algorithm and a target. However, their decompositions have the advantage that their bias is zero for the Bayes-optimal classifier, while ours may not be.

The major distinction between the decompositions arises in the variance term. All of the desiderata for variance listed above are violated by the decompositions proposed by Kong & Dietterich, Dietterich & Kong, and Breiman. Specifically, in their definitions the variance can be negative and is not minimized by a classifier that ignores the training set (Kong & Dietterich note this shortcoming explicitly). The following example illustrates the phenomenon for the decomposition suggested by Breiman. Assume a noise-free target in which heads is slightly more likely than tails, and consider an x for which the target has the value tails: the average error of a majority classifier will be slightly above one half, yet the probability of error for the aggregate majority classifier will be one. This causes the variance to be negative.

Another advantage of our decomposition is that its terms are a continuous function of the target. An infinitesimal change in the target which changes the class most commonly predicted by the learning algorithm for a given x will not cause a large change in our bias, variance, or noise terms. In contrast, the other definitions of bias and variance do not share this property.

Relation to the Quadratic Decomposition

To relate the zero-one loss decomposition to the more familiar quadratic loss decomposition, let \hat{Y}_F be an R^{|Y|}-valued random variable restricted so that exactly one of its components equals 1 and all others equal 0; that single non-zero component is the one with index y_F. So \hat{Y}_F is Y_F re-expressed as a vector on the unit hypercube. Define \hat{Y}_H similarly, in terms of Y_H.

Since \hat{Y}_F and \hat{Y}_H are real-valued, we can define quadratic loss over them. In particular, recalling that squaring a vector is taking its dot product with itself, we have

\[ (\hat{Y}_F - \hat{Y}_H)^2 = \begin{cases} 0 & \text{if } y_F = y_H, \\ 2 & \text{if } y_F \neq y_H. \end{cases} \]

Accordingly,

\[ E\bigl[(\hat{Y}_F - \hat{Y}_H)^2 \mid f, m, x\bigr] = 2\, P(y_H \neq y_F \mid f, m, x) = 2\, E_{0\text{-}1\ \mathrm{loss}}(C \mid f, m, x). \]

So by transforming the Y's to vectors of indicator variables, we see that the expected cost for zero-one loss is one half the associated quadratic loss.

The Bias-Variance Decomposition in Vector Form

We can rewrite Equation (3) in vector notation that may give a better geometrical interpretation to the decomposition. Define F ≡ P(y_F), H ≡ P(y_H), and FH ≡ P(y_F, y_H). F and H are vectors in R^{|Y|} with their components indexed by Y values; FH is a matrix in R^{|Y|} x R^{|Y|}. If we denote dot products by "·" and the dot product of a vector with itself by squaring it, then

\[ \mathrm{covariance} \equiv \mathrm{Tr}(FH) - F \cdot H \quad (\mathrm{Tr}\ \text{is the trace}), \]
\[ \mathrm{bias}^2_x = \tfrac{1}{2}(F - H)^2, \qquad \mathrm{variance}_x = \tfrac{1}{2}(1 - H^2), \qquad \sigma^2_x = \tfrac{1}{2}(1 - F^2). \]

In this notation Equation (1) reads E(C) = 1 - Tr(FH), so E(C) = σ²_x + bias²_x + variance_x - covariance; when Y_F and Y_H are conditionally independent (Proposition 1), Tr(FH) = F · H, the covariance vanishes, and Equation (3) is recovered.
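As a quick numerical illustration of the re-expressions used in the last two subsections (again our own sketch, not the paper's code), one can draw Y_F and Y_H independently, map the draws to one-hot vectors on the unit hypercube, and check that the expected quadratic loss is twice the expected zero-one loss:

    import numpy as np

    rng = np.random.default_rng(0)
    p_f = np.array([0.7, 0.2, 0.1])   # P(Y_F = y | x)
    p_h = np.array([0.5, 0.4, 0.1])   # P(Y_H = y | x)

    n = 200_000
    y_f = rng.choice(3, size=n, p=p_f)     # draws of Y_F
    y_h = rng.choice(3, size=n, p=p_h)     # independent draws of Y_H
    zero_one = np.mean(y_f != y_h)         # estimate of E(C) under zero-one loss

    # One-hot ("unit hypercube") re-expression of the same draws.
    eye = np.eye(3)
    quadratic = np.mean(np.sum((eye[y_f] - eye[y_h]) ** 2, axis=1))

    # The expected quadratic loss is twice the expected zero-one loss.
    print(zero_one, quadratic / 2)   # the two numbers agree up to sampling noise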

Experimental Methodology

We begin with a description of our experimental methodology, and then discuss a problem with the naive estimation of the terms in our decomposition by using frequency counts.

Our frequency counts experiments

To investigate the behavior of the terms in our decomposition, we ran a set of experiments on UCI repository datasets (Murphy & Aha). In each of those experiments, for a given dataset and a given learning algorithm, we estimated the x-average of bias², variance, intrinsic noise, and overall error as follows (a schematic sketch of the procedure is given after the list):

1. We randomly divided each dataset into two parts, D and E. D was used as if it were the "world" from which we sample training sets, while E was used to evaluate the terms in the decomposition. This idea is similar to the bootstrap-world idea in Efron & Tibshirani.

2. We generated N training sets from D, each using uniform random sampling without replacement. Note that our decomposition does not require the training sets to be sampled in any specific manner (see the Definitions section). To get training sets of size m, we chose D to be of size 2m. This allows for \binom{2m}{m} different possible training sets — enough to guarantee that we will not get many duplicate training sets in our set of N training sets, even for small values of m.

3. We ran the learning algorithm on each of the training sets and estimated the terms in Equations (3) and (4) using the generated classifier for each point x in the evaluation set E. Equation (4) was used to estimate P(y_H | f, m, x). At first, all these terms were estimated using frequency counts.
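A schematic sketch of this estimation loop is given below. It is our own illustration under stated assumptions: a scikit-learn-style fit/predict learner supplied through a hypothetical make_learner callable, integer class labels 0..n_classes-1, and the convention (discussed later in this section) of treating every evaluation instance as unique, so that σ² is taken to be zero. It produces the naive frequency-count estimates; the correction derived in the next subsection is not applied here.

    import numpy as np

    def estimate_terms(D_x, D_y, E_x, E_y, make_learner, m, N, n_classes, rng):
        """Naive frequency-count estimates of x-averaged bias^2, variance and error.

        D_x, D_y : the "world" from which training sets are sampled.
        E_x, E_y : the evaluation set (integer labels assumed).
        make_learner : callable returning a fresh classifier with fit/predict.
        """
        counts = np.zeros((len(E_x), n_classes))   # frequency counts for P(Y_H = y | x)
        err = 0.0
        for _ in range(N):
            idx = rng.choice(len(D_x), size=m, replace=False)   # uniform sampling w/o replacement
            clf = make_learner().fit(D_x[idx], D_y[idx])
            pred = clf.predict(E_x)
            counts[np.arange(len(E_x)), pred] += 1
            err += np.mean(pred != E_y)
        p_h = counts / N                            # estimate of P(Y_H = y | x)
        p_f = np.eye(n_classes)[E_y]                # point-mass target per instance: sigma^2 = 0
        bias2 = 0.5 * np.mean(np.sum((p_f - p_h) ** 2, axis=1))
        variance = 0.5 * np.mean(1.0 - np.sum(p_h ** 2, axis=1))
        return bias2, variance, err / N             # naive (uncorrected) estimates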

Figure 1 (left) shows the estimate of bias² for different values of N when ID3 (Quinlan) was executed on three datasets from the UCI repository. It is clear that our estimate of bias² using frequency counts shrinks as we increase N. Since infinite N gives the correct value of bias², this implies that for any finite N the bias² estimate is always itself biased upwards. The variance term exhibits the opposite behavior: it is always biased low. The right-hand side of Figure 1 shows the behavior with the corrected estimators we propose below.

[Figure 1: The uncorrected (left) and corrected (right) estimates of bias² for ID3 on the LED-24, Satimage, and Soybean datasets, plotted against the number of internal replicates (log scale).]

Unbiased estimators of bias and variance

One way to correct these biases in our frequency-count estimators is to use a very large number of training sets. Although such an approach would have been feasible for our experiments, it is not in general: for many combinations of learning problem and learning algorithm, the associated computational requirements are prohibitive.

To see why this biasing behavior arises, fix y in Y and x in X. We will demonstrate that the bias of our frequency-based estimate of bias² holds for any such y and x, which immediately implies that it holds when one averages over y and x. Assume N is even and define n ≡ N/2. Given a set of N training sets, one option is to estimate bias² using two distinct sets of n training sets and then average their estimates, to get an average n-training-set-based estimate. Another option is to average over all N training sets directly, to get an N-training-set-based estimate. What we will show is that, averaged over sets of N = 2n training sets, the former strategy results in a higher estimate of bias² than the latter: the expected value of the estimate of bias² using n training sets is larger than the expected value using 2n training sets. Since the estimate using infinitely many training sets is exactly correct, this means that our estimate of bias² is biased upwards more as fewer training sets are used.

Let i in {1, 2} specify one of the two sets of n training sets we are examining. Define w_i to be the average, over the training sets in set i, of P(Y_H = y | x), i.e., w_i ≡ \sum_j P(Y_H = y | x, d_{ij}) / n, where d_{ij} is the j-th training set in set i. Let T be P(Y_F = y | x). We can now compute the expected difference between the two ways of estimating the bias² — estimating for each set of n training sets and then averaging, minus estimating based on the full set of N = 2n training sets:

\[ E\Bigl[ \tfrac{1}{2} \sum_i (T - w_i)^2 - \bigl(T - \tfrac{1}{2}\textstyle\sum_i w_i\bigr)^2 \Bigr]
 = \tfrac{1}{2} \sum_i E\bigl[(T - w_i)^2\bigr] - E\Bigl[\bigl(T - \tfrac{1}{2}\textstyle\sum_i w_i\bigr)^2\Bigr]. \]

By Jensen's inequality (Cover & Thomas), the first term on the right-hand side is larger than or equal to the second term. This shows that when we average once over 2n training sets, rather than twice over n training sets, we get a smaller estimate of the bias², which establishes the claim. A similar argument holds for the variance, whose estimate gets larger as N grows.

Fortunately, there is a straightforward correction we can apply to our estimators to make them unbiased. Define w_N as the frequency-count estimate for P(Y_H = y | f, m, x) based on the N training sets. Define U_N ≡ (w_N - T)^2, where T was defined above as P(Y_F = y | x). The frequency-count estimate of bias² we used before was U_N; our proposed estimator is

\[ V_N \equiv U_N - \frac{w_N (1 - w_N)}{N - 1}. \]

(Footnote: It is not obvious that one would want an unbiased estimator for the estimated quantities, since that does not necessarily minimize expected error — that is what the bias-variance tradeoff is all about. However, we felt that getting an unbiased estimator would help understand the problem better, and experiments we did indicated that the expected squared-error loss for our proposed unbiased estimator is smaller than that of the estimator based on the naive use of frequency counts.)

Proposition 2. Under the assumption that we know T's value, V_N is an unbiased estimator for bias².

Proof. Since bias² = E(U_∞ | T), we have to show that

\[ E(U_N \mid N, T) - E\Bigl(\frac{w_N(1 - w_N)}{N - 1} \Bigm| N, T\Bigr) = E(U_\infty \mid T). \]

An unbiased estimator for the variance of the probability in an N-sample Bernoulli process is the empirical variance multiplied by N/(N - 1). Furthermore, because T is fixed, the variance of w_N - T is equal to the variance of w_N. Thus we have

\[ \mathrm{var}(w_N - T \mid N, T) = \mathrm{var}(w_N \mid N) = E\Bigl(\frac{w_N(1 - w_N)}{N - 1} \Bigm| N, T\Bigr). \qquad (5) \]

But by the definition of variance we have

\[ \mathrm{var}(w_N - T \mid N, T) = E(U_N \mid N, T) - \bigl[E(w_N - T \mid N, T)\bigr]^2 = E(U_N \mid N, T) - E(U_\infty \mid T). \qquad (6) \]

Equating the right sides of Equations (5) and (6), we have the desired result.

The correction to the variance estimator is simply given by the negative of the correction to the bias² estimator. This can be shown using reasoning similar to the proof just above, but it also follows from the bias-variance decomposition itself and the fact that the frequency-count estimator of the error gives an unbiased estimate of the zero-one loss.

It is important to realize that Proposition 2 implicitly assumes that training sets are formed by i.i.d. sampling of a training-set-generating process. This is true for our experimental methodology: even though each running of our training-set-generating process is required to produce training sets that contain no duplicate (x, y) pairs, for the Proposition to apply it suffices to allow duplicate training sets.
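In the frequency-count setting above, the corrected estimators amount to a small shift of the naive values. A minimal sketch (our own illustration; w_N denotes the observed fraction of the N training sets whose classifier predicts class y at x, and T = P(Y_F = y | x) is assumed known, as in Proposition 2):

    def corrected_bias2_term(w_N, T, N):
        """Unbiased estimate of the per-(y, x) bias^2 contribution (p - T)^2,
        where p = P(Y_H = y | x) and w_N is its frequency-count estimate."""
        U_N = (w_N - T) ** 2                      # naive estimate; biased upwards by var(w_N)
        return U_N - w_N * (1.0 - w_N) / (N - 1)  # subtract an unbiased estimate of var(w_N)

    def corrected_variance_term(w_N, N):
        """Unbiased estimate of the per-(y, x) variance contribution p * (1 - p);
        the correction is the negative of the bias^2 correction."""
        return w_N * (1.0 - w_N) + w_N * (1.0 - w_N) / (N - 1)

Summing the corrected per-class terms over y, halving, and averaging over the evaluation points then gives corrected estimates of the x-averaged bias² and variance.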

We conclude this section with a few comments:

- The assumption underlying Proposition 2, that we know T exactly, is false, since we can only estimate it from the data. However, our estimate of T is a constant, independent of the number of training sets and of the details of the learning algorithm. So errors in our estimate of T are not important.

- A related point is that σ² is very difficult to estimate in practice. Using a frequency-count estimator, the estimate of σ² would be zero if all instances are unique, regardless of the true σ². Since in the UCI datasets almost all instances are unique, we elected to define σ² to be zero by considering all instances to be unique. This can be viewed as a calculational convenience, since we are only concerned with the variation in expected cost as we vary the learning algorithm, and the estimate of σ² is algorithm-independent.

- There is a possibility that V_N will give a negative estimate of bias². This is to be expected, given that V_N is unbiased: if the true bias² equals zero, then for an estimator with variance greater than zero to produce zero as its average estimate (as it must if it is an unbiased estimator), it must sometimes produce negative numbers. This potentially negative estimate should be contrasted with the negative variances accompanying the bias-variance decompositions in Kong & Dietterich and Breiman. Unlike in their decompositions, in our decomposition the true bias² and variance are always non-negative; the potential for a negative value is a reflection of the more problematic aspects of unbiased estimators, rather than of the underlying quantity being estimated. In practice, though, we always had enough training sets that negative bias² estimates never occurred.

The Experiments

We now present experiments demonstrating the bias and variance of induction algorithms for datasets from the UCI repository (Murphy & Aha). The datasets were chosen so that they contain enough instances to ensure accurate estimates of error and to demonstrate the range of potential bias-variance behaviors. Table 1 shows a summary of the datasets we use. We use small training sets to make sure the evaluation set is large: in general, we chose a smaller training-set size for the datasets with fewer instances and a larger one for those with more instances (except DNA, which has many features). We generated N training sets for each learning algorithm and used the unbiased estimators discussed above.

[Table 1: The datasets and their characteristics (number of features, dataset size, and training-set size) for Anneal, Chess, DNA, LED-24, Hypothyroid, Segment, Satimage, Soybean-large, and Tic-tac-toe.]

Varying the Number of Neighbors in Nearest-Neighbor

Figure 2 shows the bias-variance decomposition of the error for the nearest-neighbor algorithm with a varying number of neighbors.

[Figure 2: The bias plus variance decomposition for nearest-neighbor with a varying number of neighbors on the nine datasets of Table 1. The notation NN-k indicates a vote among the k closest neighbors (k = 1, 3, 5).]

In Anneal, Chess, and Tic-tac-toe, increasing the number of neighbors increases the bias and decreases the variance; however, the bias increases much more than the variance decreases, and the overall error increases. In Anneal, the bias more than doubles when going from one neighbor to three neighbors. In Tic-tac-toe, the variance drastically decreases with five neighbors. The reason for this decrease in variance is that instances in Tic-tac-toe vary on one to nine squares; allowing neighbors with up to five differing squares causes NN to let a large portion of the space vote, in effect predicting the majority class (a constant win for X) all the time.

In DNA and LED-24, both bias and variance decrease as the number of neighbors increases.

In Segment and Soybean, both the bias and the variance increase as the number of neighbors is increased. We have observed a general increase in variance when the number of classes is large: Segment has seven classes, and Soybean has even more. Increasing the number of neighbors causes many ties, which are broken arbitrarily, thus increasing the variance.

In Hypothyroid and Satimage, the changes in bias and variance as the number of neighbors is changed are small and almost cancel: for both datasets, increasing from one to three neighbors slightly increases the bias and slightly decreases the variance.

Although most datasets exhibit a bias-variance tradeoff, where one quantity goes up and the other goes down as a parameter of the induction algorithm is varied, we can see examples where both change in the same direction.

Combining Classifiers

There has been a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert; Breiman (a); Perrone; Ali). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction.

Figure 3 shows ID3 versus a combination of aggregated trees. The trees are generated by repeatedly sampling a subset of the training set and running ID3; in contrast to bagging as defined in Breiman (a), the samples used here were generated by uniform sampling without replacement. Two sample sizes are shown: training sets with 90% of the training set and with 70%. Minor variations in the training set may cause a different split, which might change the whole subtree; as a consequence, decision trees are very unstable, and therefore they usually gain by aggregation techniques (Breiman (b)). Note also that the smaller the internal sample, the more bias we potentially add (Gordon & Olshen), but the more different the classifiers will be, leading to a more stable average.

Our results show that in this voting scheme the reduction in error is almost solely due to the reduction in variance. While the bias goes up slightly, especially when the smaller (70%) training samples are used, the reduction in variance is significant and the total error usually decreases. For our datasets, voting the 90% samples always reduced the error; voting the 70% samples reduced the error even more on all datasets but one (Anneal).

[Figure 3: The bias and variance of ID3 and aggregated ID3 with 90% and 70% of the training set. Aggregation increases the bias slightly but stabilizes ID3, thus reducing the variance more.]
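The voting scheme just described is easy to reproduce in outline. The sketch below is our own illustration: scikit-learn's DecisionTreeClassifier (entropy criterion) stands in for ID3, numeric feature matrices with integer labels are assumed, and the number of voting trees is a placeholder; the 90%/70% subsample fractions are the ones shown in Figure 3.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def aggregated_trees(train_x, train_y, test_x, n_trees=25, fraction=0.9, seed=0):
        """Majority vote over trees trained on uniform subsamples (no replacement)."""
        rng = np.random.default_rng(seed)
        m = int(fraction * len(train_x))
        votes = []
        for _ in range(n_trees):
            idx = rng.choice(len(train_x), size=m, replace=False)
            tree = DecisionTreeClassifier(criterion="entropy").fit(train_x[idx], train_y[idx])
            votes.append(tree.predict(test_x))
        votes = np.stack(votes)                     # shape: (n_trees, n_test)
        n_classes = int(votes.max()) + 1
        # Majority prediction per test instance (ties broken by lowest label).
        return np.array([np.bincount(col, minlength=n_classes).argmax()
                         for col in votes.T])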

Summary and Future Work

We presented a bias and variance decomposition for misclassification error, which is equivalent to zero-one loss, and showed that it obeys some desired criteria. This decomposition does not suffer from the shortcomings of the decompositions suggested by Kong & Dietterich and Breiman (b). We showed how estimating the terms in the decomposition using frequency counts leads to biased estimates, and explained how to get unbiased estimators, which we later used. We then gave some examples of the bias-variance tradeoff using two machine learning algorithms and several UCI datasets.

In the future, we plan to investigate many extensions of this work. In particular, we plan to investigate the following topics:

1. The decomposition in Equation (2) holds for essentially any conditioning event. It even holds if the conditioning event is (d, x), as in Bayesian point estimation: we have a bias-variance decomposition for a Bayesian quantity. The decomposition also holds if x is not in the conditioning event, so that no external average over x is required, as it is in this paper. We intend to explore these alternative conditioning events.

2. Since the variance can be directly estimated using the unlabelled instances of the test set, estimating the overall error of a learning algorithm hinges on estimating its bias. We would like to test whether better error estimates can be obtained by using cross-validation-style partitions of the training set to estimate that bias, and using unlabelled instances from the test set to estimate the variance.

The bias and variance decomposition is an extremely useful tool in Statistics that has rarely been utilized before in Machine Learning, because no decomposition existed for misclassification rate. We hope that the proposed decomposition will overcome this problem and thereby help us improve our understanding of supervised learning.

Acknowledgments

We thank Jerry Friedman and Tom Dietterich for many discussions on the bias-variance tradeoff. The first author thanks Silicon Graphics, Inc. The second author thanks the Santa Fe Institute and TXN Inc.

References

Ali, K. M. Learning Probabilistic Relational Concept Descriptions. PhD thesis, University of California, Irvine. Available at http://www.ics.uci.edu/~ali

Breiman, L. (a). Bagging predictors. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. (b). Heuristics of instability in model selection. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. Bias, variance, and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley. Available at http://www.stat.Berkeley.EDU/users/breiman

Cover, T. M. & Thomas, J. A. Elements of Information Theory. John Wiley & Sons, Inc.

Dietterich, T. G. & Kong, E. B. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

Efron, B. & Tibshirani, R. An Introduction to the Bootstrap. Chapman & Hall.

Geman, S., Bienenstock, E. & Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation.

Gordon, L. & Olshen, R. Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis.

Kong, E. B. & Dietterich, T. G. Error-correcting output coding corrects bias and variance. In A. Prieditis & S. Russell (eds.), Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann Publishers, Inc.

Murphy, P. M. & Aha, D. W. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn

Perrone, M. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Physics Dept.

Quinlan, J. R. Induction of decision trees. Machine Learning. Reprinted in Shavlik and Dietterich (eds.), Readings in Machine Learning.

Wolpert, D. (submitted). On bias plus variance. SFI TR; available as biasplus.ps at ftp://ftp.santafe.edu/pub/dhw/

Wolpert, D. H. Stacked generalization. Neural Networks.

Wolpert, D. H. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In D. H. Wolpert (ed.), The Mathematics of Generalization. Addison-Wesley.