Bias Plus Variance Decomposition for Zero-One Loss Functions

To app ear in Machine Learning Pro ceedings of the Thirteenth International Conference Bias Plus Variance Decomp osition for ZeroOne Loss Functions Ron Kohavi David H Wolp ert Data Mining and Visualization Silicon Graphics Inc The Santa Fe Institute N Shoreline Blvd Hyde Park Rd Mountain View CA Santa Fe NM ronnyksgicom dhwsantafeedu Abstract Squared bias This quantity measures how closely the learning algorithms average guess over all p ossible training sets of the given training set size We present a biasvariance decomp osition matches the target of exp ected misclassication rate the most Variance This quantity measures howmuch the commonly used loss function in sup ervised learning algorithms guess b ounces around for classication learning The biasvariance decomp osition for quadratic loss functions the dierent training sets of the given size is well known and serves as an imp ortant to ol for analyzing learning algorithms yet In addition to the intuitive insight the bias and vari no decomp osition was oered for the more ance decomp osition provides it has several other use commonly used zeroone misclassication ful attributes Chief among these is the fact that there loss functions until the recentwork of Kong is often a biasvariance tradeo Often as one mo di Dietterich and Breiman es some asp ect of the learning algorithm it will have Their decomp osition suers from some ma opp osite eects on the bias and the variance For ex jor shortcomings though egpotentially ample usually as one increases the numb er of degrees negativevariance which our decomp osition of freedom in the algorithm the bias shrinks but the avoids We show that in practice the naive variance increases The optimal numb er of degrees frequencybased estimation of the decomp o of freedom as far as exp ected loss is concerned is sition terms is by itself biased and show the numb er of degrees of freedom that optimizes this how to correct for this bias We illustrate tradeo b etween bias and variance the decomp osition on various algorithms and For classication the quadratic loss function is often datasets from the UCI rep ository inappropriate b ecause the class lab els are not numeric In practice an overwhelming ma jority of researchers in the Machine Learning community instead use exp ected Intro duction misclassication rate which is equivalent to the zero one loss Kong Dietterich and Dietterich The bias plus variance decomp osition Geman Bi Kong recently prop osed a biasvariance decom enensto ck Doursat is a p owerful to ol from p osition for zeroone loss functions but their prop osal sampling theory statistics for analyzing sup ervised suers from some ma jor problems such as the p ossibil learning scenarios that have quadratic loss functions ity of negativevariance and only allowing the values As conventionally formulated it breaks the exp ected t zero or one as the bias for a given test p oin cost given a xed target and training set size into the sum of three nonnegative quantities In this pap er weprovide an alternative zeroone loss decomp osition that do es not suer from these prob Intrinsic target noise This quantityisalow er lems and that ob eys the desiderata that bias and vari b ound on the exp ected cost of any learning algo ance should ob ey as discussed in Wolp ert submit rithm It is the exp ected cost of the Bayesoptimal ted This pap er is exp ository b eing primarily con classier cerned with bringing the zeroone loss biasvariance decomp osition to the attention of the Machine Learn The training set d is a set of m pairs of xy values ing community Wedonotmakeany assumptions ab out the distribu tion of pairs In particular our mathematical results After presenting our decomp osition wedescribeaset do not require them to b e generated in an iid in of exp eriments that illustrate the eects of bias and dep endently and identically distributed manner as variance for some common induction algorithms We commonly assumed also discuss a practical problem with estimating the quantities in the decomp osition using the naiveap To assign a p enalty to a pair of values y and y we F H proach of frequency counts the frequencycount esti use the loss function Y Y R In this pap er we mators are biased in a way that dep ends on the train consider the zeroone loss function dened as ing set size We showhow to correct the estimators so that they are unbiased y y y y F H F H where y y if y y and otherwise Denitions F H F H The cost C is a realvalued random variable dened We use the following notation as the loss over the random variables Y and Y So F H the exp ected cost is The underlying spaces X E C y y P y y H F H F Let X and Y b e the input and output spaces resp ec y y H F tively with cardinalities jX j and jY j and elements x and y resp ectively To maintain consistency with For zeroone loss the cost is usually referred to as planned extensions of this pap er we assume that b oth misclassication rate and is derived as follows X and Y are countable However this assumption is not needed for this pap er provided all sums are re X y y P y y E C H F H F placed byintegrals In classication problems Y is y y H F usually a small nite set X P Y Y y H F The target f is a conditional probability distribution y Y P Y y j x where Y is a Y valued random F F F variable Unless explicitly stated otherwise we assume that the target is xed As an example if the target The notation used here is a simplied version of is a noisefree function from X to Y forany xed x the extended Bayesian formalism EBF describ ed in wehave P Y y j xforonevalue of y and F F F Wolp ert In particular the results of this pa for all others p er do not dep end on howthe X values in the test set are determined so there is no need to dene a random The hyp othesis h generated by a learning algorithm variable for those X values as is done in the full EBF is a similar distribution P Y y j x where Y H H H is a Y valued random variable As an example if the hyp othesis is a singlevalued function from X to Y as Bias Plus Variance for ZeroOne it is for many classiers eg decision trees nearest Loss neighb ors then P Y y j x for one value of H H y and for all others H Wenow showhow to decomp ose the exp ected cost We will drop the explicitly delineated random variables into its comp onents and then provide geometric views from the probabilities when the context is clear For of this decomp osition in particular relating it to example P y will b e used instead of P Y y H H H quadratic loss in Euclidean spaces Prop osition Y and Y areconditional ly indepen F H The Decomp osition dent given f and a test point x We present the general result involving the exp ected Proof P y y j f x P y j y fxP y j f x F H F H H zeroone loss E C where the implicit conditioning P y j f xP y j f x F H event is arbitrary Then we sp ecialize to the stan The last equality is true b ecause by denition y F dard conditioning used in conventional sampling the dep ends only on the target f and the test p oint x ory statistics a single test p oint target and training set size to b e explicit To b etter understand these formulas note that P Y y j x is the probability after any X F E C P Y Y y From equation noise is taken into account that the xed target takes H F on the value y at p oint xTo understand the quantity y Y P Y y j x one must write it in full as X X H P Y Y y P Y y P Y y H F H F P Y y j f m x H y Y y Y X h X P d j f m xP Y y j d f m x H P Y y P Y y P Y y F H F d y Y X i P d j f mP Y y j d x H P Y y H d In this expression P d j f m is the probability X X P Y y P Y y H F of generating training set d from the target f and y Y y Y P Y y j d x is the probability that the learning H algorithm makes guess y for p oint x in resp onse to the training set dSoP Y y j xistheaverage over H Rearranging the terms wehave E C training sets generated from f Y value guessed by the X P Y y P Y y H F learning algorithm for p oint x y Y Note that while the quadratic loss decomp osition in P Y Y y covariance F H volves quadratic terms and the log loss decomp o X sition involves logarithmic terms Wolp ert submit P Y y P Y y bias F H ted our zeroone loss decomp osition do es not involve y Y Kronecker delta terms but rather involves quadratic X terms P Y y variance H y Y Our denitions of bias variance and noise ob ey some appropriate desiderata including X P Y y F y Y The bias term measures the squared dierence between the targets average output and the al gorithms average output It is a realvalued non In this pap er we are interested in E C j f m the negative quantity and equals zero only if P Y exp ected cost where the target is xed and one aver F ages over training sets of size m One waytoevaluate y j x P Y y j xforallx and y These P H this quantity is to write it as P xE C j f m x x prop erties are shared bybias for quadratic loss and then use Equation to get E C j f m x By The variance term measures the variability Prop osition y and y are indep endent when one H F over Y ofP Y j x It is a realvalued non conditions on f and x hence the covariance term H H vanishes So negative quantity and equals zero for an algorithm X that always makes the same guess regardless of E C P x bias variance x x x the training set eg the Bayes optimal classi x er As the algorithm b ecomes more sensitiveto where changes in the training set the variance increases X Moreover given a distribution

Load more