On the Boosting Ability of Top-Down
Decision Tree Learning Algorithms

Michael Kearns Yishay Mansour

AT&T Research Tel-Aviv University

May 1996

Abstract

We analyze the performance of top-down algorithms for decision tree learning, such as those employed by the widely used C4.5 and CART software packages. Our main result is a proof that such algorithms are boosting algorithms. By this we mean that if the functions that label the internal nodes of the decision tree can weakly approximate the unknown target function, then the top-down algorithms we study will amplify this weak advantage to build a tree achieving any desired level of accuracy. The bounds we obtain for this amplification show an interesting dependence on the splitting criterion used by the top-down algorithm. More precisely, if the functions used to label the internal nodes have error 1/2 − γ as approximations to the target function, then for the splitting criteria used by CART and C4.5, trees of size (1/ε)^{O(1/γ²ε²)} and (1/ε)^{O(log(1/ε)/γ²)} respectively suffice to drive the error below ε. Thus, for example, a small constant advantage over random guessing is amplified to any larger constant advantage with trees of constant size. For a new splitting criterion suggested by our analysis, the much stronger bound of (1/ε)^{O(1/γ²)}, which is polynomial in 1/ε, is obtained, and this bound is provably optimal for decision tree algorithms. The differing bounds have a natural explanation in terms of concavity properties of the splitting criterion.

The primary contribution of this work is in proving that some popular and empirically successful heuristics that are based on first principles meet the criteria of an independently motivated theoretical model.



A preliminary version of this paper appears in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459-468, ACM Press, 1996. Authors' addresses: M. Kearns, AT&T Research, 600 Mountain Avenue, Room 2A-423, Murray Hill, New Jersey 07974; electronic mail [email protected]. Y. Mansour, Department of Computer Science, Tel Aviv University, Tel Aviv, Israel; electronic mail [email protected]. Y. Mansour was supported in part by the Israel Science Foundation, administered by the Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology.

1 Introduction

In experimental and applied work, it is hard to exaggerate the influence of top-down heuristics for building a decision tree from labeled sample data. These algorithms grow a tree from the root to the leaves by repeatedly replacing an existing leaf with an internal node, thus "splitting" the data reaching this leaf into two new leaves, and reducing the empirical error on the given sample. The tremendous popularity of such programs (which include the C4.5 and CART software packages [16, 3]) is due to their efficiency and simplicity, the advantages of using decision trees (such as potential interpretability to humans), and of course, to their success in generating trees with good generalization ability (that is, performance on new data). Dozens of papers describing experiments and applications involving top-down decision tree learning algorithms appear in the machine learning literature each year [1].

There has been difficulty in finding natural theoretical models that provide a precise yet useful language in which to discuss the performance of top-down decision tree learning heuristics, to compare variants of these heuristics, and to compare them to learning algorithms that use entirely different representations and approaches. For example, even if we make the rather favorable assumption that there is a small decision tree labeling the data, and that the inputs are distributed uniformly (thus, we are in the well-known Probably Approximately Correct or PAC model [18], with the additional restriction to the uniform distribution), the problem of finding any efficient algorithm with provably nontrivial performance remains an elusive open problem in computational learning theory. Furthermore, superpolynomial lower bounds for this same problem have been proven for a wide class of algorithms that includes the top-down decision tree approach, and also all variants of this approach that have been proposed to date [2]. The positive results for efficient decision tree learning in computational learning theory all make extensive use of membership queries [14, 5, 4, 11], which provide the learning algorithm with black-box access to the target function (experimentation), rather than only an oracle for random examples. Clearly, the need for membership queries severely limits the potential application of such algorithms, and they seem unlikely to encroach on the popularity of the top-down decision tree algorithms without significant new ideas. In summary, it seems fair to say that despite their other successes, the models of computational learning theory have not yet provided significant insight into the apparent empirical success of programs like C4.5 and CART.

In this paper, we attempt to remedy this state of affairs by analyzing top-down decision tree learning algorithms in the model of weak learning [17, 10, 9]. In the language of weak learning, we prove here that the standard top-down decision tree algorithms are in fact boosting algorithms. By this we mean that if we make a favorable (and apparently necessary) assumption about the relationship between the class of functions labeling the decision tree nodes and the unknown target function (namely, that these node functions perform slightly better than random guessing as approximations to the target function), then the top-down algorithms will eventually drive the error below any desired value. In other words, the top-down algorithms will amplify slight advantages over random guessing into arbitrarily accurate hypotheses. Our results imply, for instance, that if there is always a node function whose error with respect to the target function (on an appropriate distribution) is a nonzero constant less than 1/2 (say, 49% error), then the top-down algorithms will drive the error below any desired constant (for instance, 1% error) using only constant tree size. For certain top-down algorithms, the error can be driven below ε using a tree whose size is polynomial in 1/ε. (See Theorem 1 for a precise statement.) Thus, even though the top-down heuristics and the notion of weak learning were developed independently, we prove that these heuristics do in fact achieve nontrivial performance in the weak learning model. This is perhaps the most significant aspect of our results: the proof that some popular and empirically successful algorithms that are based on first principles meet the criteria of an independently motivated theoretical model.

An added benefit of our ability to analyze the top-down algorithms in a standard theoretical model is that it allows comparisons between variants. In particular, the choice of the splitting criterion function G used by the top-down algorithm has a profound effect on the resulting bound. We will define the splitting criterion shortly, but intuitively its role is to assign values to potential splits, and thus it implicitly determines which leaf will be split, and which function will be used to label the split. Our analysis yields radically different bounds for the choices made for G by C4.5 and CART, and indicates that both may be inferior to a new choice for G suggested by our analysis. Preliminary experiments supporting this view are reported in our recent follow-up paper [7]. In addition to providing a nontrivial analysis of the performance of top-down decision tree learning, the proof of our results gives a number of specific technical insights into the successes and limitations of these heuristics. The weak learning framework also allows comparison between the top-down heuristics and the previous boosting algorithms designed especially for the weak learning model. Perhaps not surprisingly, the bounds we obtain for the top-down algorithms are in some sense significantly inferior to those obtained for other boosting algorithms.¹ However, as we discuss in Section 3, this is an unavoidable consequence of the fact that we are using decision trees, and is not due to any algorithmic properties of the top-down approach.

2 Top-Down Decision Tree Learning Algorithms

We study a class of learning algorithms that is parameterized by two quantities: the node function class F and the splitting criterion G. Here F is a class of boolean functions with input domain X, and G : [0, 1] → [0, 1] is a function having three properties:

• G is symmetric about 1/2. Thus, G(x) = G(1 − x) for any x ∈ [0, 1].

• G(1/2) = 1, and G(0) = G(1) = 0.

• G is concave.

We will call such a function G a permissible splitting criterion. The binary entropy

    G(q) = H(q) = −q log(q) − (1 − q) log(1 − q)    (1)

is a typical example of a permissible splitting criterion. We will soon describe the roles of F and G precisely, but intuitively the learning algorithms will build decision trees in which the internal nodes are labeled by functions in F, and the splitting criterion G will be used by the learning algorithm to determine which leaf should be split next, and which function h ∈ F to use for the split.

Throughout the paper, we assume a fixed target function f with domain X, and a fixed target distribution P over X. Together f and P induce a distribution on labeled examples ⟨x, f(x)⟩. We will assume that the algorithms we study can exactly compute some rather simple probabilities with respect to this distribution on labeled examples. If this distribution can instead only be sampled, then our algorithms will approximate the probabilities accurately using only a polynomial amount of sampling, and via standard arguments the analysis will still hold.

Let T be any decision tree whose internal nodes are labeled by functions in F, and whose leaves are labeled by values in {0, 1}. We will use leaves(T) to denote the leaves of T. For each ℓ ∈ leaves(T), we define w(ℓ) (the weight of ℓ) to be the probability that a random x (drawn according to P) reaches leaf ℓ in T, and we define q(ℓ) to be the probability that f(x) = 1 given that x reaches ℓ. Let us assume that each leaf ℓ in T is labeled 0 if q(ℓ) ≤ 1/2, and is labeled 1 otherwise (we refer to this labeling as the majority labeling). We define

    ε(T) = Σ_{ℓ ∈ leaves(T)} w(ℓ) min(q(ℓ), 1 − q(ℓ)).    (2)

Note that ε(T) = Pr_P[T(x) ≠ f(x)], so ε(T) is just the usual measure of the error of T with respect to f and P. Now if G is the splitting criterion, we define

    G(T) = Σ_{ℓ ∈ leaves(T)} w(ℓ) G(q(ℓ)).    (3)

It will be helpful to think of G(q(ℓ)) as an approximation to min(q(ℓ), 1 − q(ℓ)), and thus of G(T) as an approximation to ε(T). In particular, if G is a permissible splitting criterion, then we have G(q) ≥ min(q, 1 − q) for all q ∈ [0, 1], and thus G(T) ≥ ε(T) for all T. The top-down algorithms we examine make local modifications to the current tree T in an effort to reduce G(T), and therefore hopefully reduce ε(T) as

¹The recent paper of Dietterich et al. [7] gives detailed experimental comparisons of top-down decision tree algorithms and the best of the standard boosting methods; see also the recent paper of Freund and Schapire [8].

TopDown_{F,G}(t):

1  Initialize T to be the single-leaf tree, with binary label equal to the majority label of the sample.
2  while T has fewer than t internal nodes:
3      Δ_best ← 0.
4      for each pair (ℓ, h) ∈ leaves(T) × F:
5          Δ ← G(T) − G(T(ℓ, h)).
6          if Δ ≥ Δ_best then:
7              Δ_best ← Δ; ℓ_best ← ℓ; h_best ← h.
8      T ← T(ℓ_best, h_best).
9  Output T.

Figure 1: Algorithm TopDown_{F,G}(t).

well. We will need notation to describe these local changes. Thus, if ℓ ∈ leaves(T) and h ∈ F, we use T(ℓ, h) to denote the tree that is the same as T, except that we now make a new internal node at ℓ and label this node by the function h. The newly created child leaves ℓ₀ and ℓ₁ (corresponding to the outcomes h(x) = 0 and h(x) = 1 at the new internal node) are labeled by their majority labels with respect to f and P.

For node function class F and splitting criterion G, we give pseudo-code for the algorithm TopDown_{F,G}(t) in Figure 1. This algorithm takes as input an integer t ≥ 0, and outputs a decision tree T with t internal nodes. Note that line 5 is the only place in the code where we make use of the assumption that we can compute probabilities with respect to P exactly, and as mentioned above, these probabilities are sufficiently simple that they can be replaced by a polynomial amount of sampling from an oracle for labeled examples via standard arguments (see Theorem 1). Two other aspects of the algorithm deserve mention here. First, although we have written the search for (ℓ_best, h_best) explicitly as an exhaustive search in the for loop of line 4, clearly for certain F a more direct minimization might be possible, and in fact necessary in the case that F is uncountably infinite (for example, if F were the class of all thresholded linear combinations of two input variables). Second, all of our results still hold (with appropriate quantification) if we can only find a pair (ℓ, h) that approximates Δ_best. For example, if we have a heuristic method for always finding (ℓ, h) such that Δ = G(T) − G(T(ℓ, h)) ≥ Δ_best/2, then our main results hold without modification.

Our primary interest will be in the error of the tree T output by TopDown_{F,G}(t) as a function of t. To emphasize the dependence on t, for fixed F and G we use ε_t to denote ε(T) and G_t to denote G(T). We think of G_t as a "potential function" that hopefully decreases with each new split.

Algorithms similar to TopDown_{F,G}(t) are in widespread use in both applications and the experimental machine learning community [1, 16, 3]. There are of course many important issues of implementation that we have omitted that must be addressed in practice. Foremost among these is the fact that in applications, one does not usually have an oracle for the exact probabilities, or even an oracle for random examples, but only a random sample S of fixed (often small) size. To address this, it is common practice to use the empirical distribution on S to first grow the tree in the top-down fashion of Figure 1 until ε_t is actually zero (which may require t to be on the order of |S|), and then use the splitting criterion G again to prune the tree. This is done in order to avoid the phenomenon known as overfitting, in which the error on the sample and the error on the distribution diverge [16, 3]. Despite such issues, our idealization TopDown_{F,G}(t) captures the basic algorithmic ideas behind many widely used decision tree algorithms. For example, it is fair to think of the popular C4.5 software package as a variant of TopDown_{F,G}(t) in which G(q) is the binary entropy function, and F is the class of single variables (projection functions) [16]. Similarly, the CART program uses the splitting criterion G(q) = 4q(1 − q), known as the Gini criterion [3]. We feel that the analysis given here provides some insight into why such top-down decision tree algorithms succeed, and what their limitations are.

Let us now discuss the choice of the node function class F in more detail. Intuitively, the more powerful the class of node functions, the more rapidly we might expect G_t and ε_t to decay with t. This is simply because more powerful node functions may allow us to represent the same function more succinctly as a decision tree. Of course, the more powerful F is, the more computationally expensive TopDown_{F,G}(t) becomes: even if the choice of F is such that we can replace the naive for loop of line 4 by a direct computation of (ℓ_best, h_best), we still expect this minimization to require more time as F becomes more powerful. Thus there is a trade-off between the expressive power of F, and the running time of TopDown_{F,G}(t).

In the software packages C4.5 and CART, the default is to err on the side of simplicity: typically just the projections of the input variables and their negations (or slightly more powerful classes) are used as the node functions. The implicit attitude taken is that in practical applications, expensive searches over powerful classes F are simply not feasible, so we ensure the computational efficiency of computing (ℓ_best, h_best) (or at least a good approximation), and hope that a simple F is sufficiently powerful to significantly reduce G_t with each new internal node added. In our analysis, the class F is a parameter. Our approach may be paraphrased as follows: assuming that we have made a "favorable" choice of F, what can we say about the rate of decay of G_t and ε_t as a function of t? Of course, for this approach to be interesting, we need a reasonable definition of what it means to have made a "favorable" choice of F, and it is clear that this definition must say something about the relationship between F and the target function f. In the next section, we adopt the Weak Hypothesis Assumption (motivated by and closely related to the model of Weak Learning [13, 17, 10]) to quantify this relationship.

We defer detailed discussion of the choice of the permissible splitting criterion G, since one of the main results of our analysis is a rather precise reason why some choices may be vastly preferable to others. Here it suffices to simply emphasize that different choices of G can in fact result in different trees being built: thus, both ℓ_best and h_best may depend strongly on G. The best choice of G in practice is far from being a settled issue, as evidenced by the fact that the two most popular decision tree learning packages (C4.5 and CART) use different choices for G. There have also been a number of experimental papers examining various choices [15, 6]. Perhaps the insights in this paper most relevant to the practice of machine learning are those regarding the behavior of TopDown_{F,G} as a function of G [7].

3 The Weak Hypothesis Assumption

We now quantify what we mean by a "favorable" choice of the node splitting class F. The definition we adopt is essentially the one used by a number of previous papers on the topic of weak learning [17, 9, 10].

Definition 1 Let f be any boolean function over an input space X. Let F be any class of boolean functions over X. Let γ ∈ (0, 1/2]. We say that f γ-satisfies the Weak Hypothesis Assumption with respect to F if for any distribution P over X, there exists a function h ∈ F such that Pr_P[h(x) ≠ f(x)] ≤ 1/2 − γ. If P is a class of distributions, we say that f γ-satisfies the Weak Hypothesis Assumption with respect to F over P if the statement holds for any P ∈ P. We call the parameter γ the advantage.

It is worth mentioning that this definition can be extended to the case where F is a class of probabilistic boolean functions [12]. All of our results hold for this more general setting.

Note that if F actually contains the function f, then f trivially (1/2)-satisfies the Weak Hypothesis Assumption with respect to F. If F does not contain f, then the Weak Hypothesis Assumption amounts to an assumption on f. More precisely, it is known that the Weak Hypothesis Assumption is equivalent to the assumption that on any distribution, f can be approximated by thresholded linear combinations of functions in F [9].

Compared to the PAC model and its variants, in the Weak Hypothesis Assumption we take a more incremental view of learning: rather than assuming that the target function lies in some large, fixed class, and trying to design an efficient algorithm for searching this class, we instead hope that "simple" functions (in our case, the decision tree node functions) are already slightly correlated with the target function, and analyze an algorithm's ability to amplify these slight correlations to achieve arbitrary accuracy. Based on our results here and our remarks in the introduction regarding the known results for decision tree learning in the PAC model, it seems that the weak learning approach is perhaps more enlightening for the decision tree heuristics we analyze.

Under the Weak Hypothesis Assumption, several previous papers have proposed boosting algorithms that combine many different functions from F, each with a small predictive advantage over random guessing on a different filtered distribution, to obtain a single hybrid function whose generalization error on the target distribution P is less than any desired value ε [17, 9, 10]. The central question examined for such algorithms is: As a function of the advantage γ and the desired error ε, how many functions must be combined, and how many random examples drawn? Several boosting algorithms enjoy very strong upper bounds on these quantities [17, 9, 10]: the number of functions that must be combined is polynomial in 1/γ and log(1/ε), and the number of examples required is polynomial in 1/γ and 1/ε. These boosting methods represent their hypothesis as a thresholded linear combination of functions from F, a choice that is obviously well-suited to the Weak Hypothesis Assumption in light of the remarks above.

The goal of this paper is to give a similar analysis for top-down decision tree algorithms: as a function of γ and ε, how many nodes must the constructed tree have (that is, how large must t be) before the error ε_t is driven below ε? However, we can immediately dismiss the idea that any decision tree learning algorithm (top-down or otherwise) can achieve performance comparable to that achieved by algorithms specifically designed for the weak learning model. To see this, suppose the input space X is {0, 1}ⁿ, and that f is the majority function. Then if F is the class of single-variable projection functions, it is known that f (1/n)-satisfies the Weak Hypothesis Assumption with respect to F. However, the smallest decision tree with generalization error less than 1/4 with respect to f on the uniform input distribution has size exponential in n. From this example we can immediately infer that t must be exponential in 1/γ in order for TopDown_{F,G}(t) to achieve any nontrivial performance. Note, however, that this limitation is entirely representational: it arises solely from the fact that we have chosen to learn using a decision tree over F, not from any properties of the algorithm we use to construct the tree. A more sophisticated construction due to Freund [9] (and discussed following Theorem 1) implies that this representational lower bound can be improved to (1/ε)^{c/γ²} for some constant c > 0.

We will show that for appropriately chosen G, algorithm TopDown_{F,G}(t) in fact achieves this optimal bound of O((1/ε)^{c/γ²}).

It is reasonable to ask whether some assumption other than the Weak Hypothesis Assumption might permit even more favorable analyses of top-down decision tree algorithms. In this regard, we briefly note that, like other boosting analyses, for our analysis a significant weakening of the Weak Hypothesis Assumption in fact suffices. Namely, our results hold if f satisfies the Weak Hypothesis Assumption over P, where P contains the target distribution P and all those filterings of P that are constructed by the algorithm (this notion will be made more precise in the analysis). Furthermore, it can be shown that this weaker assumption is in fact implied by the rapid decay of ε_t. In other words, this weakened assumption holds if and only if TopDown_{F,G}(t) performs well. Thus the Weak Hypothesis Assumption seems to be the correct framework in which to analyze algorithm TopDown_{F,G}(t).

4 Statement of the Main Result

Our main result gives upper bounds on the error ε_t of TopDown_{F,G}(t) for three different choices of the splitting criterion G, under the assumption that the target function γ-satisfies the Weak Hypothesis Assumption for F. Equivalently, we give upper bounds on the value of t required to satisfy ε_t ≤ ε for any given ε.

Two of our choices for G are motivated by the popular implementations of TopDown_{F,G}(t) previously discussed. The Gini criterion used by the CART program is G(q) = 4q(1 − q), and the entropy criterion used by the C4.5 software package is G(q) = H(q) = −q log(q) − (1 − q) log(1 − q). It is easily verified that both are permissible. The third criterion we examine is G(q) = 2√(q(1 − q)). This new choice is motivated by our analysis: it is the splitting criterion for which we obtain the best bounds. A plot of these three functions is given in Figure 2.

Theorem 1 Let F be any class of boolean functions, let γ ∈ (0, 1/2], and let f be any target boolean function that γ-satisfies the Weak Hypothesis Assumption with respect to F. Let P be any target distribution, and let T be the tree output by TopDown_{F,G}(t). Then for any ε, the error ε(T) is less than ε provided that

    t ≥ (1/ε)^{(c/(γ²ε²)) log(1/ε)}    if G(q) = 4q(1 − q),    (4)

for some constant c > 0; or provided that

    t ≥ (1/ε)^{(c log(1/ε))/γ²}    if G(q) = H(q),    (5)

for some constant c > 0; or provided that

    t ≥ (1/ε)^{c/γ²}    if G(q) = 2√(q(1 − q)),    (6)

for some constant c > 0. Furthermore, suppose that instead of the ability to compute probabilities exactly with respect to P, we are only given an oracle for random examples ⟨x, f(x)⟩. Consider taking a random sample S of m examples, and running algorithm TopDown_{F,G}(t) using the empirical distribution on S. There exists a polynomial p(log(1/δ), 1/ε, t) such that if m is larger than p(log(1/δ), 1/ε, t) and t satisfies the above conditions, then with probability greater than 1 − δ, the error ε(T) of T with respect to f and P will be smaller than ε.

In terms of the dependence on the advantage parameter γ, all three bounds are exponential in 1/γ. As we have discussed, the majority example shows that this dependence is an unavoidable consequence of using decision trees. In terms of dependence on ε, however, there are vast differences between the bounds for different choices of G. For the Gini criterion, the bound is exponential in 1/ε, and for the entropy criterion of C4.5 it is superpolynomial but subexponential. Only for the new criterion G(q) = 2√(q(1 − q)) do we obtain a bound polynomial in 1/ε. The following argument, based on ideas of Freund [9], demonstrates that this bound is in fact optimal for decision tree algorithms (top-down or otherwise). Let f be the target function, and suppose that the class F of splitting functions contains just a single probabilistic function h such that for any input x, Pr[h(x) ≠ f(x)] = 1/2 − γ. Here the probability is taken only over the internal randomization of h. Intuitively, to approximate f, one would take many copies of h (each with independent random bits) and take the majority vote. For any fixed tree size t, the optimal decision tree over F will simply be the complete binary tree of depth log(t) in which each internal node is labeled by h (that is, by a copy of h with its own independent random bits). However, it is not difficult to prove that in order for the tree to have error less than ε, the depth of the tree must be Ω((1/γ²) log(1/ε)), resulting in a lower bound on the size t of (1/ε)^{c/γ²} for some constant c > 0. Thus, the bound given for the choice G(q) = 2√(q(1 − q)) in Theorem 1 is optimal for decision tree learning algorithms. We do not have matching lower bounds for the upper bounds given in Theorem 1 for the Gini and entropy splitting criteria. The proof of Theorem 1 will show that there are good technical reasons for believing that the claimed differences between the criteria are qualitatively real, and this seems to be borne out by preliminary experimental results [7].
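To see the gap between the three bounds concretely, one can compare the exponents of Theorem 1 numerically; the constants are unspecified in the theorem, so setting c = 1 below (and using base-2 logarithms) is a purely hypothetical normalization:

```python
import math

def log2_size_bounds(gamma, eps, c=1.0):
    # log2 of the tree sizes sufficient in Theorem 1, with the unspecified
    # constants hypothetically set to c = 1 (illustration only).
    log_inv = math.log2(1.0 / eps)
    gini = (c / (gamma ** 2 * eps ** 2)) * log_inv * log_inv   # Eq. (4)
    entropy = (c * log_inv / gamma ** 2) * log_inv             # Eq. (5)
    sqrt = (c / gamma ** 2) * log_inv                          # Eq. (6)
    return gini, entropy, sqrt

g, e, s = log2_size_bounds(gamma=0.1, eps=0.01)
assert g > e > s   # Gini worst, entropy intermediate, sqrt criterion best
```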

The remainder of the paper is devoted to the proof of Theorem 1. We emphasize and discuss the parts of the analysis that shed light on important issues, such as the reason the Weak Hypothesis Assumption is helpful for TopDown_{F,G}(t), and the differences between the various choices for G.

5 Proof of the Main Result

Our approach is to bound G_t as a function of t and γ, and therefore bound ε_t as well. The main idea is to show that G_t decreases by at least a certain amount with each new split.

Let us fix a leaf ℓ of T, and use the shorthand w = w(ℓ) and q = q(ℓ) (recall that w(ℓ) is the probability of reaching ℓ, and q(ℓ) is the probability that the target function is 1 given that we have reached ℓ). Now suppose we split the leaf ℓ by replacing it by a new internal node labeled by the test function h ∈ F (that is, T becomes T(ℓ, h)). Then we have two new child leaves of the now-internal node ℓ, one corresponding to the test outcome h(x) = 0 and the other corresponding to the test outcome h(x) = 1. Let us refer to these leaves as ℓ₀ and ℓ₁ respectively, and let τ denote the probability that x reaches ℓ₁ given that x reaches the internal node ℓ (thus, w(ℓ₀) = (1 − τ)w and w(ℓ₁) = τw). Let p = q(ℓ₀) and r = q(ℓ₁); then we have q = (1 − τ)p + τr. This simply says that the total probability that f(x) = 1 must be preserved after splitting at ℓ. See Figure 4 for a diagram summarizing the split parameters.

We are now in position to make some simple but central observations regarding how the split at ℓ reduces the quantity G_t. Note that since q = (1 − τ)p + τr and τ ∈ [0, 1], one of p and r is less than or equal to q, and the other is greater than or equal to q. Without loss of generality, let p ≤ q ≤ r. Now before the split, the contribution of the leaf ℓ to G_t was w · G(q). After the split, the contribution of the now-internal node ℓ to G_{t+1} is w · ((1 − τ)G(p) + τG(r)). Thus the decrease to G_t caused by the split is exactly w · (G(q) − (1 − τ)G(p) − τG(r)). It will be convenient to analyze the local decrease to G_t, thus assuming w = 1, and reintroduce the w factor later. Hence, we are interested in lower bounding the quantity

    G(q) − (1 − τ)G(p) − τG(r).    (7)

This quantity is simply the difference between the value of the concave function G at the point q, which is a linear combination of p and r, and the value of the same linear combination of G(p) and G(r); see Figure 5. Define δ = r − p. It is easy to see that if δ is small, or if τ is near 0 or 1, then G(q) − (1 − τ)G(p) − τG(r) will be small. Thus to lower bound the decrease to G_t, we will first need to argue that under the Weak Hypothesis Assumption these conditions cannot occur. Towards this goal, let P_ℓ denote the distribution of inputs reaching ℓ, and let the "balanced" distribution P′_ℓ be defined by P′_ℓ(x) = P_ℓ(x)/2q if f(x) = 1 and P′_ℓ(x) = P_ℓ(x)/(2(1 − q)) if f(x) = 0. Thus P′_ℓ is simply P_ℓ modified to give equal weight to the positive and negative examples of f (see Figure 6). Lemma 2 below shows that if the function h used to split at ℓ witnesses the Weak Hypothesis Assumption for the balanced distribution at ℓ, then δ cannot be too small, and τ cannot be too near 0 or 1.

5.1 Constraints on τ and δ from the Weak Hypothesis Assumption

Lemma 2 Let q, τ and δ be as defined above for the split at leaf ℓ. Let P_ℓ be the distribution induced by P on those inputs reaching ℓ, and let P′_ℓ be the balanced distribution at ℓ. If the function h placed at ℓ obeys Pr_{P′_ℓ}[h(x) ≠ f(x)] ≤ 1/2 − γ (thus, h witnesses the Weak Hypothesis Assumption for P′_ℓ), then τ(1 − τ)δ ≥ 2γq(1 − q).

Proof: Recall that τ = Pr_{P_ℓ}[h(x) = 1] and r = Pr_{P_ℓ}[f(x) = 1 | h(x) = 1]. It follows that Pr_{P_ℓ}[f(x) = 1, h(x) = 1] = τr, and therefore Pr_{P_ℓ}[f(x) = 0, h(x) = 1] = τ − τr = τ(1 − r). In going from P_ℓ to the balanced distribution P′_ℓ, inputs x such that f(x) = 0 are scaled by the factor 1/(2(1 − q)), and thus Pr_{P′_ℓ}[f(x) = 0, h(x) = 1] = (1/(2(1 − q)))τ(1 − r). Similarly, since q = Pr_{P_ℓ}[f(x) = 1], we have Pr_{P_ℓ}[f(x) = 1, h(x) = 0] = q − τr, and this is scaled by the factor 1/(2q) to obtain Pr_{P′_ℓ}[f(x) = 1, h(x) = 0] = (1/(2q))(q − τr). (See Figure 6.) Thus, we may write an exact expression for Pr_{P′_ℓ}[h(x) ≠ f(x)]:

    Pr_{P′_ℓ}[h(x) ≠ f(x)] = (1/(2(1 − q)))τ(1 − r) + (1/(2q))(q − τr) = 1/2 + (τ/2)((1 − r)/(1 − q) − r/q).    (8)

By the assumption on h, Pr_{P′_ℓ}[h(x) ≠ f(x)] ≤ 1/2 − γ. Thus,

    (τ/2)(r/q − (1 − r)/(1 − q)) ≥ γ.    (9)

Substituting r = q + (1 − τ)δ yields

    (τ(1 − τ)δ/2)(1/q + 1/(1 − q)) ≥ γ.    (10)

Since 1/q + 1/(1 − q) = 1/(q(1 − q)), the lemma follows.    □ (Lemma 2)

q 1q  q 1q 
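As a concrete sanity check of Lemma 2, the following sketch (our own illustration, not from the paper; the distribution and labels are made up) builds a small leaf distribution, forms the balanced distribution, and verifies the constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$:

```python
# Balanced distribution at a leaf: P'(x) = P(x)/(2q) if f(x)=1, else P(x)/(2(1-q)).
P = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}   # distribution of inputs reaching the leaf
f = {0: 1, 1: 0, 2: 1, 3: 0}           # target labels
h = {0: 1, 1: 0, 2: 0, 3: 0}           # splitting function; errs only on x = 2

q = sum(p for x, p in P.items() if f[x] == 1)                   # Pr[f=1] = 0.6
err = sum(p / (2 * q if f[x] == 1 else 2 * (1 - q))
          for x, p in P.items() if h[x] != f[x])                # error on balanced P'
gamma = 0.5 - err                                               # advantage of h on P'

tau = sum(p for x, p in P.items() if h[x] == 1)                 # Pr[h=1]
r = sum(p for x, p in P.items() if h[x] == 1 and f[x] == 1) / tau
p_ = sum(p for x, p in P.items() if h[x] == 0 and f[x] == 1) / (1 - tau)
delta = r - p_

# Lemma 2 (the proof in fact gives the stronger delta*tau*(1-tau) >= 2*gamma*q*(1-q)):
assert delta * tau * (1 - tau) >= gamma * q * (1 - q)
```

On this example the proof's factor-2 version holds with equality: $\delta\tau(1-\tau) = 0.16 = 2\gamma q(1-q)$.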

It is important to note that Lemma 2 does not hold in general if we let the function $h$ placed at $\ell$ be the Weak Hypothesis Assumption witness for the unbalanced distribution $P_\ell$. The reason is that if $q$ and $1-q$ are unbalanced (say, 0.7 and 0.3), then the Weak Hypothesis Assumption witness for $P_\ell$ may be the constant function $h \equiv 1$, in which case we will have $\tau = 1$, and there is no decrease to $G_t$. Lemma 2 is also the only place in the proof in which the Weak Hypothesis Assumption will be used. Thus, the "weaker" assumption mentioned above that suffices for our main result to hold is simply that the Weak Hypothesis Assumption holds over all balanced distributions $P'_\ell$, where $\ell$ is any node in any decision tree over $F$. Note that this class of distributions is always a subclass of those distributions that can be obtained by taking a subset of the support of $P$, setting the probability of this subset to zero, and renormalizing. Thus, the resulting class of filtered distributions is in some sense simpler than those generated by other boosting algorithms [17, 9, 10]. See the recent paper of Dietterich et al. [7] for further discussion of this issue.

Our goal now is to obtain for each $G$ a lower bound on the local drop $G(q) - (1-\tau)G(p) - \tau G(r)$ to $G_t$ under the condition $\delta\tau(1-\tau) \ge \gamma q(1-q)$ given by Lemma 2. We emphasize at the outset that it is of the utmost importance to obtain the best (largest) lower bounds possible, since eventually these lower bounds directly influence the exponent of the resulting bounds on $t$. Obtaining these lower bounds will be the most involved step of our analysis. Before taking it, however, let us look ahead slightly and observe that we expect the lower bound we obtain to depend critically on properties of the splitting criterion $G(q)$. In particular, it seems that a good choice of $G(q)$ should have large curvature for all values of $q$, for such a function will maximize the difference between $G(q)$ and the linear combination $(1-\tau)G(p) + \tau G(r)$ (see Figure 5). In this light, we see that the choice $G(q) = 2\min(q, 1-q)$ may be an especially poor choice, because even under the condition $\delta\tau(1-\tau) \ge \gamma q(1-q)$ (or other similar conditions), $G(q) = (1-\tau)G(p) + \tau G(r)$ may hold. In fact, the only situation in which the decrease to $G_t$ will be non-zero for this choice of $G(q)$ will be when $p < 1/2$ and $r > 1/2$, a situation which is certainly not implied by the Weak Hypothesis Assumption. The reason that $2\min(q, 1-q)$ is a bad choice is that it exhibits concavity only around the point $q = 1/2$, rather than for all values of $q$. In Figures 2 and 3, we plot the three choices for $G$ that we will examine, so that their relative concavity properties can be compared. These concavity properties play a crucial role in the final bounds we obtain for each of the choices for $G$. In this regard, the important point to keep in mind during the coming minimization is: for each $G$, how does $G_t - G_{t+1}$ (the amount by which we reduce our "potential" with each split) compare to $G_t$ itself, which is the potential remaining?
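The point about curvature can be made concrete numerically. The sketch below (our own illustration; the particular values of $q$, $\tau$ and $\delta$ are arbitrary) computes the local drop $G(q) - (1-\tau)G(p) - \tau G(r)$ for the three criteria and for $2\min(q, 1-q)$, with $p$ and $r$ both on the same side of $1/2$:

```python
import math

def local_drop(G, q, tau, delta):
    # G(q) - [(1-tau)G(p) + tau G(r)] with p = q - tau*delta and r = q + (1-tau)*delta,
    # so that q = (1-tau)p + tau*r.
    return G(q) - (1 - tau) * G(q - tau * delta) - tau * G(q + (1 - tau) * delta)

gini = lambda q: 4 * q * (1 - q)
entropy = lambda q: -q * math.log2(q) - (1 - q) * math.log2(1 - q)
sqrt_crit = lambda q: 2 * math.sqrt(q * (1 - q))
min_crit = lambda q: 2 * min(q, 1 - q)

q, tau, delta = 0.2, 0.5, 0.1   # p = 0.15 and r = 0.25, both below 1/2
for G in (gini, entropy, sqrt_crit):
    assert local_drop(G, q, tau, delta) > 0            # strictly concave: always progress
assert abs(local_drop(min_crit, q, tau, delta)) < 1e-12  # piecewise linear: no progress
```

Only when $p < 1/2 < r$ does $2\min(q, 1-q)$ register any drop, in line with the discussion above.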

5.2 A Constrained Multivariate Minimization Problem

In order to analyze the effects of the condition $\delta\tau(1-\tau) \ge \gamma q(1-q)$ on the quantity $G(q) - (1-\tau)G(p) - \tau G(r) = G_t - G_{t+1}$, it will be helpful to rewrite this quantity in terms of $q$, $\tau$ and $\delta$. Thus we substitute $p = q - \tau\delta$ and $r = q + (1-\tau)\delta$ and define
\[
\Delta_G(q, \tau, \delta) = G(q) - (1-\tau)\,G(q - \tau\delta) - \tau\,G(q + (1-\tau)\delta). \tag{11}
\]
To obtain the desired lower bound, we must solve a constrained multivariate minimization problem: namely, we must minimize $\Delta_G(q, \tau, \delta)$ subject to the constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$. As mentioned previously, we would like to express the resulting constrained minimum of $\Delta_G(q, \tau, \delta)$ as a function of $\gamma$ and $q$ only. Towards this goal, we first fix $\tau$ and $\delta$ and lower bound $\Delta_G(q, \tau, \delta)$ by an algebraic expression of its parameters.

Lemma 3. Let $G(q)$ be one of $4q(1-q)$, $H(q)$ and $2\sqrt{q(1-q)}$. Then for any fixed values $\tau, \delta \in [0,1]$,
\[
\Delta_G(q, \tau, \delta) \ge -\frac{\delta^2\tau(1-\tau)}{2}\,G''(q) - \frac{\delta^3\tau(1-\tau)(1-2\tau)}{6}\,G'''(q). \tag{12}
\]

Proof: With $\tau$ and $\delta$ fixed, we perform a Taylor expansion of $G$ at $q$, and replace the occurrences of $G(q)$, $G(q - \tau\delta)$ and $G(q + (1-\tau)\delta)$ in $\Delta_G(q, \tau, \delta)$ by their Taylor expansions around $q$. It is easily verified that the terms involving zero-order and first-order derivatives of $G$ cancel in the resulting expression for $\Delta_G(q, \tau, \delta)$. The contribution to $\Delta_G(q, \tau, \delta)$ involving second-order derivatives is
\[
-\frac{G''(q)}{2}\left[(1-\tau)(\tau\delta)^2 + \tau((1-\tau)\delta)^2\right] = -\frac{G''(q)}{2}\,\delta^2\tau(1-\tau). \tag{13}
\]
The contribution involving third-order derivatives is
\[
-\frac{G'''(q)}{6}\left[-(1-\tau)(\tau\delta)^3 + \tau((1-\tau)\delta)^3\right] = -\frac{G'''(q)}{6}\,\delta^3\tau(1-\tau)(1-2\tau). \tag{14}
\]
The lower bound claimed in the lemma is simply the sum of the second-order and third-order contributions. The fact that it is actually a lower bound on $\Delta_G(q, \tau, \delta)$ follows from the fact that for the three $G$ under consideration, the derivatives are either all negative (and thus all make positive contributions to the bound, allowing us to take any prefix of the Taylor expansion to obtain a lower bound), or have alternating signs with negative even-order derivatives (and thus even-order derivatives make a positive contribution to the bound and odd-order derivatives make a negative contribution). In the latter case, the terms of the Taylor expansion can be grouped in adjacent pairs, with each pair giving positive contribution. Thus by summing through an odd-order derivative we obtain a lower bound. □ (Lemma 3)

Lemma 3 provides us with an algebraic expression lower bounding $\Delta_G(q, \tau, \delta)$. We now wish to show that at the constrained minimum of $\Delta_G(q, \tau, \delta)$, this expression can be further simplified to obtain a lower bound involving only $q$ and $\gamma$. Thus, we wish to use the constraint of Lemma 2 to eliminate the parameters $\tau$ and $\delta$.

We begin by substituting for $\delta$ under the given constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$. It is clear from Equation (11) and Figure 5 that for any fixed values $q, \tau \in [0,1]$, $\Delta_G(q, \tau, \delta)$ is minimized by choosing $\delta$ as small as possible. Thus let us set $\delta = \gamma q(1-q)/(\tau(1-\tau))$ and define
\[
\tilde{\Delta}_G(q, \tau) = \Delta_G\!\left(q, \tau, \frac{\gamma q(1-q)}{\tau(1-\tau)}\right)
= G(q) - (1-\tau)\,G\!\left(q - \frac{\gamma q(1-q)}{1-\tau}\right) - \tau\,G\!\left(q + \frac{\gamma q(1-q)}{\tau}\right). \tag{15}
\]
Our next lemma shows that for the three choices of $G(q)$ under consideration, for all values of $q$ the minimizing value of $\tau$ can be bounded away from 0 and 1. This will allow us to replace occurrences of $\tau$ with constant values.

Lemma 4. Let $G(q)$ be one of $4q(1-q)$, $H(q)$ and $2\sqrt{q(1-q)}$. Let $\gamma \in [0,1]$ be any fixed value, and let $\tilde{\Delta}_G(q, \tau)$ be as defined in Equation (15). Then for any fixed $q \in [0,1]$, $\tilde{\Delta}_G(q, \tau)$ is minimized by a value of $\tau$ falling in the interval $[0.4, 0.6]$.

Proof: We differentiate $\tilde{\Delta}_G(q, \tau)$ with respect to $\tau$. Writing $a = \gamma q(1-q)/(1-\tau)$ and $b = \gamma q(1-q)/\tau$ for the two displacements in Equation (15), and using $da/d\tau = a/(1-\tau)$ and $db/d\tau = -b/\tau$, we obtain
\[
\frac{\partial \tilde{\Delta}_G(q,\tau)}{\partial \tau}
= G(q-a) + a\,G'(q-a) - G(q+b) + b\,G'(q+b) \tag{16, 17}
\]
\[
= \left[G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta)\right] - \left[G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)\right] \tag{18}
\]
where we recall $\delta = \gamma q(1-q)/(\tau(1-\tau))$, so that $a = \tau\delta$ and $b = (1-\tau)\delta$. This last expression for $\partial\tilde{\Delta}_G(q,\tau)/\partial\tau$ has a natural and helpful geometric interpretation: it is the difference between two tangent lines to the function $G$, both evaluated at the point $q$. See Figure 7. The term $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta)$ computes the line tangent to $G$ at the point $q - \tau\delta$, and evaluates this line at the point $q$, while the term $G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)$ computes the line tangent to $G$ at the point $q + (1-\tau)\delta$, and again evaluates this line at $q$. Thus, for the $G$ that we are considering, for any fixed values of $q$ and $\gamma$ the minimizing values of $\tau$ satisfy $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) = G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)$ and therefore $\partial\tilde{\Delta}_G(q,\tau)/\partial\tau = 0$. We would like to show that for some particular choices for $G$, any minimizing value of $\tau$ is always (for all $q \in [0,1]$, $\gamma \in (0, 1/2]$, and $\delta = \gamma q(1-q)/(\tau(1-\tau))$) bounded away from 0 and 1. It can be seen for $\gamma = 0.1$ in Figures 8 through 13 (and formal proofs are given in Lemmas 11, 12 and 13 in the Appendix) that if $G(q) = 4q(1-q)$, $G(q) = H(q)$ or $G(q) = 2\sqrt{q(1-q)}$, then for all $q, \gamma$: if $\tau \le 0.4$ then $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \le G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)$ and thus $\partial\tilde{\Delta}_G(q,\tau)/\partial\tau \le 0$, and if $\tau \ge 0.6$ then $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \ge G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)$ and thus $\partial\tilde{\Delta}_G(q,\tau)/\partial\tau \ge 0$. This means that for these three choices for $G$, all the minimizing values of $\tau$ lie in the interval $[0.4, 0.6]$ for all values of $q$ and $\gamma$, as desired. □ (Lemma 4)
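Lemma 4 can also be checked numerically by a grid search over $\tau$. The sketch below is our own illustration (not from the paper), using $\gamma = 0.1$ as in Figures 8 through 13:

```python
import math

def argmin_tau(G, q, gamma, grid=1000):
    """Grid-minimize Delta~_G(q, tau) over tau, with delta = gamma*q*(1-q)/(tau*(1-tau))."""
    best_t, best_v = None, float("inf")
    for i in range(1, grid):
        t = i / grid
        delta = gamma * q * (1 - q) / (t * (1 - t))
        p, r = q - t * delta, q + (1 - t) * delta
        if not (0 < p and r < 1):
            continue  # split parameters leave (0, 1); this tau is infeasible
        v = G(q) - (1 - t) * G(p) - t * G(r)
        if v < best_v:
            best_t, best_v = t, v
    return best_t

gini = lambda q: 4 * q * (1 - q)
entropy = lambda q: -q * math.log2(q) - (1 - q) * math.log2(1 - q)
sqrt_crit = lambda q: 2 * math.sqrt(q * (1 - q))

for G in (gini, entropy, sqrt_crit):
    for q in (0.1, 0.3, 0.5, 0.7, 0.9):
        assert 0.4 <= argmin_tau(G, q, gamma=0.1) <= 0.6
```

For the Gini criterion the minimizer is exactly $\tau = 1/2$ (Lemma 11); for the other two it drifts slightly but stays inside $[0.4, 0.6]$.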

5.3 Effects of the Choice of G on the Drop to G_t

Let us review where we are: in Lemma 3 we have given a lower bound on $\Delta_G(q, \tau, \delta)$ that involves all three parameters and derivatives of $G$, and in Lemma 4 we have shown that under the constraint $\delta = \gamma q(1-q)/(\tau(1-\tau))$, to lower bound $\tilde{\Delta}_G(q, \tau)$ for the $G$ we examine, we may assume that $\tau \in [0.4, 0.6]$. We now apply these general results to obtain lower bounds for specific choices of $G$.

Lemma 5. Under the constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$ given by Lemma 2, if $G(q) = 4q(1-q)$ then for any $q \in [0,1]$
\[
\Delta_G(q, \tau, \delta) \ge 16\gamma^2 q^2(1-q)^2. \tag{19}
\]

Proof: Here we have derivatives $G''(q) = -8$ and $G'''(q) = 0$. Application of Lemma 3 gives
\[
\Delta_G(q, \tau, \delta) \ge 4\delta^2\tau(1-\tau) \tag{20}
\]
and the substitution $\delta = \gamma q(1-q)/(\tau(1-\tau))$ yields
\[
\tilde{\Delta}_G(q, \tau) \ge \frac{4\gamma^2 q^2(1-q)^2}{\tau(1-\tau)} \ge 16\gamma^2 q^2(1-q)^2. \tag{21}
\]
Here we have used the fact that $\tau(1-\tau) \le 1/4$. □ (Lemma 5)

Lemma 6. Under the constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$ given by Lemma 2, if $G(q) = H(q)$ then for any $q \in [0,1]$
\[
\Delta_G(q, \tau, \delta) \ge \gamma^2 q(1-q). \tag{22}
\]

Proof: Using the fact that for $G(q) = H(q)$ we have derivatives $G''(q) = -(1/(1-q) + 1/q)$ and $G'''(q) = 1/q^2 - 1/(1-q)^2$, application of Lemma 3 yields
\[
\Delta_G(q, \tau, \delta) \ge \frac{\delta^2\tau(1-\tau)}{2}\left(\frac{1}{1-q} + \frac{1}{q}\right) - \frac{\delta^3\tau(1-\tau)(1-2\tau)}{6}\left(\frac{1}{q^2} - \frac{1}{(1-q)^2}\right) \tag{23}
\]
\[
= \frac{\delta^2\tau(1-\tau)}{2q(1-q)} - \frac{\delta^3\tau(1-\tau)(1-2\tau)(1-2q)}{6q^2(1-q)^2}. \tag{24}
\]
Substituting $\delta = \gamma q(1-q)/(\tau(1-\tau))$ yields
\[
\tilde{\Delta}_G(q, \tau) \ge \frac{\gamma^2 q(1-q)}{2\tau(1-\tau)} - \frac{\gamma^3 q(1-q)(1-2q)(1-2\tau)}{6\tau^2(1-\tau)^2} \tag{25}
\]
\[
\ge 2\gamma^2 q(1-q) - \frac{(0.2)\,\gamma^2 q(1-q)}{12(0.4)^2(0.4)^2} \tag{26}
\]
\[
\ge \gamma^2 q(1-q) \tag{27}
\]
as desired. Here to lower bound the first term we have used the fact that $\tau(1-\tau) \le 1/4$ always, and for the second term we have used $|1-2q| \le 1$, $\gamma \le 1/2$, and Lemma 4, which states that the minimizing value of $\tau$ lies in the interval $[0.4, 0.6]$ for all $q$ (so that $|1-2\tau| \le 0.2$ and $\tau^2(1-\tau)^2 \ge (0.4)^2(0.4)^2$). □ (Lemma 6)

Lemma 7. Under the constraint $\delta\tau(1-\tau) \ge \gamma q(1-q)$ given by Lemma 2, if $G(q) = 2\sqrt{q(1-q)}$ then for any $q \in [0,1]$
\[
\Delta_G(q, \tau, \delta) \ge \frac{(1-2q)^2\gamma^2 (q(1-q))^{1/2}}{2} + 2\gamma^2 (q(1-q))^{3/2}. \tag{28}
\]

Proof: Here we have derivatives
\[
G''(q) = -\frac{(1-2q)^2}{2(q(1-q))^{3/2}} - \frac{2}{(q(1-q))^{1/2}} \tag{29}
\]
and
\[
G'''(q) = \frac{3(1-2q)^3}{4(q(1-q))^{5/2}} + \frac{3(1-2q)}{(q(1-q))^{3/2}}. \tag{30}
\]
Plugging these derivatives into Lemma 3 yields
\[
\Delta_G(q, \tau, \delta) \ge \frac{\delta^2\tau(1-\tau)}{2}\left[\frac{(1-2q)^2}{2(q(1-q))^{3/2}} + \frac{2}{(q(1-q))^{1/2}}\right]
- \frac{\delta^3\tau(1-\tau)(1-2\tau)}{6}\left[\frac{3(1-2q)^3}{4(q(1-q))^{5/2}} + \frac{3(1-2q)}{(q(1-q))^{3/2}}\right]. \tag{31}
\]
Substituting $\delta = \gamma q(1-q)/(\tau(1-\tau))$ yields
\[
\tilde{\Delta}_G(q, \tau) \ge \frac{\gamma^2}{2\tau(1-\tau)}\left[\frac{(1-2q)^2(q(1-q))^{1/2}}{2} + 2(q(1-q))^{3/2}\right]
- \frac{\gamma^3(1-2\tau)}{6\tau^2(1-\tau)^2}\left[\frac{3(1-2q)^3(q(1-q))^{1/2}}{4} + 3(1-2q)(q(1-q))^{3/2}\right] \tag{32}
\]
\[
= \frac{(1-2q)^2\gamma^2(q(1-q))^{1/2}}{4\tau(1-\tau)}\left[1 - \frac{\gamma(1-2\tau)(1-2q)}{2\tau(1-\tau)}\right]
+ \frac{\gamma^2(q(1-q))^{3/2}}{\tau(1-\tau)}\left[1 - \frac{\gamma(1-2\tau)(1-2q)}{2\tau(1-\tau)}\right] \tag{33}
\]
\[
\ge \frac{(1-2q)^2\gamma^2(q(1-q))^{1/2}}{4\tau(1-\tau)}\left[1 - \frac{0.2}{4(0.4)(0.4)}\right]
+ \frac{\gamma^2(q(1-q))^{3/2}}{\tau(1-\tau)}\left[1 - \frac{0.2}{4(0.4)(0.4)}\right] \tag{34}
\]
\[
\ge \frac{(1-2q)^2\gamma^2(q(1-q))^{1/2}}{2} + 2\gamma^2(q(1-q))^{3/2}. \tag{35}
\]
Here we have used the facts that $\gamma \le 1/2$, $\tau(1-\tau) \le 1/4$ and $(1-2q)^2 \le 1$, and Lemma 4. □ (Lemma 7)
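The three lower bounds can be spot-checked numerically. The sketch below is our own illustration, not the paper's: it takes $\delta$ at the boundary of the Lemma 2 constraint and $\tau \in [0.4, 0.6]$, and uses the base-2 entropy (which only makes the entropy drop larger than with natural logarithms), then verifies the bounds of Lemmas 5, 6 and 7 on a grid:

```python
import math

gamma = 0.1
gini = lambda q: 4 * q * (1 - q)
entropy = lambda q: -q * math.log2(q) - (1 - q) * math.log2(1 - q)
sqrt_crit = lambda q: 2 * math.sqrt(q * (1 - q))

def drop(G, q, tau):
    # delta at the boundary of the Lemma 2 constraint: delta*tau*(1-tau) = gamma*q*(1-q)
    delta = gamma * q * (1 - q) / (tau * (1 - tau))
    return G(q) - (1 - tau) * G(q - tau * delta) - tau * G(q + (1 - tau) * delta)

for i in range(1, 100):
    q = i / 100
    u = q * (1 - q)
    for tau in (0.4, 0.5, 0.6):
        assert drop(gini, q, tau) >= 16 * gamma**2 * u**2 - 1e-12       # Lemma 5
        assert drop(entropy, q, tau) >= gamma**2 * u - 1e-12            # Lemma 6
        assert drop(sqrt_crit, q, tau) >= (1 - 2 * q)**2 * gamma**2 * math.sqrt(u) / 2 \
            + 2 * gamma**2 * u**1.5 - 1e-12                             # Lemma 7
```

Note that for the Gini criterion the bound is tight at $q = \tau = 1/2$, where the Taylor expansion of Lemma 3 is exact.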

5.4 Finishing Up: Solution of Recurrences

Let us take stock of the lower bounds on $G_t - G_{t+1}$ that we have proven, and assume that $q$ is small for simplicity. Recall that we have analyzed the local drop to $G_t$, so to compute the global drop we must reintroduce $w = w(\ell)$. For $G(q) = 4q(1-q)$, Lemma 5 shows that $G_t - G_{t+1}$ is on the order of $\gamma^2 w q^2$. For $G(q) = H(q)$, Lemma 6 shows that $G_t - G_{t+1}$ is on the order of $\gamma^2 w q$. Ignoring the dependence on $w$ and $\gamma$, neither of these drops is on the order of $G(q)$ itself: both drops are considerably smaller than the amount of remaining potential. For $G(q) = 2\sqrt{q(1-q)}$, Lemma 7 shows that $G_t - G_{t+1}$ is on the order of $\gamma^2 w\sqrt{q}$. This is the only case in which the drop is on the order of $G(q)$. We will now see the strong effect that these drops have on our final bounds. We begin with the Gini criterion $G(q) = 4q(1-q)$.

Theorem 8. Let $G(q) = 4q(1-q)$, and let $\epsilon_t$ and $G_t$ be as defined in Equations (2) and (3). Then under the Weak Hypothesis Assumption, for any $\epsilon \in [0,1]$, to obtain $\epsilon_t \le \epsilon$ it suffices to make
\[
t \ge 2^{c/(\gamma^2\epsilon^2)} \tag{36}
\]
splits, for some constant $c$.

Proof: At round $t$, there must be a leaf $\ell$ such that (1) $w(\ell) \ge \epsilon_t/2t$ and (2) $\min\{q(\ell), 1-q(\ell)\} \ge \epsilon_t/2$: the leaves violating condition (1) have total weight at most $\epsilon_t/2$ and so contribute at most $\epsilon_t/2$ to the error, and if all the remaining leaves violated condition (2) they would contribute at most $\epsilon_t/2$ as well, contradicting the fact that the total error is $\epsilon_t$. By Lemma 5, splitting at $\ell$ using the witness for the Weak Hypothesis Assumption on the balanced distribution at $\ell$ will result in a reduction to $G_t$ of at least
\[
w(\ell)\cdot 16\gamma^2 q(\ell)^2(1-q(\ell))^2 \ge 4\gamma^2 w(\ell)\min\{q(\ell), 1-q(\ell)\}^2 \tag{37}
\]
\[
\ge 4\gamma^2\,\frac{\epsilon_t}{2t}\left(\frac{\epsilon_t}{2}\right)^2 = \frac{\gamma^2\epsilon_t^3}{2t} \tag{38}
\]
\[
\ge \frac{\gamma^2 G_t^3}{128t}. \tag{39}
\]
Here we have used the fact that for $G(q) = 4q(1-q)$, $\epsilon_t \ge G_t/4$. In other words, we have the recurrence inequality
\[
G_{t+1} \le G_t - \frac{\gamma^2 G_t^3}{128t}. \tag{40}
\]
From this it is not difficult to verify that $t \ge 2^{c/(\gamma^2\epsilon^2)}$, for some constant $c$, suffices to obtain $\epsilon_t \le G_t \le \epsilon$. □ (Theorem 8)
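To see how recurrence (40) is solved, one can check (numerically, in this sketch of ours; the supersolution $f(t) = (8/\gamma)/\sqrt{\ln t}$ and its constant are our choices for illustration, not the paper's) that $f$ satisfies $f(t+1) \ge f(t) - \gamma^2 f(t)^3/(128t)$, so any $G_t$ obeying (40) that starts below $f$ stays below it, and $f(t) \le \epsilon$ exactly when $\ln t \ge 64/(\gamma^2\epsilon^2)$:

```python
import math

gamma = 0.3
f = lambda t: (8 / gamma) / math.sqrt(math.log(t))

# Supersolution check: f(t+1) >= f(t) - gamma^2 * f(t)^3 / (128 t).
for t in range(2, 100000):
    assert f(t + 1) >= f(t) - gamma**2 * f(t) ** 3 / (128 * t)

# f crosses eps at ln t = 64/(gamma^2 eps^2), i.e. after 2^(c/(gamma^2 eps^2)) splits:
eps = 0.5
log_t_star = 64 / (gamma**2 * eps**2)
assert abs((8 / gamma) / math.sqrt(log_t_star) - eps) < 1e-9
```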

Theorem 9. Let $G(q) = H(q)$, and let $\epsilon_t$ and $G_t$ be as defined in Equations (2) and (3). Then under the Weak Hypothesis Assumption, for any $\epsilon \in [0,1]$, to obtain $\epsilon_t \le \epsilon$ it suffices to make
\[
t \ge \left(\frac{1}{\epsilon}\right)^{c\log(1/\epsilon)/\gamma^2} \tag{41}
\]
splits, for some constant $c$.

Proof: By Equation (2), after $t$ splits there must be a leaf $\ell$ such that $w(\ell)\min\{q(\ell), 1-q(\ell)\} \ge \epsilon_t/t$. If we now split at $\ell$ using the witness for the Weak Hypothesis Assumption on the balanced distribution at $\ell$, then by Lemma 6, we know that the resulting reduction to $G_t$ will be at least $\gamma^2 w(\ell)q(\ell)(1-q(\ell)) \ge \gamma^2 w(\ell)\min\{q(\ell), 1-q(\ell)\}/2 \ge \gamma^2\epsilon_t/2t$. In other words, we may write
\[
G_{t+1} \le G_t - \frac{\gamma^2\epsilon_t}{2t}. \tag{42}
\]
Now for the choice $G(q) = H(q)$, $\epsilon_t \le G_t \le H(\epsilon_t)$ by Jensen's Inequality, and thus $H^{-1}(G_t) \le \epsilon_t$, where $H^{-1}(y)$ is defined to be the unique $x \in [0, 1/2]$ such that $H(x) = y$. It can be shown that $H^{-1}(y) \ge y/(2\log(2/y))$ for all $y \in [0,1]$. Thus we have
\[
G_{t+1} \le G_t - \frac{\gamma^2 H^{-1}(G_t)}{2t} \tag{43}
\]
\[
\le G_t - \frac{\gamma^2 G_t}{4t\log(2/G_t)}. \tag{44}
\]
It can be verified that $G_t \le e^{-\gamma\sqrt{\log t}/c}$ is a solution to this recurrence inequality, as desired. Thus to obtain $\epsilon_t \le G_t \le \epsilon$, it suffices to have $e^{-\gamma\sqrt{\log t}/c} \le \epsilon$, which requires $t \ge (1/\epsilon)^{c\log(1/\epsilon)/\gamma^2}$. □ (Theorem 9)
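The entropy inversion step can be sanity-checked numerically. The sketch below is ours, and it verifies a bound of the same form with the deliberately conservative constant $1/4$ (an assumption chosen to be safely checkable, not the paper's exact constant): it confirms $H(y/(4\log_2(2/y))) \le y$, which suffices because $H$ is increasing on $[0, 1/2]$, and hence $H^{-1}(y) \ge y/(4\log_2(2/y))$:

```python
import math

def H(x):
    # Binary entropy, with H(0) = H(1) = 0.
    if x <= 0 or x >= 1:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for i in range(1, 1001):
    y = i / 1000
    x = y / (4 * math.log2(2 / y))
    assert x <= 0.5
    assert H(x) <= y + 1e-12   # hence H^{-1}(y) >= y / (4 log2(2/y))
```

Any such inverse-entropy bound of the form $H^{-1}(y) \ge c\,y/\log(2/y)$ changes only the constant in the denominator of (44), and therefore only the constant $c$ in the final bound (41).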

t t

p

q 1 q , and let  and G be as de ned in Equations 2 and 3. Then under Theorem 10 Let Gq =2

t t

the Weak Hypothesis Assumption, for any  2 [0; 1], to obtain    it suces to make

t

2

32=

1

45 t 



splits.

1=2

Pro of:By the de nition of G , there must exist a leaf ` such that 2w `q `1 q `  G =t.By

t t

Lemma 7, if we split at ` using the witness h to the Weak Hyp othesis Assumption on the balanced distribution

at `, then the resulting drop to G will b e at least

t

2 2 2

1 2q ` G q `1 q `G

t t

+ : 46

4t t

2

Nowifq ` 2 [1=4; 3=4] then the second term ab ove is at least 3 G =16, and if q ` 2= [1=4; 3=4] then the

t

2

rst term ab ove is at least G =16. Thus we obtain the recurrence inequality

t

2 2

G

t

= 1 G : 47 G  G

t t+1 t

16t 16t 12

Thus wemay write

2 2 2 2

G  1 1 1  1 48

t

16 2  16 3  16 t  16

k t

4 2 2 2

2 2 2 2

Y Y Y Y

   : 49 1 1 1 1

0 0 0 0

16t 16t 16t 16t

0 0

0 t

0 k

t =3 t =1

t =2 =2+1

t =2 =2+1

Now for any k ,

k k k

2 2

2 =2

2 2 2

Y Y

2

=32

1 1  = 1  e : 50

0 k k

16t 16  2 16  2

0 k 0 k

t =2 =2+1 t =2 =2+1

Thus wehave

2

logt=32

G  e 51

t

as desired. 2Lemma 10
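The recurrence (47) and its closed form can be checked directly. This sketch is ours; note that the block estimate (50) is just $(1-a)^n \le e^{-an}$ with $an = \gamma^2/32$:

```python
import math

gamma = 0.4
G = 1.0
for t in range(1, 100000):
    G *= 1 - gamma**2 / (16 * t)   # worst case of the recurrence (47), starting from G_1 = 1
    # closed-form bound (51), with logarithm base 2:
    assert G <= math.exp(-gamma**2 * math.log2(t + 1) / 32) + 1e-12

# Consequently G_t <= eps once t >= (1/eps)^(32/gamma^2), as in Theorem 10.
```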

Acknowledgements

Thanks to Tom Dietterich, Yoav Freund and Rob Schapire for discussions of the material presented here.

References

[1] Machine Learning: Proceedings of the International Conference. Morgan Kaufmann, 1982-1995.

[2] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the 26th ACM Symposium on the Theory of Computing. ACM Press, New York, NY, 1994.

[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[4] N. Bshouty and Y. Mansour. Simple learning algorithms for decision trees and multivariate polynomials. In Proceedings of the 36th IEEE Symposium on the Foundations of Computer Science, pages 304-311. IEEE Computer Society Press, Los Alamitos, CA, 1995.

[5] N. H. Bshouty. Exact learning via the monotone theory. In Proceedings of the 34th IEEE Symposium on the Foundations of Computer Science, pages 302-311. IEEE Computer Society Press, Los Alamitos, CA, 1993.

[6] W. Buntine and T. Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8:75-86, 1992.

[7] T. Dietterich, M. Kearns, and Y. Mansour. Applying the weak learning framework to understand and improve C4.5. In Machine Learning: Proceedings of the International Conference. Morgan Kaufmann, 1996.

[8] Y. Freund and R. Schapire. Some experiments with a new boosting algorithm. In Machine Learning: Proceedings of the International Conference. Morgan Kaufmann, 1996.

[9] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, September 1995.

[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Second European Conference on Computational Learning Theory, pages 23-37. Springer-Verlag, 1995.

[11] J. Jackson. An efficient membership query algorithm for learning DNF with respect to the uniform distribution. In Proceedings of the 35th IEEE Symposium on the Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 1994.

[12] M. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464-497, 1994.

[13] M. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, 1994.

[14] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. In Proceedings of the 23rd ACM Symposium on Theory of Computing, pages 455-464. ACM Press, New York, NY, 1991.

[15] J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319-342, 1989.

[16] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[17] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

[18] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

Appendix

Lemmas 11, 12 and 13 are needed for the proof of Lemma 4, which shows that the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ lies in the range $[0.4, 0.6]$.

Lemma 11. Let $G(q) = 4q(1-q)$, and let $\delta = \gamma q(1-q)/(\tau(1-\tau))$. Then for $\tau = 1/2$,
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) = G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta). \tag{52}
\]

Proof: For this choice of $G(q)$, we have $G'(q) = 4 - 8q$. Thus we may write
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) = 4(q-\tau\delta)(1-q+\tau\delta) + \tau\delta(4 - 8(q-\tau\delta)) \tag{53}
\]
\[
= 4(q - q^2 + 2q\tau\delta - \tau\delta - \tau^2\delta^2) + 4\tau\delta - 8q\tau\delta + 8\tau^2\delta^2 \tag{54}
\]
\[
= 4q(1-q) + 4\tau^2\delta^2. \tag{55}
\]
Similarly, we obtain
\[
G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta) = 4q(1-q) + 4(1-\tau)^2\delta^2. \tag{56}
\]
Setting
\[
4q(1-q) + 4\tau^2\delta^2 = 4q(1-q) + 4(1-\tau)^2\delta^2 \tag{57}
\]
yields $\tau^2 = (1-\tau)^2$, or $\tau = 1/2$ as desired. □ (Lemma 11)

Lemma 12. Let $G(q) = H(q)$, and let $\delta = \gamma q(1-q)/(\tau(1-\tau))$. If $\gamma < 9/35$, then for $\tau \le 0.4$,
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \le G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta) \tag{58}
\]
and for $\tau \ge 0.6$,
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \ge G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta). \tag{59}
\]

Proof: For the choice $G(q) = H(q) = -q\log q - (1-q)\log(1-q)$, we have $G'(q) = \log(1-q) - \log q$. Note that
\[
q - \tau\delta = q - \frac{\gamma q(1-q)}{1-\tau} = q\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) \tag{60, 61}
\]
and
\[
1 - q + \tau\delta = 1 - q + \frac{\gamma q(1-q)}{1-\tau} = (1-q)\left(1 + \frac{\gamma q}{1-\tau}\right). \tag{62, 63}
\]
Thus, setting
\[
F(q, \psi) = G(q-\psi) + \psi\,G'(q-\psi) \tag{64}
\]
we obtain
\[
F(q, \tau\delta) = H(q-\tau\delta) + \tau\delta\bigl(\log(1-q+\tau\delta) - \log(q-\tau\delta)\bigr) \tag{65}
\]
\[
= -q\log(q-\tau\delta) - (1-q)\log(1-q+\tau\delta) \tag{66, 67}
\]
\[
= H(q) - q\log\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) - (1-q)\log\left(1 + \frac{\gamma q}{1-\tau}\right). \tag{68}
\]
Similarly,
\[
F(q, -(1-\tau)\delta) = H(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)
= H(q) - q\log\left(1 + \frac{\gamma(1-q)}{\tau}\right) - (1-q)\log\left(1 - \frac{\gamma q}{\tau}\right). \tag{69, 70}
\]
Using the Taylor expansion approximation $x - x^2/2 + x^3/3 > \ln(1+x) > x - x^2/2$ for the logarithmic terms in Equation (70) gives, up to second-order terms,
\[
F(q, -(1-\tau)\delta) \approx H(q) - q\left(\frac{\gamma(1-q)}{\tau} - \frac{\gamma^2(1-q)^2}{2\tau^2}\right) + (1-q)\left(\frac{\gamma q}{\tau} + \frac{\gamma^2 q^2}{2\tau^2}\right) \tag{71}
\]
\[
= H(q) + \frac{\gamma^2 q(1-q)}{2\tau^2}. \tag{72}
\]
Similarly,
\[
F(q, \tau\delta) \approx H(q) + \frac{\gamma^2 q(1-q)}{2(1-\tau)^2}. \tag{73}
\]
Using the approximations given by Equations (72) and (73), we find that setting $F(q, \tau\delta) = F(q, -(1-\tau)\delta)$ yields $\tau^2 = (1-\tau)^2$, or $\tau = 1/2$. However, this is not the exact value for $\tau$, since we ignored the third-order error terms in the Taylor expansions above.

We would like to show that if $\tau = 0.4$, then $F(q, \tau\delta) < F(q, -(1-\tau)\delta)$. Keeping the third-order error terms, we may write
\[
F(q, \tau\delta) < H(q) + \frac{\gamma^2 q(1-q)}{2(1-\tau)^2} + \frac{1}{3}\,q\,\frac{\gamma^3(1-q)^3}{(1-\tau)^3} \tag{74}
\]
and
\[
F(q, -(1-\tau)\delta) > H(q) + \frac{\gamma^2 q(1-q)}{2\tau^2} - \frac{1}{3}\,q\,\frac{\gamma^3(1-q)^3}{\tau^3}. \tag{75}
\]
Therefore, it suffices to show that
\[
\frac{\gamma^2 q(1-q)}{2(1-\tau)^2} + \frac{\gamma^3 q(1-q)^3}{3(1-\tau)^3} < \frac{\gamma^2 q(1-q)}{2\tau^2} - \frac{\gamma^3 q(1-q)^3}{3\tau^3}. \tag{76}
\]
Simplifying, we have
\[
\frac{1}{2(1-\tau)^2} + \frac{\gamma(1-q)^2}{3(1-\tau)^3} < \frac{1}{2\tau^2} - \frac{\gamma(1-q)^2}{3\tau^3}. \tag{77}
\]
For $\tau = 0.4$, since $1-q < 1$, a sufficient condition is $\gamma < 9/35$. For the case $\tau = 0.6$, we write
\[
F(q, -(1-\tau)\delta) < H(q) + \frac{\gamma^2 q(1-q)}{2\tau^2} + \frac{1}{3}\,(1-q)\,\frac{\gamma^3 q^3}{\tau^3} \tag{78}
\]
and
\[
F(q, \tau\delta) > H(q) + \frac{\gamma^2 q(1-q)}{2(1-\tau)^2} - \frac{1}{3}\,(1-q)\,\frac{\gamma^3 q^3}{(1-\tau)^3}. \tag{79}
\]
Again, a sufficient condition for $F(q, \tau\delta) > F(q, -(1-\tau)\delta)$ is then
\[
\frac{\gamma^2 q(1-q)}{2\tau^2} + \frac{\gamma^3 q^3(1-q)}{3\tau^3} < \frac{\gamma^2 q(1-q)}{2(1-\tau)^2} - \frac{\gamma^3 q^3(1-q)}{3(1-\tau)^3}. \tag{80}
\]
This simplifies to
\[
\frac{1}{2\tau^2} + \frac{\gamma q^2}{3\tau^3} < \frac{1}{2(1-\tau)^2} - \frac{\gamma q^2}{3(1-\tau)^3}. \tag{81}
\]
For $\tau = 0.6$, this inequality holds for $\gamma < 9/35$. □ (Lemma 12)

Lemma 13. Let $G(q) = 2\sqrt{q(1-q)}$, and let $\delta = \gamma q(1-q)/(\tau(1-\tau))$. If $\gamma < 0.2$ and $q$ is sufficiently small, then for $\tau \le 0.4$,
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \le G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta) \tag{82}
\]
and for $\tau \ge 0.6$,
\[
G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) \ge G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta). \tag{83}
\]

Proof: For this choice of $G(q)$, we have $G'(q) = (1-2q)/\sqrt{q(1-q)}$. To simplify notation a bit, let us define
\[
F(q, \psi) = G(q-\psi) + \psi\,G'(q-\psi). \tag{84}
\]
We will be interested in the choices $\psi = \tau\delta$ and $\psi = -(1-\tau)\delta$. First, however, we may write
\[
F(q, \psi) = 2\sqrt{(q-\psi)(1-q+\psi)} + \frac{\psi(1 - 2(q-\psi))}{\sqrt{(q-\psi)(1-q+\psi)}} \tag{85}
\]
\[
= \frac{2(q-\psi)(1-q+\psi) + \psi(1 - 2(q-\psi))}{\sqrt{(q-\psi)(1-q+\psi)}} \tag{86, 87}
\]
\[
= \frac{2q(1-q) - \psi(1-2q)}{\sqrt{(q-\psi)(1-q+\psi)}}. \tag{88}
\]
Now for $\psi = \tau\delta = \gamma q(1-q)/(1-\tau)$, we may write for the denominator of $F(q, \tau\delta)$:
\[
\sqrt{(q-\tau\delta)(1-q+\tau\delta)} = \sqrt{q\left(1 - \frac{\gamma(1-q)}{1-\tau}\right)(1-q)\left(1 + \frac{\gamma q}{1-\tau}\right)} \tag{89, 90}
\]
\[
= \sqrt{q(1-q)}\,\sqrt{\left(1 - \frac{\gamma(1-q)}{1-\tau}\right)\left(1 + \frac{\gamma q}{1-\tau}\right)}. \tag{91}
\]
For the numerator of $F(q, \tau\delta)$, we have:
\[
2q(1-q) - \tau\delta(1-2q) = 2q(1-q) - \frac{\gamma q(1-q)(1-2q)}{1-\tau} \tag{92, 93}
\]
\[
= 2q(1-q)\left(1 - \frac{\gamma(1-2q)}{2(1-\tau)}\right). \tag{94}
\]
By Equations (91) and (94), we have
\[
F(q, \tau\delta) = \frac{2\sqrt{q(1-q)}\left(1 - \frac{\gamma(1-2q)}{2(1-\tau)}\right)}{\sqrt{\left(1 - \frac{\gamma(1-q)}{1-\tau}\right)\left(1 + \frac{\gamma q}{1-\tau}\right)}} \tag{95}
\]
or
\[
\frac{F(q, \tau\delta)}{G(q)} = \frac{1 - \frac{\gamma(1-2q)}{2(1-\tau)}}{\sqrt{\left(1 - \frac{\gamma(1-q)}{1-\tau}\right)\left(1 + \frac{\gamma q}{1-\tau}\right)}}. \tag{96}
\]
Thus, we find that as $q \to 0$,
\[
\frac{F(q, \tau\delta)}{G(q)} \to \frac{1 - \frac{\gamma}{2(1-\tau)}}{\sqrt{1 - \frac{\gamma}{1-\tau}}}. \tag{97}
\]
Similar calculations yield that as $q \to 0$,
\[
\frac{F(q, -(1-\tau)\delta)}{G(q)} \to \frac{1 + \frac{\gamma}{2\tau}}{\sqrt{1 + \frac{\gamma}{\tau}}}. \tag{98}
\]
Now to complete the proof of the statement of the lemma for $\tau \le 0.4$, we need to show:
\[
\frac{1 - 5\gamma/6}{\sqrt{1 - 5\gamma/3}} < \frac{1 + 5\gamma/4}{\sqrt{1 + 5\gamma/2}}. \tag{99}
\]
Squaring both sides we obtain
\[
\frac{1 - 5\gamma/3 + 25\gamma^2/36}{1 - 5\gamma/3} < \frac{1 + 5\gamma/2 + 25\gamma^2/16}{1 + 5\gamma/2} \tag{100}
\]
or
\[
\frac{25\gamma^2/36}{1 - 5\gamma/3} < \frac{25\gamma^2/16}{1 + 5\gamma/2}. \tag{101}
\]
This last inequality holds for $\gamma < 0.2$, as does the reverse inequality obtained for the value $\tau \ge 0.6$. □ (Lemma 13)

Figure 2: Plots of the three splitting criteria $G$ that we examine, from top to bottom: $G(q) = 2\sqrt{q(1-q)}$, $G(q) = H(q)$ and $G(q) = 4q(1-q)$. The bottom plot is $2\min(q, 1-q)$, the splitting criterion corresponding to direct minimization of the local error.

Figure 3: Same as Figure 2, but for small $q$. Notice that the choice $G(q) = 2\sqrt{q(1-q)}$ (top plot) enjoys strong concavity in this regime of $q$, while $G(q) = H(q)$ and $G(q) = 4q(1-q)$ (second and third from top) are nearly linear in this regime.

Figure 4: A typical split labeled by the function $h$, showing the split parameters used in the analysis: the leaf has weight $w$ and satisfies $\Pr[f=1] = q$; the child reached when $h = 0$ has probability $1-\tau$ and $\Pr[f=1] = p$, while the child reached when $h = 1$ has probability $\tau$ and $\Pr[f=1] = r$.

Figure 5: Effects of the concavity of $G$ on the local drop to $G_t$. Here $q = (1-\tau)p + \tau r$, $0 \le \tau \le 1$ (so $q = p$ at $\tau = 0$ and $q = r$ at $\tau = 1$). The local drop to $G_t$ is equal to the length of the vertical line segment $L$ between $G(q)$ and the chord joining $G(p)$ and $G(r)$. Intuitively, for fixed $p$, $q$ and $r$, the greater the concavity of $G$, the greater the drop. To show that this drop is significant, we must bound the three points away from each other; that is, lower bound $\delta = r - p$ and bound $\tau$ away from 0 and 1.

Figure 6: Illustration of the proof of Lemma 2. In going from the original distribution $P_\ell$ to the balanced distribution $P'_\ell$, since in this figure $q = \Pr_{P_\ell}[f(x)=1] > 1/2$, inputs $x$ such that $f(x)=1$ have their probabilities "contracted" (scaling $1/2q$) and $x$ such that $f(x)=0$ are "expanded" (scaling $1/(2(1-q))$). We compute the error of the Weak Hypothesis Assumption witness $h$ for the balanced distribution $P'_\ell$ by computing the weight of the error regions under $P_\ell$ and applying the contraction and expansion operations.

Figure 7: Illustration of the proof of Lemma 4, showing the tangent lines to $G$ at $p$ and at $r$. Here $q = (1-\tau)p + \tau r$, and the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ occurs when the intersection of the tangent line at $G(p)$ with the vertical line at $q$ (point $A$) coincides with the intersection of the tangent line at $G(r)$ with the vertical line at $q$ (point $B$). In the figure, these points do not coincide: point $A$ still lies below point $B$, and thus $q$ must be moved away from $p$, to lie directly below the desired tangent intersection point $C$. Thus, we must increase $\tau$ to reach the minimizing value.

Figure 8: For the proof of Lemma 4. Plot of $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) - [G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)]$ for $G(q) = 4q(1-q)$, with $\gamma = 0.1$ and $\tau = 0.4$. For this value of $\tau$, we see that the function is negative for all values of $q$, indicating that the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ is larger than 0.4. Formal proof of this is given by Lemma 11.

Figure 9: For the proof of Lemma 4. Plot of $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) - [G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)]$ for $G(q) = 4q(1-q)$, with $\gamma = 0.1$ and $\tau = 0.6$. For this value of $\tau$, we see that the function is positive for all values of $q$, indicating that the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ is smaller than 0.6. Formal proof of this is given by Lemma 11.

given by Lemma 11. 21 G(q) = H(q), gamma = 0.1, tau = 0.4 q 0 0.2 0.4 0.6 0.8 1 0

-0.001

-0.002

-0.003

-0.004

-0.005

-0.006

0 0

Figure 10: For the pro of of Theorem 4. Plot of Gq +  G q  Gq +1    1    G q +1   

for Gq =H q , with =0:1 and  =0:4. For this value of  ,we see that the function is negative for all values of

q , indicating that the minimizing value of  for the function  q;   is larger than 0:4. Formal pro of of this is given

G

by Lemma 12.

Figure 11: For the proof of Lemma 4. Plot of $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) - [G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)]$ for $G(q) = H(q)$, with $\gamma = 0.1$ and $\tau = 0.6$. For this value of $\tau$, we see that the function is positive for all values of $q$, indicating that the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ is smaller than 0.6. Formal proof of this is given by Lemma 12.

by Lemma 12. 22 G(q) = 2*sqrt(q(1-q)), gamma = 0.1, tau = 0.4

-0.001

-0.002

-0.003

-0.004

-0.005 0 0.2 0.4 0.6 0.8 1

q

0 0

Figure 12: For the pro of of Theorem 4. Plot of Gq +  G q  Gq +1    1    G q +1   

p

for Gq =2 q 1 q , with =0:1 and  =0:4. For this value of  ,we see that the function is negative for all

values of q , indicating that the minimizin g value of  for the function  q;   is larger than 0:4. Formal pro of of

G

this is given by Lemma 13.

Figure 13: For the proof of Lemma 4. Plot of $G(q-\tau\delta) + \tau\delta\,G'(q-\tau\delta) - [G(q+(1-\tau)\delta) - (1-\tau)\delta\,G'(q+(1-\tau)\delta)]$ for $G(q) = 2\sqrt{q(1-q)}$, with $\gamma = 0.1$ and $\tau = 0.6$. For this value of $\tau$, we see that the function is positive for all values of $q$, indicating that the minimizing value of $\tau$ for the function $\tilde{\Delta}_G(q, \tau)$ is smaller than 0.6. Formal proof of this is given by Lemma 13.