On the Boosting Ability of Top-Down Decision Tree Learning Algorithms

Michael Kearns                Yishay Mansour
AT&T Research                 Tel-Aviv University

May 1996
Abstract
We analyze the performance of top-down algorithms for decision tree learning, such as those employed by the widely used C4.5 and CART software packages. Our main result is a proof that such algorithms are boosting algorithms. By this we mean that if the functions that label the internal nodes of the decision tree can weakly approximate the unknown target function, then the top-down algorithms we study will amplify this weak advantage to build a tree achieving any desired level of accuracy. The bounds we obtain for this amplification show an interesting dependence on the splitting criterion used by the top-down algorithm. More precisely, if the functions used to label the internal nodes have error $1/2 - \gamma$ as approximations to the target function, then for the splitting criteria used by CART and C4.5, trees of size $(1/\epsilon)^{O(1/\gamma^2\epsilon^2)}$ and $(1/\epsilon)^{O(\log(1/\epsilon)/\gamma^2)}$ (respectively) suffice to drive the error below $\epsilon$. Thus, for example, a small constant advantage over random guessing is amplified to any larger constant advantage with trees of constant size. For a new splitting criterion suggested by our analysis, the much stronger bound of $(1/\epsilon)^{O(1/\gamma^2)}$, which is polynomial in $1/\epsilon$, is obtained, and is provably optimal for decision tree algorithms. The differing bounds have a natural explanation in terms of concavity properties of the splitting criterion.

The primary contribution of this work is in proving that some popular and empirically successful heuristics that are based on first principles meet the criteria of an independently motivated theoretical model.
A preliminary version of this paper appears in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459-468, ACM Press, 1996. Authors' addresses: M. Kearns, AT&T Research, 600 Mountain Avenue, Room 2A-423, Murray Hill, New Jersey 07974; electronic mail [email protected]. Y. Mansour, Department of Computer Science, Tel Aviv University, Tel Aviv, Israel; electronic mail [email protected]. Y. Mansour was supported in part by the Israel Science Foundation, administered by the Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology.
1 Introduction
In experimental and applied machine learning work, it is hard to exaggerate the influence of top-down heuristics for building a decision tree from labeled sample data. These algorithms grow a tree from the root to the leaves by repeatedly replacing an existing leaf with an internal node, thus "splitting" the data reaching this leaf into two new leaves, and reducing the empirical error on the given sample. The tremendous popularity of such programs (which include the C4.5 and CART software packages [16, 3]) is due to their efficiency and simplicity, the advantages of using decision trees (such as potential interpretability to humans), and of course, to their success in generating trees with good generalization ability (that is, performance on new data). Dozens of papers describing experiments and applications involving top-down decision tree learning algorithms appear in the machine learning literature each year [1].
There has been difficulty in finding natural theoretical models that provide a precise yet useful language in which to discuss the performance of top-down decision tree learning heuristics, to compare variants of these heuristics, and to compare them to learning algorithms that use entirely different representations and approaches. For example, even if we make the rather favorable assumption that there is a small decision tree labeling the data, and that the inputs are distributed uniformly (thus, we are in the well-known Probably Approximately Correct or PAC model [18], with the additional restriction to the uniform distribution), the problem of finding any efficient algorithm with provably nontrivial performance remains an elusive open problem in computational learning theory. Furthermore, superpolynomial lower bounds for this same problem have been proven for a wide class of algorithms that includes the top-down decision tree approach, and also all variants of this approach that have been proposed to date [2]. The positive results for efficient decision tree learning in computational learning theory all make extensive use of membership queries [14, 5, 4, 11], which provide the learning algorithm with black-box access to the target function (experimentation), rather than only an oracle for random examples. Clearly, the need for membership queries severely limits the potential application of such algorithms, and they seem unlikely to encroach on the popularity of the top-down decision tree algorithms without significant new ideas. In summary, it seems fair to say that despite their other successes, the models of computational learning theory have not yet provided significant insight into the apparent empirical success of programs like C4.5 and CART.
In this paper, we attempt to remedy this state of affairs by analyzing top-down decision tree learning algorithms in the model of weak learning [17, 10, 9]. In the language of weak learning, we prove here that the standard top-down decision tree algorithms are in fact boosting algorithms. By this we mean that if we make a favorable (and apparently necessary) assumption about the relationship between the class of functions labeling the decision tree nodes and the unknown target function (namely, that these node functions perform slightly better than random guessing as approximations to the target function), then the top-down algorithms will eventually drive the error below any desired value. In other words, the top-down algorithms will amplify slight advantages over random guessing into arbitrarily accurate hypotheses. Our results imply, for instance, that if there is always a node function whose error with respect to the target function (on an appropriate distribution) is a constant strictly less than $1/2$ (say, 49% error), then the top-down algorithms will drive the error below any desired constant (for instance, 1% error) using only constant tree size. For certain top-down algorithms, the error can be driven below $\epsilon$ using a tree whose size is polynomial in $1/\epsilon$. (See Theorem 1 for a precise statement.) Thus, even though the top-down heuristics and the notion of weak learning were developed independently, we prove that these heuristics do in fact achieve nontrivial performance in the weak learning model. This is perhaps the most significant aspect of our results: the proof that some popular and empirically successful algorithms that are based on first principles meet the criteria of an independently motivated theoretical model.
An added benefit of our ability to analyze the top-down algorithms in a standard theoretical model is that it allows comparisons between variants. In particular, the choice of the splitting criterion function $G$ used by the top-down algorithm has a profound effect on the resulting bound. We will define the splitting criterion shortly, but intuitively its role is to assign values to potential splits, and thus it implicitly determines which leaf will be split, and which function will be used to label the split. Our analysis yields radically different bounds for the choices made for $G$ by C4.5 and CART, and indicates that both may be inferior to a new choice for $G$ suggested by our analysis. Preliminary experiments supporting this view are reported in our recent follow-up paper [7]. In addition to providing a nontrivial analysis of the performance of top-down decision tree learning, the proof of our results gives a number of specific technical insights into the success and limitations of these heuristics. The weak learning framework also allows comparison between the top-down heuristics and the previous boosting algorithms designed especially for the weak learning model. Perhaps not surprisingly, the bounds we obtain for the top-down algorithms are in some sense significantly inferior to those obtained for other boosting algorithms.^1 However, as we discuss in Section 3, this is an unavoidable consequence of the fact that we are using decision trees, and is not due to any algorithmic properties of the top-down approach.
2 Top-Down Decision Tree Learning Algorithms
We study a class of learning algorithms that is parameterized by two quantities: the node function class $F$ and the splitting criterion $G$. Here $F$ is a class of boolean functions with input domain $X$, and $G : [0,1] \to [0,1]$ is a function having three properties:

- $G$ is symmetric about $1/2$. Thus, $G(x) = G(1-x)$ for any $x \in [0,1]$.
- $G(1/2) = 1$, and $G(0) = G(1) = 0$.
- $G$ is concave.

We will call such a function $G$ a permissible splitting criterion. The binary entropy

$$G(q) = H(q) = -q\log(q) - (1-q)\log(1-q) \quad (1)$$

is a typical example of a permissible splitting criterion. We will soon describe the roles of $F$ and $G$ precisely, but intuitively the learning algorithms will build decision trees in which the internal nodes are labeled by functions in $F$, and the splitting criterion $G$ will be used by the learning algorithm to determine which leaf should be split next, and which function $h \in F$ to use for the split.
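As a quick illustration (ours, not the paper's), the following Python sketch implements the entropy criterion of equation (1) and numerically spot-checks the three properties above on a grid; the identifiers are hypothetical, and the grid check is a sanity test rather than a proof.

```python
import math

def entropy_criterion(q: float) -> float:
    """Binary entropy G(q) = H(q) = -q log2(q) - (1-q) log2(1-q)."""
    if q in (0.0, 1.0):
        return 0.0  # H(0) = H(1) = 0 by convention
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

grid = [i / 100 for i in range(101)]
# Symmetry about 1/2: G(q) = G(1 - q).
assert all(abs(entropy_criterion(q) - entropy_criterion(1 - q)) < 1e-12 for q in grid)
# Endpoint and midpoint values: G(1/2) = 1, G(0) = G(1) = 0.
assert entropy_criterion(0.5) == 1.0 and entropy_criterion(0.0) == entropy_criterion(1.0) == 0.0
# Concavity: the midpoint value dominates the average of the endpoint values.
assert all(entropy_criterion((a + b) / 2) >= (entropy_criterion(a) + entropy_criterion(b)) / 2 - 1e-12
           for a in grid for b in grid)
```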
Throughout the paper, we assume a fixed target function $f$ with domain $X$, and a fixed target distribution $P$ over $X$. Together $f$ and $P$ induce a distribution on labeled examples $\langle x, f(x) \rangle$. We will assume that the algorithms we study can exactly compute some rather simple probabilities with respect to this distribution on labeled examples. If this distribution can instead only be sampled, then our algorithms will approximate the probabilities accurately using only a polynomial amount of sampling, and via standard arguments the analysis will still hold.
Let $T$ be any decision tree whose internal nodes are labeled by functions in $F$, and whose leaves are labeled by values in $\{0,1\}$. We will use $\mathrm{leaves}(T)$ to denote the leaves of $T$. For each $\ell \in \mathrm{leaves}(T)$, we define $w(\ell)$ (the weight of $\ell$) to be the probability that a random $x$ drawn according to $P$ reaches leaf $\ell$ in $T$, and we define $q(\ell)$ to be the probability that $f(x) = 1$ given that $x$ reaches $\ell$. Let us assume that each leaf $\ell$ in $T$ is labeled 0 if $q(\ell) \leq 1/2$, and is labeled 1 otherwise (we refer to this labeling as the majority labeling). We define

$$\epsilon(T) = \sum_{\ell \in \mathrm{leaves}(T)} w(\ell)\,\min(q(\ell), 1 - q(\ell)). \quad (2)$$

Note that $\epsilon(T) = \Pr_P[T(x) \neq f(x)]$, so $\epsilon(T)$ is just the usual measure of the error of $T$ with respect to $f$ and $P$. Now if $G$ is the splitting criterion, we define

$$G(T) = \sum_{\ell \in \mathrm{leaves}(T)} w(\ell)\,G(q(\ell)). \quad (3)$$

It will be helpful to think of $G(q(\ell))$ as an approximation to $\min(q(\ell), 1 - q(\ell))$, and thus of $G(T)$ as an approximation to $\epsilon(T)$. In particular, if $G$ is a permissible splitting criterion, then we have $G(q) \geq \min(q, 1-q)$ for all $q \in [0,1]$, and thus $G(T) \geq \epsilon(T)$ for all $T$. The top-down algorithms we examine make local modifications to the current tree $T$ in an effort to reduce $G(T)$, and therefore hopefully reduce $\epsilon(T)$ as well.
^1 The recent paper of Dietterich et al. [7] gives detailed experimental comparisons of top-down decision tree algorithms and the best of the standard boosting methods; see also the recent paper of Freund and Schapire [8].
TopDown_{F,G}(t):
1  Initialize T to be the single-leaf tree, with binary label equal to the majority label of the sample.
2  while T has fewer than t internal nodes:
3      Δ_best ← 0.
4      for each pair (ℓ, h) ∈ leaves(T) × F:
5          Δ ← G(T) − G(T(ℓ, h)).
6          if Δ ≥ Δ_best then:
7              Δ_best ← Δ; ℓ_best ← ℓ; h_best ← h.
8      T ← T(ℓ_best, h_best).
9  Output T.

Figure 1: Algorithm TopDown_{F,G}(t).
We will need notation to describe these local changes. Thus, if $\ell \in \mathrm{leaves}(T)$ and $h \in F$, we use $T(\ell, h)$ to denote the tree that is the same as $T$, except that we now make a new internal node at $\ell$ and label this node by the function $h$. The newly created child leaves $\ell_0$ and $\ell_1$ (corresponding to the outcomes $h(x) = 0$ and $h(x) = 1$ at the new internal node) are labeled by their majority labels with respect to $f$ and $P$.
For node function class $F$ and splitting criterion $G$, we give pseudo-code for the algorithm $\mathrm{TopDown}_{F,G}(t)$ in Figure 1. This algorithm takes as input an integer $t \geq 0$, and outputs a decision tree $T$ with $t$ internal nodes. Note that line 5 is the only place in the code where we make use of the assumption that we can compute probabilities with respect to $P$ exactly, and as mentioned above, these probabilities are sufficiently simple that they can be replaced by a polynomial amount of sampling from an oracle for labeled examples via standard arguments (see Theorem 1). Two other aspects of the algorithm deserve mention here. First, although we have written the search for $(\ell_{best}, h_{best})$ explicitly as an exhaustive search in the for loop of line 4, clearly for certain $F$ a more direct minimization might be possible, and in fact necessary in the case that $F$ is uncountably infinite (for example, if $F$ were the class of all thresholded linear combinations of two input variables). Second, all of our results still hold (with appropriate quantification) if we can only find a pair $(\ell, h)$ that approximates $\Delta_{best}$. For example, if we have a heuristic method for always finding $(\ell, h)$ such that $\Delta = G(T) - G(T(\ell, h)) \geq \Delta_{best}/2$, then our main results hold without modification.
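To make Figure 1 concrete, here is a minimal runnable Python sketch of $\mathrm{TopDown}_{F,G}(t)$ for a finite input space, where the exact probabilities assumed on line 5 are obtained by summing an explicitly given distribution. This is our own illustrative rendering, not the authors' implementation; all identifiers (top_down, make_leaf, the dict-based tree) are assumptions of the sketch.

```python
import math

def H(q):
    """Binary entropy, the C4.5 splitting criterion."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def top_down(t, sample, F, G=H):
    """Sketch of TopDown_{F,G}(t) from Figure 1 over an explicit finite distribution.

    sample: list of (x, f(x), P(x)) triples, so the probabilities used on
            line 5 of Figure 1 are computed exactly by summation.
    F:      list of boolean node functions h(x) -> {0, 1}.
    """
    def make_leaf(examples):
        w = sum(p for _, _, p in examples)
        q = (sum(p for _, y, p in examples if y == 1) / w) if w > 0 else 0.0
        return {"examples": examples, "w": w, "q": q, "label": int(q > 0.5)}

    def pot(leaf):                      # leaf's contribution w(l) G(q(l)) to G(T)
        return leaf["w"] * G(leaf["q"])

    tree = make_leaf(sample)            # line 1: single majority-labeled leaf
    leaves, internal = [tree], 0
    while internal < t:                 # line 2
        best, best_delta = None, 0.0    # line 3
        for l in leaves:                # line 4: exhaustive search over (l, h)
            for h in F:
                kids = [make_leaf([e for e in l["examples"] if h(e[0]) == b])
                        for b in (0, 1)]
                if any(k["w"] == 0.0 for k in kids):
                    continue            # degenerate split: one side gets no weight
                delta = pot(l) - pot(kids[0]) - pot(kids[1])   # line 5
                if delta >= best_delta:                        # lines 6-7
                    best, best_delta = (l, h, kids), delta
        if best is None:
            break
        l, h, kids = best               # line 8: T <- T(l_best, h_best)
        l.update(h=h, children=kids)
        l.pop("label")
        leaves.remove(l)
        leaves += kids
        internal += 1
    return tree, leaves                 # line 9

# Toy run: f(x) = x0 AND x1 on the uniform distribution over {0,1}^2,
# with F the two coordinate projections.
sample = [((a, b), a & b, 0.25) for a in (0, 1) for b in (0, 1)]
F = [lambda x: x[0], lambda x: x[1]]
tree, leaves = top_down(2, sample, F)
error = sum(l["w"] * min(l["q"], 1 - l["q"]) for l in leaves)   # eps(T), Eq. (2)
print(f"eps(T) = {error:.3f}")   # t = 2 internal nodes drive the error to 0
```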
Our primary interest will be in the error $\epsilon_t$ of the tree $T$ output by $\mathrm{TopDown}_{F,G}(t)$ as a function of $t$. To emphasize the dependence on $t$, for fixed $F$ and $G$ we use $\epsilon_t$ to denote $\epsilon(T)$ and $G_t$ to denote $G(T)$. We think of $G_t$ as a "potential function" that hopefully decreases with each new split.
Algorithms similar to $\mathrm{TopDown}_{F,G}(t)$ are in widespread use in both applications and the experimental machine learning community [1, 16, 3]. There are of course many important issues of implementation that we have omitted that must be addressed in practice. Foremost among these is the fact that in applications, one does not usually have an oracle for the exact probabilities, or even an oracle for random examples, but only a random sample $S$ of fixed (often small) size. To address this, it is common practice to use the empirical distribution on $S$ to first grow the tree in the top-down fashion of Figure 1 until $\epsilon_t$ is actually zero (which may require $t$ to be on the order of $|S|$), and then use the splitting criterion $G$ again to prune the tree. This is done in order to avoid the phenomenon known as overfitting, in which the error on the sample and the error on the distribution diverge [16, 3]. Despite such issues, our idealization $\mathrm{TopDown}_{F,G}(t)$ captures the basic algorithmic ideas behind many widely used decision tree algorithms. For example, it is fair to think of the popular C4.5 software package as a variant of $\mathrm{TopDown}_{F,G}(t)$ in which $G(q)$ is the binary entropy function, and $F$ is the class of single variables (projection functions) [16]. Similarly, the CART program uses the splitting criterion $G(q) = 4q(1-q)$, known as the Gini criterion [3]. We feel that the analysis given here provides some insight into why such top-down decision tree algorithms succeed, and what their limitations are.
Let us now discuss the choice of the node function class $F$ in more detail. Intuitively, the more powerful the class of node functions, the more rapidly we might expect $G_t$ and $\epsilon_t$ to decay with $t$. This is simply because more powerful node functions may allow us to represent the same function more succinctly as a decision tree. Of course, the more powerful $F$ is, the more computationally expensive $\mathrm{TopDown}_{F,G}(t)$ becomes: even if the choice of $F$ is such that we can replace the naive for loop of line 4 by a direct computation of $(\ell_{best}, h_{best})$, we still expect this minimization to require more time as $F$ becomes more powerful. Thus there is a trade-off between the expressive power of $F$ and the running time of $\mathrm{TopDown}_{F,G}(t)$.

In the software packages C4.5 and CART, the default is to err on the side of simplicity: typically just the projections of the input variables and their negations (or slightly more powerful classes) are used as the node functions. The implicit attitude taken is that in practical applications, expensive searches over powerful classes $F$ are simply not feasible, so we ensure the computational efficiency of computing $(\ell_{best}, h_{best})$ (or at least a good approximation), and hope that a simple $F$ is sufficiently powerful to significantly reduce $G_t$ with each new internal node added. In our analysis, the class $F$ is a parameter. Our approach may be paraphrased as follows: assuming that we have made a "favorable" choice of $F$, what can we say about the rate of decay of $G_t$ and $\epsilon_t$ as a function of $t$? Of course, for this approach to be interesting, we need a reasonable definition of what it means to have made a "favorable" choice of $F$, and it is clear that this definition must say something about the relationship between $F$ and the target function $f$. In the next section, we adopt the Weak Hypothesis Assumption (motivated by and closely related to the model of Weak Learning [13, 17, 10]) to quantify this relationship.
We defer detailed discussion of the choice of the permissible splitting criterion $G$, since one of the main results of our analysis is a rather precise reason why some choices may be vastly preferable to others. Here it suffices to simply emphasize that different choices of $G$ can in fact result in different trees being built: thus, both $\ell_{best}$ and $h_{best}$ may depend strongly on $G$. The best choice of $G$ in practice is far from being a settled issue, as evidenced by the fact that the two most popular decision tree learning packages (C4.5 and CART) use different choices for $G$. There have also been a number of experimental papers examining various choices [15, 6]. Perhaps the insights in this paper most relevant to the practice of machine learning are those regarding the behavior of $\mathrm{TopDown}_{F,G}$ as a function of $G$ [7].
3 The Weak Hypothesis Assumption

We now quantify what we mean by a "favorable" choice of the node splitting class $F$. The definition we adopt is essentially the one used by a number of previous papers on the topic of weak learning [17, 9, 10].

Definition 1 Let $f$ be any boolean function over an input space $X$. Let $F$ be any class of boolean functions over $X$. Let $\gamma \in (0, 1/2]$. We say that $f$ $\gamma$-satisfies the Weak Hypothesis Assumption with respect to $F$ if for any distribution $P$ over $X$, there exists a function $h \in F$ such that $\Pr_P[h(x) \neq f(x)] \leq 1/2 - \gamma$. If $\mathcal{P}$ is a class of distributions, we say that $f$ $\gamma$-satisfies the Weak Hypothesis Assumption with respect to $F$ over $\mathcal{P}$ if the statement holds for any $P \in \mathcal{P}$. We call the parameter $\gamma$ the advantage.

It is worth mentioning that this definition can be extended to the case where $F$ is a class of probabilistic boolean functions [12]. All of our results hold for this more general setting.
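As an illustration (ours, not the paper's), the sketch below computes the best advantage a class $F$ achieves on one fixed finite distribution. Note that Definition 1 quantifies over every distribution $P$, so this checks only a single instance of that quantifier.

```python
from itertools import product

def best_advantage(P, f, F):
    """Largest gamma with Pr_P[h(x) != f(x)] <= 1/2 - gamma for some h in F,
    for a single finite distribution P given as a dict {x: probability}."""
    best_err = min(sum(p for x, p in P.items() if h(x) != f(x)) for h in F)
    return 0.5 - best_err

# Toy check: majority of 3 bits against single-variable projections, uniform P.
P = {x: 1 / 8 for x in product((0, 1), repeat=3)}
f = lambda x: int(sum(x) >= 2)
F = [(lambda x, i=i: x[i]) for i in range(3)]
print(best_advantage(P, f, F))   # 0.25: each projection errs on 2 of the 8 points
```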
Note that if $F$ actually contains the function $f$, then $f$ trivially $(1/2)$-satisfies the Weak Hypothesis Assumption with respect to $F$. If $F$ does not contain $f$, then the Weak Hypothesis Assumption amounts to an assumption on $f$. More precisely, it is known that the Weak Hypothesis Assumption is equivalent to the assumption that on any distribution, $f$ can be approximated by thresholded linear combinations of functions in $F$ [9].

Compared to the PAC model and its variants, in the Weak Hypothesis Assumption we take a more incremental view of learning: rather than assuming that the target function lies in some large, fixed class, and trying to design an efficient algorithm for searching this class, we instead hope that "simple" functions (in our case, the decision tree node functions) are already slightly correlated with the target function, and analyze an algorithm's ability to amplify these slight correlations to achieve arbitrary accuracy. Based on our results here and our remarks in the introduction regarding the known results for decision tree learning in the PAC model, it seems that the weak learning approach is perhaps more enlightening for the decision tree heuristics we analyze.
Under the Weak Hypothesis Assumption, several previous papers have proposed boosting algorithms that combine many different functions from $F$, each with a small predictive advantage over random guessing on a different filtered distribution, to obtain a single hybrid function whose generalization error on the target distribution $P$ is less than any desired value $\epsilon$ [17, 9, 10]. The central question examined for such algorithms is: As a function of the advantage $\gamma$ and the desired error $\epsilon$, how many functions must be combined, and how many random examples drawn? Several boosting algorithms enjoy very strong upper bounds on these quantities [17, 9, 10]: the number of functions that must be combined is polynomial in $1/\gamma$ and $\log(1/\epsilon)$, and the number of examples required is polynomial in $1/\gamma$ and $1/\epsilon$. These boosting methods represent their hypothesis as a thresholded linear combination of functions from $F$, a choice that is obviously well-suited to the Weak Hypothesis Assumption in light of the remarks above.
The goal of this paper is to give a similar analysis for top-down decision tree algorithms: as a function of $\gamma$ and $\epsilon$, how many nodes must the constructed tree have (that is, how large must $t$ be) before the error $\epsilon_t$ is driven below $\epsilon$? However, we can immediately dismiss the idea that any decision tree learning algorithm (top-down or otherwise) can achieve performance comparable to that achieved by algorithms specifically designed for the weak learning model. To see this, suppose the input space $X$ is $\{0,1\}^n$, and that $f$ is the majority function. Then if $F$ is the class of single-variable projection functions, it is known that $f$ $(1/n)$-satisfies the Weak Hypothesis Assumption with respect to $F$. However, the smallest decision tree with generalization error less than $1/4$ with respect to $f$ on the uniform input distribution has size exponential in $n$. From this example we can immediately infer that $t$ must be exponential in $1/\gamma$ in order for $\mathrm{TopDown}_{F,G}(t)$ to achieve any nontrivial performance. Note, however, that this limitation is entirely representational: it arises solely from the fact that we have chosen to learn using a decision tree over $F$, not from any properties of the algorithm we use to construct the tree. A more sophisticated construction due to Freund [9] (and discussed following Theorem 1) implies that this representational lower bound can be improved to $(1/\epsilon)^{c/\gamma^2}$ for some constant $c > 0$.

We will show that for appropriately chosen $G$, algorithm $\mathrm{TopDown}_{F,G}(t)$ in fact achieves this optimal bound of $(1/\epsilon)^{O(1/\gamma^2)}$.
It is reasonable to ask whether some assumption other than the Weak Hypothesis Assumption might permit even more favorable analyses of top-down decision tree algorithms. In this regard, we briefly note that, as with other boosting analyses, a significant weakening of the Weak Hypothesis Assumption in fact suffices for our analysis. Namely, our results hold if $f$ satisfies the Weak Hypothesis Assumption over $\mathcal{P}$, where $\mathcal{P}$ contains the target distribution $P$ and all those filterings of $P$ that are constructed by the algorithm (this notion will be made more precise in the analysis). Furthermore, it can be shown that this weaker assumption is in fact implied by the rapid decay of $\epsilon_t$. In other words, this weakened assumption holds if and only if $\mathrm{TopDown}_{F,G}(t)$ performs well. Thus the Weak Hypothesis Assumption seems to be the correct framework in which to analyze algorithm $\mathrm{TopDown}_{F,G}(t)$.
4 Statement of the Main Result

Our main result gives upper bounds on the error $\epsilon_t$ of $\mathrm{TopDown}_{F,G}(t)$ for three different choices of the splitting criterion $G$, under the assumption that the target function $\gamma$-satisfies the Weak Hypothesis Assumption for $F$. Equivalently, we give upper bounds on the value of $t$ required to satisfy $\epsilon_t \leq \epsilon$ for any given $\epsilon$.

Two of our choices for $G$ are motivated by the popular implementations of $\mathrm{TopDown}_{F,G}(t)$ previously discussed. The Gini criterion used by the CART program is $G(q) = 4q(1-q)$, and the entropy criterion used by the C4.5 software package is $G(q) = H(q) = -q\log(q) - (1-q)\log(1-q)$. It is easily verified that both are permissible. The third criterion we examine is $G(q) = 2\sqrt{q(1-q)}$. This new choice is motivated by our analysis: it is the splitting criterion for which we obtain the best bounds. A plot of these three functions is given in Figure 2.
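For orientation, the following matplotlib sketch (ours; Figure 2 itself is not reproduced here) plots the three criteria together with $\min(q, 1-q)$; on $[0,1]$ they are ordered pointwise as $\min(q, 1-q) \leq 4q(1-q) \leq H(q) \leq 2\sqrt{q(1-q)}$.

```python
import numpy as np
import matplotlib.pyplot as plt

q = np.linspace(0.0, 1.0, 201)
gini = 4 * q * (1 - q)                               # CART's Gini criterion
with np.errstate(divide="ignore", invalid="ignore"):
    entropy = np.where((q > 0) & (q < 1),
                       -q * np.log2(q) - (1 - q) * np.log2(1 - q), 0.0)  # C4.5
sqrt_crit = 2 * np.sqrt(q * (1 - q))                 # the new criterion

plt.plot(q, np.minimum(q, 1 - q), "k--", label="min(q, 1-q)")
plt.plot(q, gini, label="Gini: 4q(1-q)")
plt.plot(q, entropy, label="entropy: H(q)")
plt.plot(q, sqrt_crit, label="2 sqrt(q(1-q))")
plt.xlabel("q"); plt.ylabel("G(q)"); plt.legend(); plt.show()
```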
Theorem 1 Let $F$ be any class of boolean functions, let $\gamma \in (0, 1/2]$, and let $f$ be any target boolean function that $\gamma$-satisfies the Weak Hypothesis Assumption with respect to $F$. Let $P$ be any target distribution, and let $T$ be the tree output by $\mathrm{TopDown}_{F,G}(t)$. Then for any $\epsilon$, the error $\epsilon(T)$ is less than $\epsilon$ provided that

$$t \geq \left(\frac{1}{\epsilon}\right)^{(c/\gamma^2\epsilon^2)\log(1/\epsilon)} \quad \text{if } G(q) = 4q(1-q), \quad (4)$$

for some constant $c > 0$; or provided that

$$t \geq \left(\frac{1}{\epsilon}\right)^{c\log(1/\epsilon)/\gamma^2} \quad \text{if } G(q) = H(q), \quad (5)$$

for some constant $c > 0$; or provided that

$$t \geq \left(\frac{1}{\epsilon}\right)^{c/\gamma^2} \quad \text{if } G(q) = 2\sqrt{q(1-q)}, \quad (6)$$

for some constant $c > 0$. Furthermore, suppose that instead of the ability to compute probabilities exactly with respect to $P$, we are only given an oracle for random examples $\langle x, f(x) \rangle$. Consider taking a random sample $S$ of $m$ examples, and running algorithm $\mathrm{TopDown}_{F,G}(t)$ using the empirical distribution on $S$. There exists a polynomial $p(\log(1/\delta), 1/\epsilon, t)$ such that if $m$ is larger than $p(\log(1/\delta), 1/\epsilon, t)$ and $t$ satisfies the above conditions, then with probability greater than $1 - \delta$, the error $\epsilon(T)$ of $T$ with respect to $f$ and $P$ will be smaller than $\epsilon$.
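To get a feel for the gap between the three bounds, the sketch below (ours, not the paper's) evaluates their base-10 logarithms at sample values of $\gamma$ and $\epsilon$, with the unspecified constant $c$ set to 1 purely for illustration.

```python
import math

def log10_tree_size_bounds(gamma, eps, c=1.0):
    """Base-10 logs of the Theorem 1 tree-size bounds; the theorem only
    guarantees that *some* constant c > 0 works, so c = 1 is illustrative."""
    L = math.log(1 / eps)            # log(1/eps); the base folds into c
    lg = math.log10(1 / eps)
    return {
        "Gini  4q(1-q)":  lg * c * L / (gamma ** 2 * eps ** 2),  # Eq. (4)
        "entropy  H(q)":  lg * c * L / gamma ** 2,               # Eq. (5)
        "2 sqrt(q(1-q))": lg * c / gamma ** 2,                   # Eq. (6)
    }

for name, logt in log10_tree_size_bounds(gamma=0.3, eps=0.1).items():
    print(f"{name:>15}: t >= 10^{logt:.1f}")
# Gini gives roughly 10^2558, entropy 10^26, the new criterion 10^11.
```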
In terms of the dependence on the advantage parameter $\gamma$, all three bounds are exponential in $1/\gamma$. As we have discussed, the majority example shows that this dependence is an unavoidable consequence of using decision trees. In terms of the dependence on $\epsilon$, however, there are vast differences between the bounds for different choices of $G$. For the Gini criterion, the bound is exponential in $1/\epsilon$, and for the entropy criterion of C4.5 it is superpolynomial but subexponential. Only for the new criterion $G(q) = 2\sqrt{q(1-q)}$ do we obtain a bound polynomial in $1/\epsilon$. The following argument, based on ideas of Freund [9], demonstrates that this bound is in fact optimal for decision tree algorithms (top-down or otherwise). Let $f$ be the target function, and suppose that the class $F$ of splitting functions contains just a single probabilistic function $h$ such that for any input $x$, $\Pr[h(x) \neq f(x)] = 1/2 - \gamma$. Here the probability is taken only over the internal randomization of $h$. Intuitively, to approximate $f$, one would take many copies of $h$ (each with independent random bits) and take the majority vote. For any fixed tree size $t$, the optimal decision tree over $F$ will simply be the complete binary tree of depth $\log(t)$ in which each internal node is labeled by $h$ (that is, by a copy of $h$ with its own independent random bits). However, it is not difficult to prove that in order for the tree to have error less than $\epsilon$, the depth of the tree must be $\Omega((1/\gamma^2)\log(1/\epsilon))$, resulting in a lower bound on the size $t$ of $(1/\epsilon)^{c/\gamma^2}$ for some constant $c > 0$. Thus, the bound given in Theorem 1 for the choice $G(q) = 2\sqrt{q(1-q)}$ is optimal for decision tree learning algorithms. We do not have matching lower bounds for the upper bounds given in Theorem 1 for the Gini and entropy splitting criteria. The proof of Theorem 1 will show that there are good technical reasons for believing that the claimed differences between the criteria are qualitatively real, and this seems to be borne out by preliminary experimental results [7].
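The depth requirement in this argument is easy to check numerically. The sketch below (ours) computes the exact error of a majority vote over $k$ independent copies of $h$ and finds the smallest odd $k$ driving it below $\epsilon$; the resulting $k$ grows like $(1/\gamma^2)\log(1/\epsilon)$, so a complete tree over these copies needs size about $2^k$.

```python
from math import comb

def majority_vote_error(gamma, k):
    """Exact error of a majority vote over k independent copies of the
    probabilistic h, each erring with probability 1/2 - gamma (k odd)."""
    agree = 0.5 + gamma                 # per-copy probability of agreeing with f
    return sum(comb(k, i) * agree**i * (1 - agree)**(k - i)
               for i in range((k + 1) // 2))   # fewer than half the copies agree

gamma, eps = 0.1, 0.05
k = 1
while majority_vote_error(gamma, k) >= eps:
    k += 2
print(f"k = {k} copies suffice, i.e., a complete tree of size roughly 2^{k}")
```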
The remainder of the paper is devoted to the proof of Theorem 1. We emphasize and discuss the parts of the analysis that shed light on important issues, such as the reason the Weak Hypothesis Assumption is helpful for $\mathrm{TopDown}_{F,G}(t)$, and the differences between the various choices for $G$.
5 Proof of the Main Result

Our approach is to bound $G_t$ as a function of $t$ and $\gamma$, and therefore bound $\epsilon_t$ as well. The main idea is to show that $G_t$ decreases by at least a certain amount with each new split.

Let us fix a leaf $\ell$ of $T$, and use the shorthand $w = w(\ell)$ and $q = q(\ell)$ (recall that $w(\ell)$ is the probability of reaching $\ell$, and $q(\ell)$ is the probability that the target function is 1 given that we have reached $\ell$). Now suppose we split the leaf $\ell$ by replacing it with a new internal node labeled by the test function $h \in F$ (that is, $T$ becomes $T(\ell, h)$). Then we have two new child leaves of the now-internal node $\ell$, one corresponding to the test outcome $h(x) = 0$ and the other corresponding to the test outcome $h(x) = 1$. Let us refer to these leaves as $\ell_0$ and $\ell_1$ respectively, and let $\tau$ denote the probability that $x$ reaches $\ell_1$ given that $x$ reaches the internal node $\ell$ (thus, $w(\ell_0) = (1-\tau)w$ and $w(\ell_1) = \tau w$). Let $p = q(\ell_0)$ and $r = q(\ell_1)$; then we have $q = (1-\tau)p + \tau r$. This simply says that the total probability that $f(x) = 1$ must be preserved after splitting at $\ell$. See Figure 4 for a diagram summarizing the split parameters.
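These split parameters are mechanical to compute from a finite distribution at the leaf; the sketch below (ours, not the paper's) does so and verifies the identity $q = (1-\tau)p + \tau r$.

```python
def split_parameters(examples, h):
    """Compute (q, tau, p, r) for splitting a leaf with node function h.
    `examples` lists (x, f(x), P_l(x)) for inputs reaching the leaf, with the
    P_l(x) summing to 1 (so w = 1, matching the local analysis below)."""
    q   = sum(pr for _, y, pr in examples if y == 1)
    tau = sum(pr for x, _, pr in examples if h(x) == 1)
    r   = sum(pr for x, y, pr in examples if h(x) == 1 and y == 1) / tau
    p   = sum(pr for x, y, pr in examples if h(x) == 0 and y == 1) / (1 - tau)
    assert abs(q - ((1 - tau) * p + tau * r)) < 1e-12   # q = (1-tau) p + tau r
    return q, tau, p, r

# Toy leaf: uniform over {0,1}^2, target f = x0 OR x1, split on h = x0.
exs = [((a, b), a | b, 0.25) for a in (0, 1) for b in (0, 1)]
print(split_parameters(exs, lambda x: x[0]))   # (q, tau, p, r) = (0.75, 0.5, 0.5, 1.0)
```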
We are now in position to make some simple but central observations regarding how the split at $\ell$ reduces the quantity $G_t$. Note that since $q = (1-\tau)p + \tau r$ and $\tau \in [0,1]$, one of $p$ and $r$ is less than or equal to $q$ and the other is greater than or equal to $q$. Without loss of generality, let $p \leq q \leq r$. Now before the split, the contribution of the leaf $\ell$ to $G_t$ was $w \cdot G(q)$. After the split, the contribution of the now-internal node $\ell$ to $G_{t+1}$ is $w((1-\tau)G(p) + \tau G(r))$. Thus the decrease to $G_t$ caused by the split is exactly $w(G(q) - (1-\tau)G(p) - \tau G(r))$. It will be convenient to analyze the local decrease to $G_t$, thus assuming $w = 1$, and to reintroduce the $w$ factor later. Hence, we are interested in lower bounding the quantity

$$G(q) - (1-\tau)G(p) - \tau G(r). \quad (7)$$

This quantity is simply the difference between the value of the concave function $G$ at the point $q$, which is a linear combination of $p$ and $r$, and the value of the same linear combination of $G(p)$ and $G(r)$; see Figure 5.
Define $\delta = r - p$. It is easy to see that if $\delta$ is small, or if $\tau$ is near 0 or 1, then $G(q) - (1-\tau)G(p) - \tau G(r)$ will be small. Thus to lower bound the decrease to $G_t$, we will first need to argue that under the Weak Hypothesis Assumption these conditions cannot occur. Towards this goal, let $P_\ell$ denote the distribution of inputs reaching $\ell$, and let the "balanced" distribution $P'_\ell$ be defined by $P'_\ell(x) = P_\ell(x)/(2q)$ if $f(x) = 1$ and $P'_\ell(x) = P_\ell(x)/(2(1-q))$ if $f(x) = 0$. Thus $P'_\ell$ is simply $P_\ell$ modified to give equal weight to the positive and negative examples of $f$ (see Figure 6). Lemma 2 below shows that if the function $h$ used to split at $\ell$ witnesses the Weak Hypothesis Assumption for the balanced distribution at $\ell$, then $\delta$ cannot be too small, and $\tau$ cannot be too near 0 or 1.
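The quantity (7) is the Jensen gap of the concave criterion $G$ at the split. The sketch below (ours) evaluates it for the three criteria and illustrates numerically that it vanishes as $\delta$ shrinks or $\tau$ approaches 0 or 1.

```python
import math

def local_drop(G, tau, p, r):
    """Quantity (7): G(q) - (1-tau) G(p) - tau G(r), with q = (1-tau) p + tau r."""
    q = (1 - tau) * p + tau * r
    return G(q) - (1 - tau) * G(p) - tau * G(r)

H    = lambda q: 0.0 if q in (0, 1) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)
gini = lambda q: 4 * q * (1 - q)
root = lambda q: 2 * math.sqrt(q * (1 - q))

# The drop shrinks as delta = r - p shrinks, or as tau nears 0 or 1:
for tau, p, r in [(0.5, 0.3, 0.7), (0.5, 0.45, 0.55), (0.05, 0.3, 0.7)]:
    print(f"tau={tau}, delta={r - p:.2f}:",
          [round(local_drop(G, tau, p, r), 4) for G in (gini, H, root)])
```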
5.1 Constraints on $\delta$ and $\tau$ from the Weak Hypothesis Assumption

Lemma 2 Let $q$, $\tau$, and $\delta$ be as defined above for the split at leaf $\ell$. Let $P_\ell$ be the distribution induced by $P$ on those inputs reaching $\ell$, and let $P'_\ell$ be the balanced distribution at $\ell$. If the function $h$ placed at $\ell$ obeys $\Pr_{P'_\ell}[h(x) \neq f(x)] \leq 1/2 - \gamma$ (thus, $h$ witnesses the Weak Hypothesis Assumption for $P'_\ell$), then $\delta\tau(1-\tau) \geq \gamma q(1-q)$.
Proof: Recall that $\tau = \Pr_{P_\ell}[h(x) = 1]$ and $r = \Pr_{P_\ell}[f(x) = 1 \mid h(x) = 1]$. It follows that $\Pr_{P_\ell}[f(x) = 1, h(x) = 1] = \tau r$ and therefore $\Pr_{P_\ell}[f(x) = 0, h(x) = 1] = \tau - \tau r = \tau(1-r)$. In going from $P_\ell$ to the balanced distribution $P'_\ell$, inputs $x$ such that $f(x) = 0$ are scaled by the factor $1/(2(1-q))$, and thus $\Pr_{P'_\ell}[f(x) = 0, h(x) = 1] = (1/(2(1-q)))\tau(1-r)$. Similarly, since $q = \Pr_{P_\ell}[f(x) = 1]$, we have $\Pr_{P_\ell}[f(x) = 1, h(x) = 0] = q - \tau r$, and this is scaled by the factor $1/(2q)$ to obtain $\Pr_{P'_\ell}[f(x) = 1, h(x) = 0] = (1/(2q))(q - \tau r)$.
(See Figure 6.) Thus, we may write an exact expression for $\Pr_{P'_\ell}[h(x) \neq f(x)]$:

$$\Pr_{P'_\ell}[h(x) \neq f(x)] = \frac{1}{2(1-q)}\,\tau(1-r) + \frac{1}{2q}\,(q - \tau r) = \frac{1}{2} + \frac{\tau}{2}\left(\frac{1-r}{1-q} - \frac{r}{q}\right). \quad (8)$$
By the assumption on $h$, $\Pr_{P'_\ell}[h(x) \neq f(x)] \leq 1/2 - \gamma$. Thus,

$$\frac{\tau}{2}\left(\frac{r}{q} - \frac{1-r}{1-q}\right) \geq \gamma.$$
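As a sanity check on the reconstruction of equation (8) (ours, not part of the paper), the sketch below compares the closed form to the direct two-term computation on the toy split used earlier.

```python
def balanced_error(tau, p, r):
    """Pr over the balanced distribution P'_l that h disagrees with f, Eq. (8)."""
    q = (1 - tau) * p + tau * r
    return 0.5 + (tau / 2) * ((1 - r) / (1 - q) - r / q)

# Toy split from Section 5 above (f = x0 OR x1 at the leaf, h = x0):
tau, p, r = 0.5, 0.5, 1.0
q = (1 - tau) * p + tau * r                     # q = 0.75
direct = tau * (1 - r) / (2 * (1 - q)) + (q - tau * r) / (2 * q)
assert abs(balanced_error(tau, p, r) - direct) < 1e-12
print(balanced_error(tau, p, r))   # 1/6, so h has advantage gamma = 1/3 here
```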