Additive Logistic Regression: a Statistical View of Boosting

Jerome Friedman*        Trevor Hastie*        Robert Tibshirani†

July 23, 1998

Abstract

Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and then taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to those of boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale applications.



* Department of Statistics, Sequoia Hall, Stanford University, Stanford, California 94305; {jhf,trevor}@stat.stanford.edu

† Department of Public Health Sciences, and Department of Statistics, University of Toronto; [email protected]

1 Introduction

The starting point for this paper is an interesting procedure called "boosting", which is a way of combining or boosting the performance of many "weak" classifiers to produce a powerful "committee". Boosting was proposed by Freund & Schapire (1996) and has since received much attention.

While boosting has evolved somewhat over the years, we first describe the most commonly used version of the AdaBoost procedure (Freund & Schapire 1996), which we call "Discrete" AdaBoost. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values $\pm 1$ and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is defined to be a linear combination of the classifiers from each stage. We describe the procedure in more detail in Algorithm 1.

Discrete AdaBoost (Freund & Schapire 1996)

1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the classifier $f_m(x)$ from the training data with weights $w_i$.

   (b) Compute $e_m = E_w[1_{(y \neq f_m(x))}]$ and $c_m = \log((1 - e_m)/e_m)$.

   (c) Set $w_i \leftarrow w_i \exp[c_m \cdot 1_{(y_i \neq f_m(x_i))}]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.

3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 1: $E_w$ is the expectation with respect to the weights $w = (w_1, w_2, \ldots, w_N)$. At each iteration AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
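To make the weight and coefficient updates in Algorithm 1 concrete, here is a minimal sketch in Python/NumPy, assuming a scikit-learn-style weak learner (a decision stump) that accepts sample weights; the helper names and the choice of weak learner are ours, not part of the original procedure.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stump used as the weak learner

    def discrete_adaboost(X, y, M=100):
        """Sketch of Algorithm 1 (Discrete AdaBoost); y must take values in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)                     # step 1: uniform weights
        learners, coefs = [], []
        for m in range(M):
            # (a) fit the weak classifier on the weighted sample
            f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = f.predict(X)
            # (b) weighted training error and coefficient c_m = log((1 - e_m)/e_m)
            err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
            c = np.log((1 - err) / err)
            # (c) up-weight misclassified cases and renormalize
            w *= np.exp(c * (pred != y))
            w /= w.sum()
            learners.append(f)
            coefs.append(c)
        def F(Xnew):
            return sum(c * f.predict(Xnew) for c, f in zip(coefs, learners))
        return lambda Xnew: np.sign(F(Xnew))        # step 3: sign of the weighted vote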

Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for $f_m(x)$ and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (NIPS workshop, 1996) called AdaBoost with trees the "best off-the-shelf classifier in the world". Interestingly, the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, AdaBoost seems to be immune to overfitting.

Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 7). Included in the figure is the bagged tree (Breiman 1996), which averages trees grown on bootstrap-resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.

Early versions of AdaBoost used a resampling scheme to implement step 2 of Algorithm 1, by weighted importance sampling from the training data. This suggested a connection with bagging, and that a major component of the success of boosting has to do with variance reduction. However, boosting performs comparably well when:

- a weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging;

- "stumps" are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias. Bagging performs very poorly with stumps (Fig. 1 [top-right panel]).

These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.

The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathcal{X} \mapsto \{-1, 1\}$, where $\mathcal{X}$ is the domain of the predictive features $x$. If the implementation of the base classifier cannot deal with observation weights, weighted resampling is used instead. Freund & Schapire (1996) and Schapire & Singer (1998) have suggested various modifications to improve the boosting algorithms; here we focus on a version due to Schapire & Singer (1998), which we call "Real AdaBoost", that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The base classifier for this generalized boosting produces a mapping $f_m(x): \mathcal{X} \mapsto \mathbb{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued boosting tends to perform the best in our simulated examples in Fig. 1, especially with stumps, although we see that with 100-node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.

[Figure 1 (three panels: Stumps, 10 Node Trees, 100 Node Trees; test error versus number of terms): Test error for Bagging (BAG), Discrete AdaBoost (DAB) and Real AdaBoost (RAB) on a simulated two-class nested spheres problem (see Section 5). There are 2000 training data points in 10 dimensions, and the Bayes error rate is zero. All trees are grown "best-first" without pruning. The left-most iteration corresponds to a single tree.]

Real AdaBoost (Schapire & Singer 1998)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the "confidence-rated" classifier $f_m(x): \mathcal{X} \mapsto \mathbb{R}$ and the constant $c_m$ from the training data with weights $w_i$.

   (b) Set $w_i \leftarrow w_i \exp[-c_m \, y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.

3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 2: The Real AdaBoost algorithm allows the estimator $f_m(x)$ to range over $\mathbb{R}$. In the special case that $f_m(x) \in \{-1, 1\}$ it reduces to AdaBoost, since $y_i f_m(x_i)$ is $1$ for a correct and $-1$ for an incorrect classification. In the general case the constant $c_m$ is absorbed into $f_m(x)$. We describe the Schapire-Singer estimate for $f_m(x)$ in Section 3.
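As an illustration only, here is a minimal sketch of Algorithm 2 in Python. It anticipates the terminal-node estimate derived in Section 3 (Proposition 2), namely half the log-ratio of the weighted class proportions within each region, and it uses a weighted regression stump purely as a stand-in for the tree-growing step; all helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def real_adaboost(X, y, M=100, eps=1e-6):
        """Sketch of Algorithm 2; y takes values in {-1, +1}.
        Terminal-node value: 0.5 * log(P_w(y=1|region) / P_w(y=-1|region))."""
        N = len(y)
        w = np.full(N, 1.0 / N)
        stumps = []
        for m in range(M):
            stump = DecisionTreeRegressor(max_depth=1).fit(X, y, sample_weight=w)
            regions = stump.apply(X)                    # terminal-node id per observation
            scores = {}
            for r in np.unique(regions):
                in_r = regions == r
                p1 = w[in_r & (y == 1)].sum() + eps     # weighted proportion of class +1
                p0 = w[in_r & (y == -1)].sum() + eps
                scores[r] = 0.5 * np.log(p1 / p0)       # confidence-rated output
            f = np.array([scores[r] for r in regions])
            w *= np.exp(-y * f)                         # step (b): reweight, then renormalize
            w /= w.sum()
            stumps.append((stump, scores))
        def F(Xnew):
            total = np.zeros(len(Xnew))
            for stump, scores in stumps:
                total += np.array([scores[r] for r in stump.apply(Xnew)])
            return total
        return F                                        # classify with sign(F(x))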

Freund & Schapire (1996) and Schapire & Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice boosting achieves results far more impressive than the bounds would imply.

In this paper we analyze the AdaBoost procedures from a statistical perspective. We show that the underlying model they are fitting is additive logistic regression. The AdaBoost algorithms are Newton methods for optimizing a particular exponential loss function, a criterion which behaves much like the log-likelihood on the logistic scale. We also derive new boosting-like procedures for classification.

In Section 2 we briefly review additive modeling. Section 3 shows how boosting can be viewed as an additive model estimator, and proposes some new boosting methods for the two-class case. The multiclass problem is studied in Section 4. Simulated and real data experiments are discussed in Sections 5 and 6. Our tree-growing implementation, using truncated best-first trees, is described in Section 7. Weight trimming to speed up computation is discussed in Section 8, and we end with a discussion in Section 9.

2 Additive Models

AdaBoost produces an additive model $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, although "weighted committee" or "ensemble" sound more glamorous. Additive models have a long history in statistics, and we give some examples here.

2.1 Additive Regression Models

We initially focus on the regression problem, where the response $y$ is quantitative, and we are interested in modeling the mean $E(Y|x) = F(x)$. The additive model has the form

    F(x) = \sum_{j=1}^{p} f_j(x_j).     (1)

Here there is a separate function $f_j(x_j)$ for each of the $p$ input variables $x_j$. More generally, each component $f_j$ is a function of a small, pre-specified subset of the input variables. The backfitting algorithm (Friedman & Stuetzle 1981, Buja, Hastie & Tibshirani 1989) is a convenient modular algorithm for fitting additive models. A backfitting update is

    f_j(x_j) \leftarrow E\Big[ y - \sum_{k \neq j} f_k(x_k) \,\Big|\, x_j \Big].     (2)

Any method or algorithm for estimating a function of $x_j$ can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. On the right-hand side, all the latest versions of the functions $f_k$ are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of $E(y - F(x))^2$ (Buja et al. 1989).
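As a concrete illustration of the backfitting update (2), here is a minimal sketch; the crude bin-average smoother and the function names are our own choices for illustration, standing in for the local regression or smoothing splines mentioned above.

    import numpy as np

    def smooth_1d(x, r, bins=20):
        """Crude univariate smoother: average the partial residual r within bins of x."""
        edges = np.quantile(x, np.linspace(0, 1, bins + 1))
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                          for b in range(bins)])
        return means[idx]

    def backfit(X, y, n_cycles=10):
        """Backfitting for the additive model (1); assumes y has been centered."""
        N, p = X.shape
        f = np.zeros((N, p))                          # fitted components at the data points
        for _ in range(n_cycles):
            for j in range(p):
                partial = y - f.sum(axis=1) + f[:, j]  # y minus all the other components
                f[:, j] = smooth_1d(X[:, j], partial)  # update (2) with the smoother
                f[:, j] -= f[:, j].mean()              # center each component for identifiability
        return f                                       # F(x_i) = f.sum(axis=1)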

2.2 Extended Additive Models

More generally, one can consider additive models whose elements $\{f_m(x)\}_1^M$ are functions of potentially all of the input features $x$. Usually, in this context, the $f_m(x)$ are taken to be simple functions characterized by a set of parameters $\gamma$ and a multiplier $\beta_m$,

    f_m(x) = \beta_m b(x; \gamma_m).     (3)

The additive model then becomes

    F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m).     (4)

For example, in single hidden layer neural networks $b(x; \gamma) = \sigma(\gamma^t x)$, where $\sigma(\cdot)$ is a sigmoid function and $\gamma$ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice, with $\gamma$ parameterizing the location and scale of a "mother" wavelet $b(x; \gamma)$. In these applications $\{b(x; \gamma_m)\}_1^M$ are generally called "basis functions", since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates

    \{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta\, b(x; \gamma) \Big]^2.     (5)

Alternatively, one can use a "greedy" forward stepwise approach

    \{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\,[\, y - F_{m-1}(x) - \beta\, b(x; \gamma) \,]^2,     (6)

where $\{\beta_k, \gamma_k\}_1^{m-1}$ are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat & Zhang (1993) in "matching pursuit", where the $b(x; \gamma)$ represent an over-complete wavelet-like basis. In the language of boosting, $f(x) = \beta b(x; \gamma)$ would be called a "weak learner" and $F_M(x)$ (4) the "committee". If decision trees were used as the weak learner, the parameters $\gamma$ would represent the splitting variables, split points, the constants in each terminal node, and the number of terminal nodes of each tree.

Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

    y \leftarrow y - \sum_{k \neq m} f_k(x).

In the forward stepwise procedure (6), the modified output $y_m$ at the $m$th iteration depends only on its value $y_{m-1}$ and the solution $f_{m-1}(x)$ at the previous iteration:

    y_m = y_{m-1} - f_{m-1}(x).     (7)

At each step $m$, the previous output values $y_{m-1}$ are modified (7) so that the previous model $f_{m-1}(x)$ has no explanatory power on the new outputs $y_m$. One can therefore view this as a procedure for boosting a weak learner $f(x) = \beta b(x; \gamma)$ to form a powerful committee $F_M(x)$ (4).

2.3 Classification problems

For the classification problem, we learn from Bayes theorem that all we need is $P(y = j | x)$, the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that $E(1_{(y=j)} | x) = P(y = j | x)$, where $1_{(y=j)}$ is the 0/1 indicator variable representing class $j$. While this works fairly well in general, several problems have been noted (Hastie, Tibshirani & Buja 1994) for constrained regression methods. The estimates are typically not confined to $[0, 1]$, and severe masking problems can occur. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman et al. (1984).

Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, the model is

    \log \frac{P(y = 1 | x)}{P(y = 0 | x)} = \sum_{m=1}^{M} f_m(x).     (8)

The monotone logit transformation on the left guarantees that for any values of $F(x) = \sum_{m=1}^{M} f_m(x) \in \mathbb{R}$, the probability estimates lie in $[0, 1]$; inverting, we get

    p(x) = P(y = 1 | x) = \frac{e^{F(x)}}{1 + e^{F(x)}}.     (9)

Here we have given a general additive form for $F(x)$; special cases exist that are well known in statistics. In particular, the linear logistic regression model (McCullagh & Nelder 1989, for example) and the additive logistic regression model (Hastie & Tibshirani 1990) are popular. These models are usually fit by maximizing the binomial log-likelihood, and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.

A generalized version of backfitting (2), called "Local Scoring" in Hastie & Tibshirani (1990), is used to fit the additive logistic model. Starting with guesses $f_1(x_1), \ldots, f_p(x_p)$, $F(x) = \sum_k f_k(x_k)$ and $p(x)$ defined in (9), we form the working response:

    z = F(x) + \frac{y - p(x)}{p(x)(1 - p(x))}.     (10)

We then apply backfitting to the response $z$ with observation weights $p(x)(1 - p(x))$ to obtain new $f_k(x_k)$. This process is repeated until convergence. The forward stage-wise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
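A minimal sketch of one step based on the working response (10), shown for a linear $F(x)$ for simplicity; the function name is ours, and the additive Local Scoring version would replace the weighted least-squares solve with a cycle of weighted backfitting.

    import numpy as np

    def local_scoring_step(X, y, F):
        """One IRLS-style step using the working response (10).
        X should include a column of ones if an intercept is desired."""
        p = 1.0 / (1.0 + np.exp(-F))                 # current probabilities, from (9)
        w = np.clip(p * (1 - p), 1e-6, None)         # observation weights
        z = F + (y - p) / w                          # working response (10)
        Xw = X * np.sqrt(w)[:, None]                 # weighted least squares of z on X
        zw = z * np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
        return X @ beta                              # updated F(x)

Iterating this step to convergence reproduces ordinary logistic regression; substituting a weighted additive fit for the least-squares solve yields the additive logistic model of Hastie & Tibshirani (1990).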

3 Boosting: an Additive Logistic Regression Model

In this section we show that the boosting algorithms are stage-wise estimation procedures for fitting an additive logistic regression model. They optimize an exponential criterion which to second order is equivalent to the binomial log-likelihood criterion. We then propose a more standard likelihood-based procedure.

3.1 An Exponential Criterion

Consider minimizing the criterion

    J(F) = E(e^{-yF(x)})     (11)

for estimation of $F(x)$. (Footnote 1: $E$ represents expectation; depending on the context, this may be an $L_2$ population expectation, or else a sample average. $E_w$ means a weighted expectation.)

Lemma 1 shows that the function $F(x)$ that minimizes the $L_2$ version of the exponential criterion is the symmetric logistic transform of $P(y = 1 | x)$.

Lemma 1  $E(e^{-yF(x)})$ is minimized at

    F(x) = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}.     (12)

Hence

    P(y = 1 | x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}     (13)

    P(y = -1 | x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}.     (14)

Proof
While $E$ entails expectation over the joint distribution of $y$ and $x$, it is sufficient to minimize the criterion conditional on $x$:

    E(e^{-yF(x)} | x) = P(y = 1 | x)\, e^{-F(x)} + P(y = -1 | x)\, e^{F(x)}

    \frac{\partial E(e^{-yF(x)} | x)}{\partial F(x)} = -P(y = 1 | x)\, e^{-F(x)} + P(y = -1 | x)\, e^{F(x)}.

Setting the derivative to zero, the result follows. □

The usual logistic transform does not have the factor $\frac{1}{2}$ in (12); by multiplying the numerator and denominator in (13) by $e^{F(x)}$, we get the usual logistic model

    p(x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}.     (15)

Hence the two models are equivalent.

Corollary 1  If $E$ is replaced by averages over regions of $x$ where $F(x)$ is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of $y = 1$ and $y = -1$.

In Proposition 1 we show that the Discrete AdaBoost increments in Algorithm 1 are Newton-style updates for minimizing the exponential criterion. This can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model.

Proposition 1  The Discrete AdaBoost algorithm produces adaptive Newton updates for minimizing $E(e^{-yF(x)})$, which are stage-wise contributions to an additive logistic regression model.

Proof
Let $J(F) = E[e^{-yF(x)}]$. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + cf(x)$. For fixed $c$ (and $x$), we expand $J(F(x) + cf(x))$ to second order about $f(x) = 0$:

    J(F + cf) = E[e^{-y(F(x) + cf(x))}]
              \approx E[e^{-yF(x)} (1 - y c f(x) + c^2 f(x)^2 / 2)].

Minimizing pointwise with respect to $f(x) \in \{-1, 1\}$, we find

    \hat{f}(x) = \arg\min_f E_w(1 - y c f(x) + c^2 f(x)^2 / 2 \mid x)
               = \arg\min_f E_w[(y - c f(x))^2 \mid x]     (16)
               = \arg\min_f E_w[(y - f(x))^2 \mid x],     (17)

where $w(y|x) = \exp(-yF(x)) / E\exp(-yF(x))$, and (17) follows from (16) by considering the two possible choices for $f(x)$.

Given $\hat{f} \in \{-1, 1\}$, we can directly minimize $J(F + c\hat{f})$ to determine $c$:

    \hat{c} = \arg\min_c E_w e^{-cy\hat{f}(x)} = \frac{1}{2} \log \frac{1 - e}{e},

where $e = E_w[1_{(y \neq \hat{f}(x))}]$. Combining these steps we get the update for $F(x)$:

    F(x) \leftarrow F(x) + \frac{1}{2} \log \frac{1 - e}{e} \, \hat{f}(x).

In the next iteration the new contribution $\hat{c}\hat{f}(x)$ to $F(x)$ augments the weights:

    w(y|x) \leftarrow w(y|x) \cdot e^{-\hat{c}\hat{f}(x)y},

followed by a normalization. Since $-y\hat{f}(x) = 2 \cdot 1_{(y \neq \hat{f}(x))} - 1$, we see that the update is equivalent to

    w(y|x) \leftarrow w(y|x) \cdot \exp\Big( \log\frac{1 - e}{e} \, 1_{(y \neq \hat{f}(x))} \Big).

Thus the function and weight updates are identical to those used in Discrete AdaBoost. □

Parts of this derivation for AdaBoost can be found in Schapire & Singer (1998). Newton algorithms repeatedly optimize a quadratic approximation to a nonlinear criterion, which is the first part of the update in AdaBoost. This $L_2$ version of AdaBoost translates naturally to a data version using trees. The weighted least-squares criterion (17) is used to grow the tree-based classifier $\hat{f}(x)$, and given $\hat{f}(x)$, the constant $c$ is based on the weighted training error.

Note that after each Newton step, the weights change, and hence the tree configuration will change as well. This adds a nonlinear twist to the Newton algorithm.

Corollary 2  After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof
This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

    \frac{\partial J(F + cf)}{\partial c} = -E[e^{-y(F(x) + cf(x))} y f(x)] = 0.     (18)

The result follows since $yf(x)$ is $1$ for a correct classification, and $-1$ for a misclassification. □

Schapire & Singer (1998) give the interpretation that the weights are updated to make the new weighted problem maximally difficult for the next weak learner.

The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. We now show that the Real AdaBoost algorithm uses the weighted probability estimates in the terminal nodes of the tree to update the additive logistic model, rather than the classifications themselves. Again we derive the population algorithm, and then apply it to data.

Proposition 2  The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of $J(F) = E[e^{-yF(x)}]$.

Proof
Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$:

    \frac{\partial J(F(x) + f(x))}{\partial f(x)} = -E(e^{-yF(x)} y e^{-yf(x)} | x)
      = -E[e^{-yF(x)} 1_{(y=1)} e^{-f(x)} | x] + E[e^{-yF(x)} 1_{(y=-1)} e^{f(x)} | x].

Dividing through by $E(e^{-yF(x)} | x)$ and setting the derivative to zero, we get

    \hat{f}(x) = \frac{1}{2} \log \frac{E_w[1_{(y=1)} | x]}{E_w[1_{(y=-1)} | x]}     (19)
               = \frac{1}{2} \log \frac{P_w(y = 1 | x)}{P_w(y = -1 | x)},     (20)

where $w(y|x) = \exp(-yF(x)) / E(\exp(-yF(x)) | x)$. □

Careful examination of Schapire & Singer (1998) shows that this update matches theirs (and the $c_m$ in Algorithm 2 are redundant). The weights get updated by

    w(y|x) \leftarrow w(y|x) \cdot e^{-y\hat{f}(x)}.

Corollary 3  At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof
If $F(x)$ is optimal, we have

    \frac{\partial J(F(x))}{\partial F(x)} = -E\, e^{-yF(x)} y = 0.     (21)

□

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function $F$, there is no further information about $F$ in the weighted conditional distribution of $y$. If there is, we use it to update $F$.

At iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

    F(x) = \sum_{m=1}^{M} f_m(x),     (22)

where each of the components is found in a greedy forward stage-wise fashion, fixing the earlier components. Our term "stage-wise" refers to a similar approach in statistics:

- Variables are included sequentially in a stepwise regression.

- The coefficients of variables already included receive no further adjustment.

3.2 Why $E e^{-yF(x)}$?

So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

- Schapire & Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error (see Fig. 2);

- the AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

    p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.

The expected binomial log-likelihood is

    E\, \ell(y^*, p(x)) = E[y^* \log(p(x)) + (1 - y^*) \log(1 - p(x))]     (23)
                        = -E \log(1 + e^{-2yF(x)}).     (24)

- The population minimizers of $-E\, \ell(y^*, p(x))$ and $E e^{-yF(x)}$ coincide. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around $F = 0$:

    -\ell(y^*, p) \approx \exp(-yF) + \log(2) - 1.     (25)

  Graphs of $\exp(-yF)$ and $\log(1 + e^{-2yF(x)})$ (suitably scaled) are shown in Fig. 2, as a function of $yF$; positive values of $yF$ imply correct classification. Note that $\exp(-yF)$ itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on $\pm 1$.

- Also shown in Fig. 2 is the indicator function $1_{(yF < 0)}$, which gives misclassification error.

- There is another way to view the criterion $J(F)$. It is easy to show that

    e^{-yF(x)} = \frac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}},     (26)

  with $F(x) = \frac{1}{2}\log(p(x)/(1 - p(x)))$. The right-hand side is known as the Chi statistic in the statistical literature.

One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).

Why not estimate the $f_m$ by minimizing the squared error $E(y - F(x))^2$? If $F_{m-1}(x) = \sum_1^{m-1} f_j(x)$ is the current prediction, this leads to a forward stage-wise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$ (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the non-monotonicity of squared error loss (Fig. 2) is the reason. Correct classifications ($yF(x) > 1$) incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.

[Figure 2: Losses as approximations to misclassification error. A variety of loss functions for estimating a function F(x) for classification (misclassification, Schapire-Singer exponential, log-likelihood, squared error in p, squared error in F), plotted against yF. The horizontal axis is yF, which is negative for errors and positive for correct classifications. All the loss functions are monotone in yF. The curve labeled "Squared Error(p)" is (y* - p)^2, and gives a uniformly better approximation to misclassification loss than the exponential criterion (Schapire-Singer). The curve labeled "Squared Error(F)" is (y - F)^2, and increases once yF exceeds 1, thereby increasingly penalizing classifications that are "too correct".]

3.3 Using the log-likelihood criterion

In this section we explore algorithms for fitting additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case, and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

    p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.     (27)

Algorithm 3 gives the details.

LogitBoost (two classes)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p_i = \frac{1}{2}$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Compute the working response and weights

       z_i = \frac{y_i^* - p_i}{p_i(1 - p_i)}, \qquad w_i = p_i(1 - p_i).

   (b) Estimate $f_m(x)$ by a weighted least-squares fit of $z_i$ to $x_i$.

   (c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x)$ via (27).

3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 3: An adaptive Newton algorithm for fitting an additive logistic regression model.
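Here is a minimal sketch of Algorithm 3 in Python, using a regression stump as the weighted least-squares learner; the clipping constants follow the implementation protections discussed after Proposition 3, and the learner choice and helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost_two_class(X, ystar, M=100, z_max=4.0):
        """Sketch of Algorithm 3 (two-class LogitBoost); ystar takes values in {0, 1}."""
        N = len(ystar)
        F = np.zeros(N)
        trees = []
        for m in range(M):
            p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p(x) from (27)
            w = np.clip(p * (1 - p), 1e-10, None)        # Newton weights, floored
            z = np.clip((ystar - p) / w, -z_max, z_max)  # working response, thresholded
            # (b) weighted least-squares fit of z on x (here: a regression stump)
            tree = DecisionTreeRegressor(max_depth=1).fit(X, z, sample_weight=w)
            F += 0.5 * tree.predict(X)                   # (c) update F(x)
            trees.append(tree)
        def F_fit(Xnew):
            return 0.5 * sum(t.predict(Xnew) for t in trees)
        return F_fit                                     # classify with sign(F_fit(x))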

Proposition 3  The LogitBoost algorithm uses adaptive Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Proof
Consider the update $F(x) + f(x)$ and the expected log-likelihood

    \ell(F + f) = E\big[ 2y^*(F(x) + f(x)) - \log(1 + \exp(2(F(x) + f(x)))) \big].     (28)

Conditioning on $x$, we compute the first and second derivatives at $f(x) = 0$:

    s(x) = \frac{\partial \ell(F(x) + f(x))}{\partial f(x)} \Big|_{f(x)=0} = 2\, E(y^* - p(x) | x)     (29)

    H(x) = \frac{\partial^2 \ell(F(x) + f(x))}{\partial f(x)^2} \Big|_{f(x)=0} = -4\, E(p(x)(1 - p(x)) | x),     (30)

where $p(x)$ is defined in terms of $F(x)$. The Newton update is then

    F(x) \leftarrow F(x) - H(x)^{-1} s(x)
         = F(x) + \frac{1}{2} \frac{E(y^* - p(x) | x)}{E(p(x)(1 - p(x)) | x)}     (31)
         = F(x) + \frac{1}{2} E_w\Big( \frac{y^* - p(x)}{p(x)(1 - p(x))} \,\Big|\, x \Big),     (32)

where $w(x) = p(x)(1 - p(x))$. Equivalently, the Newton update solves the weighted least-squares criterion

    \min_{f(x)} E_{w(x)}\Big( f(x) - \frac{1}{2} \frac{y^* - p(x)}{p(x)(1 - p(x))} \Big)^2.     (33)

□

The population algorithm described here translates immediately to an implementation on data when $E(\cdot|x)$ is replaced by a regression method, such as regression trees (Breiman et al. 1984). While the role of the weights is somewhat artificial in the $L_2$ case, it is not in any implementation; $w(x)$ is constant when conditioned on $x$, but the $w(x_i)$ in a terminal node of a tree, for example, depend on the current values $F(x_i)$, and will typically not be constant.

Sometimes the $w(x)$ get very small in regions of $x$ perceived (by $F(x)$) to be pure, that is, when $p(x)$ is close to 0 or 1. This can cause numerical problems in the construction of $z$, and led to the following crucial implementation protections:



- If $y^* = 1$, then compute $z = \frac{y^* - p}{p(1-p)} = \frac{1}{p}$. Since this number can get large if $p$ is small, threshold this ratio at $z_{\max}$. The particular value chosen for $z_{\max}$ is not crucial; we have found empirically that $z_{\max} \in [2, 4]$ works well. Likewise, if $y^* = 0$, compute $z = -\frac{1}{1-p}$ with a lower threshold of $-z_{\max}$.

- Enforce a lower threshold on the weights: $w = \max(w, 2 \times$ machine-zero$)$.

3.4 Optimizing $E e^{-yF(x)}$ by Newton stepping

The $L_2$ Real AdaBoost procedure (Algorithm 2) optimizes $E e^{-y(F(x)+f(x))}$ exactly with respect to $f$ at each iteration. Here we explore a "gentler" version that instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.

Gentle AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate $f_m(x)$ by a weighted least-squares fit of $y$ to $x$.

   (b) Update $F(x) \leftarrow F(x) + f_m(x)$.

   (c) Update $w_i \leftarrow w_i e^{-y_i f_m(x_i)}$ and renormalize.

3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 4: A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
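A minimal sketch of Algorithm 4 follows; as before, the regression stump stands in for the weighted least-squares learner and the helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gentle_adaboost(X, y, M=100):
        """Sketch of Algorithm 4 (Gentle AdaBoost); y takes values in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)
        trees = []
        for m in range(M):
            # (a) weighted least-squares fit of y on x (regression stump here)
            tree = DecisionTreeRegressor(max_depth=1).fit(X, y, sample_weight=w)
            f = tree.predict(X)
            # (b)-(c) the update to F is the sum of the stored trees; reweight and renormalize
            w *= np.exp(-y * f)
            w /= w.sum()
            trees.append(tree)
        def F(Xnew):
            return sum(t.predict(Xnew) for t in trees)
        return lambda Xnew: np.sign(F(Xnew))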

Proposition 4  The Gentle AdaBoost algorithm uses adaptive Newton steps for minimizing $E e^{-yF(x)}$.

Proof

    \frac{\partial J(F(x) + f(x))}{\partial f(x)} \Big|_{f(x)=0} = -E(e^{-yF(x)} y | x)

    \frac{\partial^2 J(F(x) + f(x))}{\partial f(x)^2} \Big|_{f(x)=0} = E(e^{-yF(x)} | x) \quad \text{since } y^2 = 1.

Hence the Newton update is

    F(x) \leftarrow F(x) + \frac{E(e^{-yF(x)} y | x)}{E(e^{-yF(x)} | x)} = F(x) + E_w(y | x),

where $w(y|x) = e^{-yF(x)} / E(e^{-yF(x)} | x)$. □

The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is $f_m(x) = P_w(y = 1|x) - P_w(y = -1|x)$, rather than half the log-ratio as in (20): $f_m(x) = \frac{1}{2}\log \frac{P_w(y=1|x)}{P_w(y=-1|x)}$. Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range $[-1, 1]$. Empirical evidence suggests (see Section 6) that this more conservative algorithm has performance similar to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.

Freund & Schapire (1996) also propose an AdaBoost algorithm similar to Gentle AdaBoost, except that the function $f_m(x)$ is multiplied by a constant ($c_m f_m(x)$) which is estimated in a global fashion, much as in Discrete AdaBoost. We do not pursue this particular variant further here.

There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let $p = P(y = 1 | x)$ and $p_m = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$. Then

    \frac{E(e^{-yF(x)} y | x)}{E(e^{-yF(x)} | x)} = \frac{e^{-F(x)} p - e^{F(x)}(1 - p)}{e^{-F(x)} p + e^{F(x)}(1 - p)} = \frac{p - p_m}{(1 - p_m)p + p_m(1 - p)}.     (34)

The analogous expression for LogitBoost, from (31), is

    \frac{1}{2} \frac{p - p_m}{p_m(1 - p_m)}.     (35)

At $p_m \approx \frac{1}{2}$ these are nearly the same, but they differ as the $p_m$ become extreme. For example, if $p \approx 1$ and $p_m \approx 0$, (35) blows up, while (34) is about 1 (and always falls in $[-1, 1]$).

4 Multiclass procedures

Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire & Singer (1998) define $J$ responses $y_j$ for a $J$-class problem, each taking values in $\{-1, 1\}$. Similarly, the indicator response vector with elements $y_j^*$ is more standard in the statistics literature. Assume the classes are mutually exclusive.

Definition 1  For a $J$-class problem let $P_j(x) = P(y_j = 1 | x)$. We define the symmetric multiple logistic transformation

    F_j(x) = \log P_j(x) - \frac{1}{J} \sum_{k=1}^{J} \log P_k(x).     (36)

Equivalently,

    P_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0.     (37)

The centering condition in (37) is for numerical stability only; it simply pins the $F_j$ down, else we could add an arbitrary constant to each $F_j$ and the probabilities would remain the same. The equivalence of these two definitions is easily established, as well as the equivalence with the two-class case.
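A minimal sketch of the transformation (36) and its inverse (37), written as plain NumPy functions whose names are ours:

    import numpy as np

    def symmetric_logistic(P):
        """Map class probabilities P (shape n x J, rows strictly positive and summing to 1)
        to F via (36): subtract the mean log-probability."""
        logP = np.log(P)
        return logP - logP.mean(axis=1, keepdims=True)

    def inverse_symmetric_logistic(F):
        """Map F back to probabilities via (37)."""
        E = np.exp(F - F.max(axis=1, keepdims=True))   # subtract the max for numerical stability
        return E / E.sum(axis=1, keepdims=True)

Round-tripping probabilities through these two functions returns them unchanged, and adding the same constant to every $F_j(x)$ leaves the probabilities in (37) unaffected, which is why the centering in (36) is purely a normalization.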

Schapire & Singer (1998) provide several generalizations of AdaBoost for the multiclass case; we describe their AdaBoost.MH algorithm (Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable $C$.

AdaBoost.MH (Schapire & Singer 1998)

The original $N$ observations are expanded into $N \times J$ pairs

    ((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), \ldots, ((x_i, J), y_{iJ}), \quad i = 1, \ldots, N.

1. Start with weights $w_{ij} = 1/(NJ)$, $i = 1, \ldots, N$, $j = 1, \ldots, J$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the "confidence-rated" classifier $f_m(x, j): \mathcal{X} \times (1, \ldots, J) \mapsto \mathbb{R}$ from the training data with weights $w_{ij}$.

   (b) Set $w_{ij} \leftarrow w_{ij} \exp[-y_{ij} f_m(x_i, j)]$, $i = 1, 2, \ldots, N$, $j = 1, \ldots, J$, and renormalize so that $\sum_{i,j} w_{ij} = 1$.

3. Output the classifier $\arg\max_j F(x, j)$, where $F(x, j) = \sum_m f_m(x, j)$.

Algorithm 5: The AdaBoost.MH algorithm converts the $J$-class problem into that of estimating a two-class classifier on a training set $J$ times as large, with an additional "feature" defined by the set of class labels.

We make a few observations:

- The $L_2$ version of this algorithm minimizes $\sum_{j=1}^{J} E e^{-y_j F_j(x)}$, which is equivalent to running separate $L_2$ boosting algorithms on each of the $J$ problems of size $N$ obtained by partitioning the $N \times J$ samples in the obvious fashion. This is seen trivially by first conditioning on $C = j$, and then on $x | C = j$, when computing conditional expectations.

- The same is almost true for their tree-based algorithm. We see this because:

  1. If the first split is on $C$, either a $J$-nary split if permitted, or else $J - 1$ binary splits, then the sub-trees are identical to separate trees grown for each of the $J$ groups. This will always be the case for the first tree.

  2. If a tree does not split on $C$ anywhere on the path to a terminal node, then that node returns a function $f_m(x, j) = g_m(x)$ that contributes nothing to the classification decision. However, as long as a tree includes a split on $C$ at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.

The advantage or disadvantage of building one large tree using the class label as an additional input feature is not clear. No motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building $J$ separate trees to minimize $\sum_{j=1}^{J} E e^{-y_j F_j(x)}$. We have thus shown

Proposition 5  The AdaBoost.MH algorithm for a $J$-class problem fits $J$ uncoupled additive logistic models, $G_j(x) = \frac{1}{2} \log P_j(x)/(1 - P_j(x))$, each class against the rest.

In principle this parametrization is fine, since $G_j(x)$ is monotone in $P_j(x)$. However, we are estimating the $G_j(x)$ in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and where AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.

Schapire and Singer's AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The "MH" stands for "Multi-Label Hamming", Hamming loss being used to measure the errors in the space of $2^J$ possible class labels. In this context fitting a separate classifier for each label is a reasonable strategy. Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.

Algorithm 6 is a natural generalization of Algorithm 3 for fitting the $J$-class logistic regression model (37).

LogitBoost (J classes)

1. Start with weights $w_{ij} = 1/N$, $i = 1, \ldots, N$, $j = 1, \ldots, J$, $F_j(x) = 0$ and $P_j(x) = 1/J$ for all $j$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Repeat for $j = 1, \ldots, J$:

       i. Compute working responses and weights in the $j$th class

          z_{ij} = \frac{y_{ij}^* - p_{ij}}{p_{ij}(1 - p_{ij})}, \qquad w_{ij} = p_{ij}(1 - p_{ij}).

       ii. Estimate $f_{mj}(x)$ by a weighted least-squares fit of $z_{ij}$ to $x_i$.

   (b) Set $f_{mj}(x) \leftarrow \frac{J-1}{J}\Big( f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x) \Big)$, and $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$.

   (c) Update $P_j(x)$ via (37).

3. Output the classifier $\arg\max_j F_j(x)$.

Algorithm 6: An adaptive Newton algorithm for fitting an additive multiple logistic regression model.
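A compact sketch of Algorithm 6, again using regression stumps for the weighted least-squares fits and the same working-response clipping as in the two-class case; these choices and all names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost_multiclass(X, labels, J, M=100, z_max=4.0):
        """Sketch of Algorithm 6; labels are integers in {0, ..., J-1}."""
        N = len(labels)
        Y = np.eye(J)[labels]                          # indicator responses y*_{ij}
        F = np.zeros((N, J))
        rounds = []                                    # rounds[m][j] holds the tree for f_{mj}
        for m in range(M):
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)          # P_j(x) via (37)
            fits, round_trees = np.zeros((N, J)), []
            for j in range(J):
                w = np.clip(P[:, j] * (1 - P[:, j]), 1e-10, None)
                z = np.clip((Y[:, j] - P[:, j]) / w, -z_max, z_max)
                tree = DecisionTreeRegressor(max_depth=1).fit(X, z, sample_weight=w)
                fits[:, j] = tree.predict(X)
                round_trees.append(tree)
            # step (b): symmetrize the J fits and update F
            F += (J - 1) / J * (fits - fits.mean(axis=1, keepdims=True))
            rounds.append(round_trees)
        def F_fit(Xnew):
            Fn = np.zeros((len(Xnew), J))
            for round_trees in rounds:
                f = np.column_stack([t.predict(Xnew) for t in round_trees])
                Fn += (J - 1) / J * (f - f.mean(axis=1, keepdims=True))
            return Fn
        return lambda Xnew: F_fit(Xnew).argmax(axis=1)  # step 3: argmax_j F_j(x)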

Proposition 6  The LogitBoost algorithm (Algorithm 6) uses adaptive quasi-Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

We sketch an informal proof.

Proof

- We first give the $L_2$ score and Hessian for the Newton algorithm corresponding to a standard multi-logit parametrization $G_j(x)$, with, say, $G_J(x) = 0$. The expected log-likelihood is

    \ell(G + g) = E\Big[ \sum_{j=1}^{J-1} y_j^*(G_j(x) + g_j(x)) - \log\Big(1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)}\Big) \Big]

    s_j(x) = E(y_j^* - p_j(x) | x), \quad j = 1, \ldots, J - 1

    H_{j,k}(x) = -p_j(x)(\delta_{jk} - p_k(x)), \quad j, k = 1, \ldots, J - 1.

- Our quasi-Newton update amounts to using a diagonal approximation to the Hessian:

    g_j(x) = \frac{E(y_j^* - p_j(x) | x)}{p_j(x)(1 - p_j(x))}, \quad j = 1, \ldots, J - 1.

- To convert to the symmetric parametrization, we would note that $g_J = 0$, and set $f_j(x) = g_j(x) - \frac{1}{J}\sum_{k=1}^{J} g_k(x)$. However, this procedure could be applied using any class as the base, not just the $J$th. By averaging over all choices of the base class, we get the update

    f_j(x) = \frac{J-1}{J} \left( \frac{E(y_j^* - p_j(x) | x)}{p_j(x)(1 - p_j(x))} - \frac{1}{J} \sum_{k=1}^{J} \frac{E(y_k^* - p_k(x) | x)}{p_k(x)(1 - p_k(x))} \right).

□

5 Simulation studies

In this section the four flavors of boosting outlined above are applied to several artificially constructed problems. Comparisons based on real data are presented in Section 6.

An advantage of comparisons made in a simulation setting is that all aspects of each example are known, including the Bayes error rate and the complexity of the decision boundary. In addition, the population expected error rates achieved by each of the respective methods can be estimated to arbitrary accuracy by averaging over a large number of different training and test data sets drawn from the population. The four boosting methods compared here are:

DAB: Discrete AdaBoost (Algorithm 1).

RAB: Real AdaBoost (Algorithm 2 for two classes, and Algorithm 5 (AdaBoost.MH) for more than two classes).

LB: LogitBoost (Algorithms 3 and 6).

GAB: Gentle AdaBoost (Algorithm 4).

In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten predictive features for all examples are randomly drawn from a ten-dimensional standard normal distribution $x \sim N^{10}(0, I)$. For the first three examples the decision boundaries separating successive classes are nested concentric ten-dimensional spheres constructed by thresholding the squared radius from the origin

    r^2 = \sum_{j=1}^{10} x_j^2.     (38)

Each class $C_k$ ($1 \le k \le K$) is defined as the subset of observations

    C_k = \{x_i \mid t_{k-1} \le r_i^2 < t_k\},     (39)

with $t_0 = 0$ and $t_K = \infty$. The $\{t_k\}_1^{K-1}$ for each example were chosen so as to put approximately equal numbers of observations in each class. The training sample size is $N = K \times 1000$, so that approximately 1000 training observations are in each class. An independently drawn test set of 10000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training/test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
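For illustration, a minimal sketch that generates such nested-spheres data; using the chi-squared quantiles as the population thresholds is our choice for equalizing the class probabilities (the paper chooses thresholds to balance the observed class counts), and the SciPy dependency and function names are ours.

    import numpy as np
    from scipy.stats import chi2

    def nested_spheres(K, n_per_class=1000, p=10, seed=0):
        """Nested-spheres data of Section 5: x ~ N(0, I_p), classes defined by
        thresholding r^2 = sum_j x_j^2 as in (38)-(39)."""
        rng = np.random.default_rng(seed)
        N = K * n_per_class
        X = rng.standard_normal((N, p))
        r2 = (X ** 2).sum(axis=1)
        t = chi2.ppf(np.arange(1, K) / K, df=p)   # population thresholds t_1, ..., t_{K-1}
        y = np.searchsorted(t, r2)                # class k such that t_{k-1} <= r^2 < t_k
        return X, y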

Figure 3 [top-left] compares the four algorithms in the two-class ($K = 2$) case using a two-terminal-node decision tree ("stump") as the base classifier. Shown is the test error rate as a function of the number of boosting iterations. The upper (black) line represents DAB and the other three nearly coincident lines are the other three methods (dotted red = RAB, short-dashed green = LB, and long-dashed blue = GAB). Note that the somewhat erratic behavior of DAB, especially for less than 200 iterations, is not due to statistical uncertainty. For less than 400 iterations LB has a minuscule edge; after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here, with roughly twice the error rate at all iterations.

Figure 3 [lower-left] shows the corresponding results for three classes ($K = 3$), again with two-terminal-node trees. Here the problem is more difficult, as represented by increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB and the other three nearly coincident lines are the other three methods. The situation is somewhat different for larger numbers of classes ($K \ge 4$). Figure 3 [lower-right] shows results for $K = 5$, which are typical for $K \ge 4$: as before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed. Up to about 40 iterations it has the same error rate. From 40 to about 100 iterations LB's error rates are somewhat higher than the other two. After 100 iterations the error rate for LB continues to improve, whereas that for RAB and GAB levels off, decreasing much more slowly. By 800 iterations the error rate for LB is 0.19 whereas that for RAB and GAB is 0.32. Speculation as to the reason for LB's performance gain in these situations is presented below.

[Figure 3: Additive Decision Boundary. Four panels (Stumps - 2 Classes, Eight Node Trees - 2 Classes, Stumps - 3 Classes, Stumps - 5 Classes) show test error versus number of terms for DAB, RAB, LB and GAB. In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.]

In the above examples a two-terminal-node tree (stump) was used as the base classifier. One might expect the use of larger trees to do better for these rather complex problems. Figure 3 [top-right] shows results for the two-class problem, here boosting trees with eight terminal nodes. These results can be compared to those for stumps in Fig. 3 [top-left]. Initially, error rates for boosting eight-node trees decrease much more rapidly than for stumps, with each successive iteration, for all methods. However, the error rates quickly level off and improvement is very slow after about 100 iterations. The overall performance of DAB is much improved with the bigger trees, coming close to that of the other three methods. As before, RAB, GAB, and LB exhibit nearly identical performance. Note that at each iteration the eight-node tree model consists of four times the number of additive terms as does the corresponding stump model. This is why the error rates decrease so much more rapidly in the early iterations. In terms of model complexity (and training time), a 100-iteration model using eight-terminal-node trees is equivalent to a 400-iteration stump model.

Comparing the top two panels in Fig. 3, one sees that for RAB, GAB, and LB the error rate using the bigger trees (0.072) is in fact 33% higher than that for stumps (0.054) at 800 iterations, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. In general, the decision boundary between any two classes can be defined by an equation of the form $B(x) = c$; using $\mathrm{sign}[B(x) - c]$ to discriminate between the two classes results in the optimal Bayes rule. It is the goal of any classifier to approximate $B(x)$ (which is not necessarily unique) as closely as possible under whatever constraints (concept bias) are associated with the classifier. As discussed above, boosting produces an additive model whose components (basis functions) are represented by the base classifier. If a boundary function $B(x)$ for a particular problem happens to be well approximated by such an additive model then performance is likely to be good; if not, high error rates are a likely result.

With stumps as the base classifier, each basis function involves only a single (splitting) predictor variable:

    f_k(x) = c_k\, 1_{(s_k(x_{j(k)} - t_k) > 0)}.     (40)

Here $j(k)$ is the split variable, $t_k$ the split point, and $s_k = \pm 1$ the direction (+ = right son, - = left son) associated with the $k$th term. The boosted model is a linear combination of such single-variable functions. Let

    g_j(x_j) = \sum_{j(k) = j} f_k(x_j).     (41)

That is, each $g_j(x_j)$ is the sum of all (stump) functions involving the same ($j$th) predictor, with $g_j(x_j) = 0$ if none exist. Then the model produced by boosting stumps is additive in the original features:

    F(x) = \sum_{j=1}^{p} g_j(x_j).     (42)
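To illustrate (41)-(42), here is a small sketch that collects the terms of a boosted-stump model by splitting variable so the fitted function can be inspected one coordinate at a time; the stump representation (a list of (j, t, s, c) tuples) is our own simplified encoding, not something defined in the paper.

    from collections import defaultdict

    def per_feature_components(stumps):
        """Group stump terms f_k(x) = c_k * 1{s_k (x_{j(k)} - t_k) > 0} by split variable j(k),
        returning for each feature j a function g_j(x_j) as in (41)."""
        by_feature = defaultdict(list)
        for j, t, s, c in stumps:            # split variable, split point, direction, constant
            by_feature[j].append((t, s, c))
        def make_g(terms):
            return lambda xj: sum(c for (t, s, c) in terms if s * (xj - t) > 0)
        return {j: make_g(terms) for j, terms in by_feature.items()}

    # F(x) = sum over j of g_j(x_j), the additive decomposition (42)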

Examination of (38) and (39) reveals that an optimal decision boundary for the above examples is also additive in the original features, with $f_j(x_j) = x_j^2 + \mathrm{const}$. Thus, in the context of decision trees, stumps are ideally matched to these problems; larger trees are not needed. However, boosting larger trees need not be counterproductive in this case if all of the splits in each individual tree are made on the same predictor variable. This would also produce an additive model in the original features (42). However, due to the forward greedy stage-wise strategy used by boosting, this is not likely to happen if the decision boundary function involves more than one predictor; each individual tree will try to do its best to involve all of the important predictors. Owing to the nature of decision trees, this will produce models with interaction effects; most terms in the model will involve products in more than one variable. Such non-additive models are not as well suited for approximating truly additive decision boundaries such as (38) and (39). This is reflected in the increased error rate observed in Fig. 3.

It should be noted that if $B(x) = c$ describes a decision boundary, any monotone transformation of both sides describes the same boundary. This fact makes boosting stumps more general than one might at first think. For example, suppose $B(x) = \prod_{j=1}^{p} f_j(x_j)$ with all quantities being strictly positive. Such a boundary is very complex, involving interactions to the highest order. However, $\tilde{B}(x) = \log B(x) = \sum_{j=1}^{p} \log f_j(x_j)$ is an additive function (in the original variables) and $\tilde{B}(x) = \log c$ describes the same boundary. Thus, boosting stumps would be optimal in spite of the apparent complexity of this $B(x)$. More generally, if any monotone transformation of the decision boundary definition renders it approximately additive, boosting stumps should be competitive. Also note that the "naive" Bayes procedure, which is often highly competitive, produces decision boundary estimates that, after logarithmic transformation, are additive in the original predictors. Boosting decision stumps is considerably more flexible than most implementations of "naive" Bayes.

The above discussion suggests that if the decision boundary separating pairs of classes were inherently non-additive in the original predictor variables, then boosting stumps would be less advantageous than using larger trees. A tree with $m$ terminal nodes can produce basis functions with a maximum interaction order of $\min(m - 1, p)$, where $p$ is the number of predictor features. These higher-order basis functions provide the possibility to more accurately estimate those $B(x)$ with high-order interactions. The purpose of the next example is to verify this intuition. There are two classes ($K = 2$) and 5000 training observations, with the $\{x_i\}_1^{5000}$ drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

    \log \frac{\Pr[y = 1 \mid x]}{\Pr[y = -1 \mid x]} = 10 \sum_{j=1}^{6} x_j \Big( 1 + \sum_{l=1}^{6} (-1)^l x_l \Big).

Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables, involving all of them in second-order interactions of equal strength. As in the above examples, test sets of 10000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.

Figure 4 [top-left] shows test error rate as a function of iteration number for each of the four boosting methods using stumps. As in the previous examples, RAB and GAB track each other very closely. DAB begins very slowly, being dominated by all of the others until around 180 iterations, where it passes below RAB and GAB. LB mostly dominates, having the lowest error rate until about 650 iterations. At that point DAB catches up, and by 800 iterations it may have a very slight edge. However, none of these boosting methods perform well with stumps on this problem, the best error rate being 0.35.

[Figure 4: Non-Additive (Interactive) Decision Boundary. Three panels (Stumps - 2 Classes, Four Node Trees - 2 Classes, Eight Node Trees - 2 Classes) show test error versus number of terms for DAB, RAB, LB and GAB.]

Figure 4 [top-right] shows the corresponding plot when four-terminal-node trees are boosted. Here there is a dramatic improvement for all four methods. For the first time there is some small differentiation between RAB and GAB. At nearly all iterations the performance ranking is LB best, followed by GAB, RAB, and DAB in order. At 800 iterations LB achieves an error rate of 0.134. Figure 4 [lower-left] shows results when eight-terminal-node trees are boosted. Here, error rates are generally further reduced, with LB improving the least (0.130) but still dominating. The performance ranking among the other three methods changes with increasing iterations; DAB overtakes RAB at around 150 iterations and GAB at about 230, becoming fairly close to LB by 800 iterations with an error rate of 0.138.

Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case: namely, when the decision boundaries $B(x)$ can be closely approximated by functions that are additive in the original predictor features. When higher-order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, the relative performance of these four methods depends on the problem at hand in terms of the nature of the decision boundaries, the complexity of the base classifier, and the number of boosting iterations.

The superior performance of LB in Fig. 3 [lower-right] appears to be a consequence of the multi-class logistic model (Algorithm 6). All of the other methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building separate two-class models for each individual class against the pooled complement classes. Even if the decision boundaries separating all class pairs are relatively simple, pooling classes can produce complex decision boundaries that are difficult to approximate (Friedman 1996). By considering all of the classes simultaneously, the symmetric multi-class model is better able to take advantage of simple pairwise boundaries when they exist (Hastie & Tibshirani 1998). As noted above, the pairwise boundaries induced by (38) and (39) are simple when viewed in the context of additive modeling, whereas the pooled boundaries are more complex; they cannot be well approximated by functions that are additive in the original predictor variables.

The decision boundaries associated with these examples were deliberately chosen to be geometrically complex in an attempt to elicit performance differences among the methods being tested. Such complicated boundaries are not likely to occur often in practice. Many practical problems involve comparatively simple boundaries (Holte 1993); in such cases performance differences will still be situation dependent, but correspondingly less pronounced.

6 Some experiments with data

In this section we show the results of running the four fitting methods LogitBoost, Discrete AdaBoost, Real AdaBoost, and Gentle AdaBoost on a collection of datasets from the UC-Irvine machine learning archive, plus a popular simulated dataset. The base learner is a tree in each case, with either 2 terminal nodes ("stumps") or 8 terminal nodes. For comparison, a single CART decision tree was also fit, with the tree size determined by 5-fold cross-validation.

The datasets are summarized in Table 1. The test error rates are shown in Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel, sonar, satimage and letter datasets come with a pre-specified test set. The waveform data is simulated, as described in Breiman et al. (1984). For the others, 5-fold cross-validation was used to estimate the test error.

It is difficult to discern trends on the small data sets (Table 2) because all but quite large observed differences in performance could be attributed to sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar, and waveform data, purely additive (two-terminal-node tree) models seem to perform comparably to the larger (eight-node) trees. The glass data seem to benefit a little from larger trees. There is no clear differentiation in performance among the boosting methods.

On the larger data sets (Table 3) clearer trends are discernible. For the satimage data the eight-node tree models are only slightly, but significantly, more accurate than the purely additive models. For the letter data there is no contest. Boosting stumps is clearly inadequate. There is no clear differentiation among the boosting methods for eight-node trees. For the stumps, LogitBoost, Real AdaBoost, and Gentle AdaBoost have comparable performance, distinctly superior to Discrete AdaBoost. This is consistent with the results of the simulation study (Section 5).

Table 1: Datasets used in the experiments

Data set        # Train    # Test       # Inputs   # Classes
vowel           528        462          10         11
breast cancer   699        5-fold CV    9          2
ionosphere      351        5-fold CV    34         2
glass           214        5-fold CV    10         7
sonar           210        5-fold CV    60         2
waveform        300        5000         21         3
satimage        4435       2000         36         6
letter          16000      4000         16         26

Except perhaps for Discrete AdaBoost, the real data examples fail to demonstrate performance differences between the various boosting methods. This is in contrast to the simulated data sets of Section 5, where LogitBoost generally dominated, although often by a small margin. The inability of the real data examples to discriminate may reflect statistical difficulties in estimating subtle differences with small samples. Alternatively, it may be that their underlying decision boundaries are all relatively simple (Holte 1993), so that all reasonable methods exhibit similar performance.

7 Additive Logistic Trees

In most applications of boosting the base classifier is considered to be a primitive, repeatedly called by the boosting procedure as iterations proceed. The operations performed by the base classifier are the same as they would be in any other context, given the same data and weights. The fact that the final model is going to be a linear combination of a large number of such classifiers is not taken into account. In particular, when using decision trees, the same tree growing and pruning algorithms are generally employed. Sometimes alterations are made (such as no pruning) for programming convenience and speed.

When boosting is viewed in the light of additive modeling, however, this greedy approach can be seen to be far from optimal in many situations. As discussed in Section 5, the goal of the final classifier is to produce an accurate approximation to the decision boundary function B(x). In the context of boosting, this goal applies to the final additive model, not to the individual terms (base classifiers) at the time they were constructed.

Table 2: Test error rates on small real examples

                        2 Terminal Nodes      8 Terminal Nodes
Method / Iterations      50    100    200      50    100    200

Vowel               CART error = .642
LogitBoost              .532   .524   .511     .517   .517   .517
Real AdaBoost           .565   .561   .548     .496   .496   .496
Gentle AdaBoost         .556   .571   .584     .515   .496   .496
Discrete AdaBoost       .563   .535   .563     .511   .500   .500

Breast              CART error = .045
LogitBoost              .028   .031   .029     .034   .038   .038
Real AdaBoost           .038   .038   .040     .032   .034   .034
Gentle AdaBoost         .037   .037   .041     .032   .031   .031
Discrete AdaBoost       .042   .040   .040     .032   .035   .037

Ion                 CART error = .076
LogitBoost              .074   .077   .071     .068   .063   .063
Real AdaBoost           .068   .066   .068     .054   .054   .054
Gentle AdaBoost         .085   .074   .077     .066   .063   .063
Discrete AdaBoost       .088   .080   .080     .068   .063   .063

Glass               CART error = .400
LogitBoost              .266   .257   .266     .243   .238   .238
Real AdaBoost           .276   .247   .257     .234   .234   .234
Gentle AdaBoost         .276   .261   .252     .219   .233   .238
Discrete AdaBoost       .285   .285   .271     .238   .234   .243

Sonar               CART error = .596
LogitBoost              .231   .231   .202     .163   .154   .154
Real AdaBoost           .183   .183   .173     .154   .154   .154
Gentle AdaBoost         .154   .163   .202     .173   .173   .173
Discrete AdaBoost       .154   .144   .183     .163   .144   .144

Waveform            CART error = .364
LogitBoost              .196   .195   .206     .192   .191   .191
Real AdaBoost           .193   .197   .195     .185   .182   .182
Gentle AdaBoost         .190   .188   .193     .185   .185   .186
Discrete AdaBoost       .188   .185   .191     .186   .183   .183

Table 3: Test error rates on larger data examples

                      Terminal         Iterations
Method                 Nodes     20     50    100    200    Fraction

Satimage            CART error = .148
LogitBoost               2      .140   .120   .112   .102
Real AdaBoost            2      .148   .126   .117   .119
Gentle AdaBoost          2      .148   .129   .119   .119
Discrete AdaBoost        2      .174   .156   .140   .128
LogitBoost               8      .096   .095   .092   .088
Real AdaBoost            8      .105   .102   .092   .091
Gentle AdaBoost          8      .106   .103   .095   .089
Discrete AdaBoost        8      .122   .107   .100   .099

Letter              CART error = .124
LogitBoost               2      .250   .182   .159   .145     .06
Real AdaBoost            2      .244   .181   .160   .150     .12
Gentle AdaBoost          2      .246   .187   .157   .145     .14
Discrete AdaBoost        2      .310   .226   .196   .185     .18
LogitBoost               8      .075   .047   .036   .033     .03
Real AdaBoost            8      .068   .041   .033   .032     .03
Gentle AdaBoost          8      .068   .040   .030   .028     .03
Discrete AdaBoost        8      .080   .045   .035   .029     .03

For example, it was seen in Section 5 that if B(x) was close to being additive in the original predictive features, then boosting stumps was optimal since it produced an approximation with the same structure. Building larger trees increased the error rate of the final model because the resulting approximation involved high order interactions among the features. The larger trees optimized the error rates of the individual base classifiers, given the weights at that step, and even produced lower unweighted error rates in the early stages. But, after a sufficient number of boosts, the stump-based model achieved superior performance.

More generally, one can consider an expansion of the decision boundary function in a functional ANOVA decomposition (Friedman 1991)

    B(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots        (43)

The first sum represents the closest function to B(x) that is additive in the original features, the first two sums represent the closest approximation involving at most two-feature interactions, the first three represent approximations with at most three-feature interactions, and so on. If B(x) can be accurately approximated by such an expansion, truncated at low interaction order, then allowing the base classifier to produce higher order interactions can reduce the accuracy of the final boosted model. In the context of decision trees, higher order interactions are produced by deeper trees.

In situations where the true underlying decision boundary function admits a low order ANOVA decomposition, one can take advantage of this structure to improve accuracy by restricting the depth of the base decision trees to be not much larger than the actual interaction order of B(x). Since this is not likely to be known in advance for any particular problem, this maximum depth becomes a "meta-parameter" of the procedure, to be estimated by some model selection technique such as cross-validation.

One can restrict the depth of an induced decision tree by using its standard pruning procedure, starting from the largest possible tree, but requiring it to delete enough splits to achieve the desired maximum depth. This can be computationally wasteful when this depth is small. The time required to build the tree is proportional to the depth of the largest possible tree before pruning. Therefore, dramatic computational savings can be achieved by simply stopping the growing process at the maximum depth, or alternatively at a maximum number of terminal nodes. The standard heuristic arguments in favor of growing large trees and then pruning do not apply in the context of boosting. Shortcomings in any individual tree can be compensated for by trees grown later in the boosting sequence.

If a truncation strategy based on the number of terminal nodes is to be employed, it is necessary to define an order in which splitting takes place. We adopt a "best-first" strategy. An optimal split is computed for each currently terminal node. The node whose split would achieve the greatest reduction in the tree building criterion is then actually split. This increases the number of terminal nodes by one. This continues until a maximum number M of terminal nodes is induced. Standard computational tricks can be employed so that inducing trees in this order requires no more computation than other orderings commonly used in decision tree induction.
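A minimal, unoptimized sketch of this best-first growing strategy for a weighted least-squares regression tree (the kind of fit used inside LogitBoost-style steps). The function names and the exhaustive threshold search are ours, standing in for the standard sorted-feature tricks mentioned above.

import numpy as np

def _best_split(X, y, w, idx):
    """Best weighted least-squares split of the observations in idx, as
    (reduction in weighted SSE, feature, threshold), or None if no split exists."""
    y_i, w_i = y[idx], w[idx]
    parent_sse = np.sum(w_i * (y_i - np.average(y_i, weights=w_i)) ** 2)
    best = None
    for j in range(X.shape[1]):
        x = X[idx, j]
        for t in np.unique(x)[:-1]:            # candidate thresholds
            left = x <= t
            sse = 0.0
            for mask in (left, ~left):
                yw, ww = y_i[mask], w_i[mask]
                sse += np.sum(ww * (yw - np.average(yw, weights=ww)) ** 2)
            gain = parent_sse - sse
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def best_first_tree(X, y, w, max_leaves):
    """Repeatedly split the terminal node whose best split most reduces the
    weighted squared error, until max_leaves is reached.
    Returns a list of leaves as (observation indices, fitted value)."""
    leaves = [np.arange(len(y))]               # start with the root node
    while len(leaves) < max_leaves:
        candidates = [(_best_split(X, y, w, idx), k)
                      for k, idx in enumerate(leaves)]
        candidates = [(c, k) for c, k in candidates if c is not None]
        if not candidates:
            break
        (gain, j, t), k = max(candidates, key=lambda c: c[0][0])
        idx = leaves.pop(k)                    # split the chosen leaf in two
        leaves.append(idx[X[idx, j] <= t])
        leaves.append(idx[X[idx, j] > t])
    return [(idx, np.average(y[idx], weights=w[idx])) for idx in leaves]

In practice the split search for each leaf is computed once and cached, so growing to M leaves in this order costs no more than depth-first growth.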

The truncation limit M is applied to all trees in the boosting sequence. It is thus a meta-parameter of the entire boosting procedure. An optimal value can be estimated through standard model selection techniques, such as minimizing the cross-validated error rate of the final boosted model.

Figure 5: Coordinate functions for the additive logistic tree obtained by boosting with stumps, for the two-class nested sphere example from Section 5. [Ten panels, one for each coordinate function f(x1), ..., f(x10).]

We refer to this combination of truncated best-first trees, with boosting, as "additive logistic trees" (ALT). This is the procedure used in all of the simulated

and real examples. One can compare results on the latter (Tables 2 and 3) to corresponding results reported by Dietterich (1998, Table 1) on common data sets. Error rates achieved by ALT with very small truncation values are seen to compare quite favorably with other committee approaches using much larger trees at each boosting step. Even when error rates are the same, the computational savings associated with ALT can be quite important in data mining contexts where large data sets cause computation time to become an issue.

Another advantage of low order approximations is model visualization. In particular, for models additive in the original features (42), the contribution of each feature x_j can be viewed as a graph of g_j(x_j) (41) plotted against x_j. Figure 5 shows such plots for the ten features of the two-class nested spheres example of Fig. 3. The functions are shown for the first class, concentrated near the origin; the corresponding functions for the other class are the negatives of these functions.

The plots in Fig. 5 clearly show that the contribution to the log-odds of each individual feature, for given values of the other features, is approximately quadratic, except perhaps at the extreme values where there is little data. They indicate that observations with feature values closer to the origin have increased odds of being from the first class. Their sum is clearly indicative of the additive quadratic nature of the decision boundary (38) and (39).
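A display such as Fig. 5 can be assembled directly from a fitted stump model. The sketch below assumes the boosted model is available as a list of stumps, each recorded as a feature index, a threshold, the two fitted values, and a multiplier; this representation and the function name are ours, chosen only for illustration.

import numpy as np
from collections import namedtuple

# one term of the additive expansion: c * (left if x[j] <= t else right)
Stump = namedtuple("Stump", "j t left right c")

def coordinate_function(stumps, j, grid):
    """Sum the contributions of all stumps that split on feature j, evaluated
    on a grid of values for x_j; this is g_j(x_j) up to an additive constant."""
    g = np.zeros_like(grid, dtype=float)
    for s in stumps:
        if s.j == j:
            g += s.c * np.where(grid <= s.t, s.left, s.right)
    return g - g.mean()        # center, since the constant is not identified

# usage (hypothetical): plot coordinate_function(fitted_stumps, j, grid)
# against grid for j = 0, ..., 9 to reproduce a display like Fig. 5.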

When there are more than two classes, plots similar to Fig. 5 can be made for each class and analogously interpreted. Higher order interaction models are more difficult to visualize. If there are at most two-feature interactions, the two-variable contributions can be visualized using contour or perspective mesh plots. Beyond two-feature interactions, visualization techniques are even less effective. Even when non-interaction (stump) models do not achieve the highest accuracy, they can be very useful as descriptive statistics, owing to the interpretability of the resulting model.

8 Weight trimming

In this section we propose a simple idea and show that it can dramatically reduce computation for boosted models without sacrificing accuracy. Despite its apparent simplicity, this approach does not appear to be in common use. At each boosting iteration there is a distribution of weights over the training sample. As iterations proceed this distribution tends to become highly skewed towards smaller weight values. A larger fraction of the training sample becomes correctly classified with increasing confidence, thereby receiving smaller weights. Observations with very low relative weight have little impact on the training of the base classifier; only those that carry the dominant proportion of the weight mass are influential. The fraction of such high weight observations can become very small in later iterations. This suggests that at any iteration one can simply delete from the training sample the large fraction of observations with very low weight without having much effect on the resulting induced classifier. However, computation is reduced since it tends to be proportional to the size of the training sample, regardless of weights.

At each boosting iteration, training observations i whose weight w_i is less than a threshold, w_i < t(α), are not used to train the classifier. We take the value of t(α) to be the α-th quantile of the weight distribution over the training data at the corresponding iteration. That is, only those observations that carry the fraction 1 − α of the total weight mass are used for training. Typically α ∈ [0.01, 0.1], so that the data used for training carries from 90 to 99 percent of the total weight mass. Note that the weights for all training observations are recomputed at each iteration. Observations deleted at a particular iteration may therefore re-enter at later iterations if

their weights subsequently increase relative to other observations.
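A minimal sketch of one way to implement this trimming rule, dropping the observations that make up the lowest fraction α of the total weight mass (the function name and the numpy dependency are ours):

import numpy as np

def trimmed_indices(w, alpha=0.1):
    """Indices of the observations kept for training: the highest-weight
    observations carrying a fraction 1 - alpha of the total weight mass."""
    order = np.argsort(w)                  # ascending weight
    cum = np.cumsum(w[order]) / np.sum(w)  # cumulative weight mass
    keep = order[cum > alpha]              # drop the lowest alpha of the mass
    return np.sort(keep)

# at each boosting iteration:
#   idx = trimmed_indices(w, alpha=0.1)
#   fit the base classifier to (X[idx], y[idx]) with weights w[idx];
#   all N weights are still updated afterwards, so deleted observations
#   can re-enter at later iterations.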

Figure 6: The left panel shows the test error for the letter recognition problem as a function of iteration number; the black solid curve uses all the training data, the blue dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used by both approaches. [Left panel: test error (0.1 to 0.3) versus number of terms (0 to 400); right panel: percent of training data used (0 to 100) versus number of terms (0 to 400).]

Figure 6 [left panel] shows the test-error rate as a function of iteration number for the letter recognition problem described in Section 6, here using Gentle AdaBoost and eight-node trees as the base classifier. Two error rate curves are shown. The black solid one represents using the full training sample at each iteration (α = 0), whereas the blue dashed curve represents the corresponding error rate for α = 0.1. The two curves track each other very closely, especially at the later iterations. Figure 6 [right panel] shows the corresponding fraction of observations used to train the base classifier as a function of iteration number. Here the two curves are not similar. With α = 0.1 the number of observations used for training drops very rapidly, reaching roughly 5% of the total at 20 iterations. By 50 iterations it is down to about 3%, where it stays throughout the rest of the boosting procedure. Thus, computation is reduced by over a factor of 30 with no apparent loss in classification accuracy. The reason why the sample size decreases for α = 0 after 150 iterations is that if all of the observations in a particular class are classified correctly with very high confidence (F_k > 15 + log(N)), training for that class stops, and continues only for the remaining classes. At 200 iterations, 10 of the original 26 classes remained.

The last column of Table 3, labeled fraction, shows the average fraction of observations used in training the base classifiers over the 200 iterations, for all boosting methods and tree sizes. For eight-node trees, all methods behave as shown in Fig. 6. With stumps, LogitBoost uses considerably less data than the others and is thereby correspondingly faster.

This is a genuine property of LogitBoost that sometimes gives it an advantage with weight trimming. Unlike the other methods, the LogitBoost weights w_i = p_i(1 − p_i) do not in any way involve the class outputs y_i; they simply measure nearness to the currently estimated decision boundary F_M(x) = 0. Discarding small weights thus retains only those training observations that are estimated to be close to the boundary. For the other three procedures the weight is monotone in −y_i F_M(x_i). This gives highest weight to currently misclassified training observations, especially those far from the boundary. If after trimming the fraction of observations remaining is less than the error rate, the subsample passed to the base learner will be highly unbalanced, containing very few correctly classified observations. This imbalance seems to inhibit learning. No such imbalance occurs with LogitBoost since, near the decision boundary, correctly classified and misclassified observations appear in roughly equal numbers.
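The contrast can be seen directly from the two weight formulas. Below is a small numerical illustration of our own, assuming the half log-odds convention F(x) = (1/2) log(p(x)/(1 − p(x))) used earlier in the paper, so that p = 1/(1 + e^{-2F}); the exponential form exp(−yF) stands in for the AdaBoost-family weights, which are monotone in −yF(x).

import numpy as np

def logitboost_weights(F):
    """w = p(1 - p): largest where p is near 1/2, i.e. near F(x) = 0,
    regardless of the observed class label."""
    p = 1.0 / (1.0 + np.exp(-2.0 * F))
    return p * (1.0 - p)

def adaboost_weights(F, y):
    """Exponential-criterion weights, monotone in -yF(x): largest for
    badly misclassified observations far from the boundary."""
    return np.exp(-y * F)

F = np.array([-3.0, -0.1, 0.1, 3.0])
print(logitboost_weights(F))             # large only for the two points near F = 0
print(adaboost_weights(F,  np.ones(4)))  # with y = +1: huge weight on F = -3
print(adaboost_weights(F, -np.ones(4)))  # with y = -1: huge weight on F = +3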

As this example illustrates, very large reductions in computation for boosting can be achieved by this simple trick. A variety of other examples (not shown) exhibit similar behavior with all boosting methods. Note that other committee approaches to classification, such as bagging (Breiman 1996) and randomized trees (Dietterich 1998), while admitting parallel implementations, cannot take advantage of this approach to reduce computation.

9 Concluding remarks

In order to understand a learning procedure statistically it is necessary to identify two important aspects: its structural model and its error model. The former is most important since it determines the function (concept) space of the approximator, thereby characterizing the class of concepts that can be accurately approximated (learned) with it. The error model specifies the distribution of random departures of sampled data from the structural model. It thereby defines the criterion to be optimized in the estimation of the structural model.

We have shown that the structural model for boosting is additive on the logistic scale with the base (weak) learner as basis elements. This understanding alone explains many of the properties of boosting. It is no surprise that a large number of such (jointly optimized) basis elements defines a much richer class of learners than one of them alone. It reveals that in the context of boosting all base (weak) learners are not equivalent, and there is no universally best choice over all situations. As illustrated in Section 5, the base learners need to be chosen so that the resulting additive expansion matches the particular decision boundary encountered. Even in the limited context of boosting decision trees, the interaction order, as characterized by the number of terminal nodes, needs to be chosen with care. Purely additive models induced by decision stumps are sometimes, but not always, the best. However, we conjecture that boundaries involving very high order interactions are likely to be encountered rarely in practice. This motivates our additive logistic trees (ALT) procedure described in Section 7.

The error model for two-class boosting is the obvious one for binary variables, namely the Bernoulli distribution. We showed that the AdaBoost procedures maximize a criterion that is closely related to expected log-Bernoulli likelihood, having the identical solution in the distributional (L_2) limit of infinite data. We derived a more direct procedure for maximizing this log-likelihood (LogitBoost) and showed that it exhibits properties nearly identical to those of Real AdaBoost.

In the multi-class case, the AdaBoost procedures maximize a separate Bernoulli likelihood for each class versus the others. This is a natural choice and is especially appropriate when observations can belong to more than one class (Schapire & Singer 1998). In the more usual setting of a unique class label for each observation, the symmetric multinomial distribution is a more appropriate error model. We developed a multi-class LogitBoost procedure that maximizes the corresponding log-likelihood by quasi-Newton stepping. We showed through simulated examples that there exist settings where this approach leads to superior performance, although none of these situations seems to have been encountered in the set of real data examples used for illustration; the two approaches had quite similar performance over these examples.

The concepts developed in this paper suggest that there is very little, if any, connection between (deterministic) weighted boosting and other (randomized) ensemble methods such as bagging (Breiman 1996) and randomized trees (Dietterich 1998). In the language of least squares regression, the latter are purely "variance" reducing procedures intended to mitigate instability, especially that associated with (large) decision trees. Boosting, on the other hand, seems fundamentally different. It appears to be a purely "bias" reducing procedure, intended to increase the flexibility of stable (highly biased) weak learners by incorporating them in a jointly fitted additive expansion.

The distinction becomes less clear when boosting is implemented by finite random importance sampling instead of weights. The advantages and disadvantages of introducing randomization into boosting by drawing finite samples are not clear. If there turns out to be an advantage to randomization in some situations, then the degree of randomization, as reflected by the sample size, is an open question. It is not obvious that the common choice of using the size of the original training sample is optimal in all (or any) situations.

One fascinating issue not covered in this paper is the fact that boosting, whatever the flavor, seldom seems to overfit, no matter how many terms are included in the additive expansion. Some possible explanations are:

- As the LogitBoost iterations proceed, the overall impact of the changes introduced by f_m(x) decreases. Only observations with appreciable weight, namely those near the decision boundary, determine the new functions. By definition these observations have F(x) near zero and can be affected by changes, while those in pure regions have large values of |F(x)| and are less likely to be modified.

- The stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus the models have far lower variance than the full parameterization might suggest. In the machine learning literature this is explained in terms of the VC dimension of the ensemble compared to that of each weak learner.

- Classifiers are hurt less by overfitting than other function estimators (e.g. the famous risk bound of the 1-nearest-neighbor classifier (Cover & Hart 1967)).

Whatever the explanation, the empirical evidence is strong; the introduction of boosting by Schapire, Freund and colleagues has brought an exciting and important set of new ideas to the table.

Acknowledgements

We thank Andreas Buja for alerting us to the recent work on text classification at AT&T laboratories, and Bogdan Popescu for illuminating discussions on PAC learning theory. Jerome Friedman was partially supported by the Department of Energy under contract number DE-AC03-76SF00515 and by grant DMS-9764431 of the National Science Foundation. Trevor Hastie was partially supported by grants DMS-9504495 and DMS-9803645 from the National Science Foundation, and grant ROI-CA-72028-01 from the National Institutes of Health. Robert Tibshirani was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Breiman, L. (1996), 'Bagging predictors', Machine Learning 26.

Breiman, L. (1997), Prediction games and arcing algorithms, Technical Report 504, Statistics Department, University of California, Berkeley. Submitted to Neural Computing.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and Regression Trees, Wadsworth, Belmont, California.

Buja, A., Hastie, T. & Tibshirani, R. (1989), 'Linear smoothers and additive models (with discussion)', Annals of Statistics 17, 453-555.

Cover, T. & Hart, P. (1967), 'Nearest neighbor pattern classification', Proc. IEEE Trans. Inform. Theory pp. 21-27.

Dietterich, T. (1998), 'An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization', Machine Learning ?, 1-22.

Freund, Y. (1995), 'Boosting a weak learning algorithm by majority', Information and Computation 121(2), 256-285.

Freund, Y. & Schapire, R. (1996), Experiments with a new boosting algorithm, in 'Machine Learning: Proceedings of the Thirteenth International Conference', pp. 148-156.

Friedman, J. (1991), 'Multivariate adaptive regression splines (with discussion)', Annals of Statistics 19(1), 1-141.

Friedman, J. (1996), Another approach to polychotomous classification, Technical report, Stanford University.

Friedman, J. & Stuetzle, W. (1981), 'Projection pursuit regression', Journal of the American Statistical Association 76, 817-823.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Hastie, T. & Tibshirani, R. (1998), 'Classification by pairwise coupling', Annals of Statistics. (To appear.)

Hastie, T., Tibshirani, R. & Buja, A. (1994), 'Flexible discriminant analysis by optimal scoring', Journal of the American Statistical Association 89, 1255-1270.

Holte, R. (1993), 'Very simple classification rules perform well on most commonly used datasets', Machine Learning 11, 63-90.

Kearns, M. & Vazirani, U. (1994), An Introduction to Computational Learning Theory, MIT Press.

Mallat, S. & Zhang, Z. (1993), 'Matching pursuits with time-frequency dictionaries', IEEE Transactions on Signal Processing 41, 3397-3415.

McCullagh, P. & Nelder, J. (1989), Generalized Linear Models, Chapman and Hall.

Schapire, R. (1990), 'The strength of weak learnability', Machine Learning 5(2), 197-227.

Schapire, R. & Singer, Y. (1998), Improved boosting algorithms using confidence-rated predictions, in 'Proceedings of the Eleventh Annual Conference on Computational Learning Theory'.