Additive Logistic Regression: a Statistical View of Boosting

Jerome Friedman†    Trevor Hastie‡    Robert Tibshirani‡

August 1998. Revision April 1999. Second revision November 1999.

Abstract

Boosting (Schapire 1990, Freund & Schapire 1996b) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large scale applications.

Introduction

The starting point for this paper is an interesting procedure called "boosting", which is a way of combining the performance of many "weak" classifiers to produce a powerful "committee". Boosting was proposed in the Computational Learning Theory literature (Schapire 1990, Freund 1995, Freund & Schapire 1996b) and has since received much attention.

While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure (Freund & Schapire 1996b), which we call Discrete AdaBoost.¹ Here is a concise description of AdaBoost in the two-class classification setting. We have training data (x_1, y_1), ..., (x_N, y_N), with x_i a vector valued feature and y_i = -1 or 1.

-----
Department of Statistics, Sequoia Hall, Stanford University, Stanford, California; {jhf,hastie,tibs}@stat.stanford.edu.
† Stanford Linear Accelerator Center, Stanford, California.
‡ Division of Biostatistics, Department of Health Research and Policy, Stanford University, Stanford, California.
¹ Essentially the same as AdaBoost.M1 for binary data (Freund & Schapire 1996b).
-----

We define F(x) = \sum_{m=1}^{M} c_m f_m(x), where each f_m(x) is a classifier producing values -1 or 1 and the c_m are constants; the corresponding prediction is sign(F(x)). The AdaBoost procedure trains the classifiers f_m(x) on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. A detailed description of Discrete AdaBoost is given in the boxed display titled Algorithm 1.

Discrete AdaBoost (Freund & Schapire 1996b)

1. Start with weights w_i = 1/N, i = 1, 2, ..., N.

2. Repeat for m = 1, 2, ..., M:

   (a) Fit the classifier f_m(x) \in {-1, 1} using weights w_i on the training data.

   (b) Compute err_m = E_w[ 1_{(y \ne f_m(x))} ], and c_m = log( (1 - err_m)/err_m ).

   (c) Set w_i <- w_i exp[ c_m 1_{(y_i \ne f_m(x_i))} ], i = 1, 2, ..., N, and renormalize so that \sum_i w_i = 1.

3. Output the classifier sign[ \sum_{m=1}^{M} c_m f_m(x) ].

Algorithm 1. Discrete AdaBoost. E_w represents expectation over the training data with weights w = (w_1, w_2, ..., w_N), and 1_{(S)} is the indicator of the set S. At each iteration AdaBoost increases the weights of the observations misclassified by f_m(x) by a factor that depends on the weighted training error.
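To make the boxed display concrete, here is a minimal sketch of Algorithm 1 in Python. It assumes scikit-learn decision stumps (DecisionTreeClassifier with max_depth = 1) as the base classifier and a -1/+1 coded response; the clipping of the weighted error away from 0 and 1 is our own numerical safeguard, and the sketch is an illustration rather than the authors' implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def discrete_adaboost(X, y, M=100):
        """Algorithm 1 (Discrete AdaBoost) with decision stumps; y coded -1/+1."""
        N = X.shape[0]
        w = np.full(N, 1.0 / N)                      # step 1: uniform weights
        learners, coefs = [], []
        for _ in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)         # step 2(a): weighted fit
            miss = (stump.predict(X) != y).astype(float)
            err = np.clip(np.sum(w * miss), 1e-12, 1 - 1e-12)
            c = np.log((1 - err) / err)              # step 2(b)
            w = w * np.exp(c * miss)                 # step 2(c): upweight mistakes
            w /= w.sum()                             # renormalize
            learners.append(stump)
            coefs.append(c)
        def F(Xnew):
            return sum(c * h.predict(Xnew) for c, h in zip(coefs, learners))
        return lambda Xnew: np.sign(F(Xnew))         # step 3: weighted majority vote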

Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for f_m(x), and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (NIPS workshop, 1996) called AdaBoost with trees the "best off-the-shelf classifier in the world" (see also Breiman 1998b). Interestingly, in many examples the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, it seems that AdaBoost is resistant to overfitting.

Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see the section on additive logistic trees below). Included in the figure is the bagged tree (Breiman 1996), which averages trees grown on bootstrap resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.

Early versions of AdaBoost used a resampling scheme to implement step 2 of Algorithm 1, by weighted sampling from the training data. This suggested a connection with bagging, and that a major component of the success of boosting has to do with variance reduction.

However, boosting performs comparably well when:

- a weighted tree-growing algorithm is used in step 2(a) rather than weighted resampling, where each training observation is assigned its weight w_i. This removes the randomization component essential in bagging;

- "stumps" are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias. Bagging performs very poorly with stumps (Figure 1, top-right panel).

These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.

Figure 1: Test error for Bagging, Discrete AdaBoost and Real AdaBoost on a simulated two-class nested spheres problem (see the section on simulation studies below). The training data are drawn in ten dimensions and the Bayes error rate is zero. All trees are grown best-first without pruning. The left-most iteration corresponds to a single tree.

The base classifier in Discrete AdaBoost produces a classification rule f_m(x): X -> {-1, 1}, where X is the domain of the predictive features x. Freund & Schapire (1996b), Breiman and Schapire & Singer (1998) have suggested various modifications to improve the boosting algorithms. A generalization of Discrete AdaBoost appeared in Freund & Schapire (1996b), and was developed further in Schapire & Singer (1998), that uses real-valued "confidence-rated" predictions rather than the {-1, 1} of Discrete AdaBoost. The weak learner for this generalized boosting produces a mapping f_m(x): X -> R; the sign of f_m(x) gives the classification, and |f_m(x)| a measure of the "confidence" in the prediction. This real valued contribution is combined with the previous contributions with a multiplier c_m as before, and a slightly different recipe for c_m is provided.

We present a generalized version of AdaBoost, which we call Real AdaBoost in Algorithm 2, in which the weak learner returns a class probability estimate p_m(x) = P_w(y = 1 | x). The contribution to the final classifier is half the logit-transform of this probability estimate. One form of Schapire and Singer's generalized AdaBoost coincides with Real AdaBoost, in the special case where the weak learner is a decision tree. Real AdaBoost tends to perform the best in our simulated examples in Figure 1, especially with stumps, although we see that with larger trees Discrete AdaBoost eventually overtakes Real AdaBoost after enough iterations.

Real AdaBoost

1. Start with weights w_i = 1/N, i = 1, 2, ..., N.

2. Repeat for m = 1, 2, ..., M:

   (a) Fit the classifier to obtain a class probability estimate p_m(x) = P_w(y = 1 | x) \in [0, 1], using weights w_i on the training data.

   (b) Set f_m(x) <- (1/2) log[ p_m(x) / (1 - p_m(x)) ] \in R.

   (c) Set w_i <- w_i exp[ -y_i f_m(x_i) ], i = 1, 2, ..., N, and renormalize so that \sum_i w_i = 1.

3. Output the classifier sign[ \sum_{m=1}^{M} f_m(x) ].

Algorithm 2. The Real AdaBoost algorithm uses class probability estimates p_m(x) to construct real-valued contributions f_m(x).
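A matching sketch of Algorithm 2 follows, again with scikit-learn stumps; the weighted terminal-node proportions from predict_proba play the role of p_m(x), and the clipping of those probabilities away from 0 and 1 is our own safeguard (the instability of log-ratios in pure regions is discussed later in the paper).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def real_adaboost(X, y, M=100, eps=1e-5):
        """Algorithm 2 (Real AdaBoost) with decision stumps; y coded -1/+1."""
        N = X.shape[0]
        w = np.full(N, 1.0 / N)
        stumps = []
        def half_logit(stump, Xq):
            # p_m(x) = P_w(y = 1 | x), from weighted class proportions in the leaves
            p = stump.predict_proba(Xq)[:, list(stump.classes_).index(1)]
            p = np.clip(p, eps, 1 - eps)
            return 0.5 * np.log(p / (1 - p))         # step 2(b): half the logit
        for _ in range(M):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)         # step 2(a): weighted fit
            f = half_logit(stump, X)
            w = w * np.exp(-y * f)                   # step 2(c)
            w /= w.sum()
            stumps.append(stump)
        def F(Xnew):
            return sum(half_logit(s, Xnew) for s in stumps)
        return lambda Xnew: np.sign(F(Xnew))         # step 3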

In this paper we analyze the AdaBoost procedures from a statistical perspective. The main result of our paper rederives AdaBoost as a method for fitting an additive model \sum_m f_m(x) in a forward stagewise manner. This simple fact largely explains why it tends to outperform a single base learner. By fitting an additive model of different and potentially simple functions, it expands the class of functions that can be approximated.

Given this fact, Discrete and Real AdaBoost appear unnecessarily complicated. A much simpler way to fit an additive model would be to minimize squared-error loss E(y - \sum_m f_m(x))^2 in a forward stagewise manner. At the mth stage we fix f_1(x), ..., f_{m-1}(x) and minimize squared error to obtain f_m(x) = E(y - \sum_{j=1}^{m-1} f_j(x) | x). This is just "fitting of residuals", and is commonly used in linear regression and additive modeling (Hastie & Tibshirani 1990).

However, squared error loss is not a good choice for classification (see Figure 2 below), and hence fitting of residuals doesn't work very well in that case. We show that AdaBoost fits an additive model using a better loss function for classification. Specifically, we show that AdaBoost fits an additive logistic regression model, using a criterion similar to, but not the same as, the binomial log-likelihood. (If p(x) is the class probability, an additive logistic regression approximates log[ p(x)/(1 - p(x)) ] by an additive function \sum_m f_m(x).) We then go on to derive a new boosting procedure, "LogitBoost", that directly optimizes the binomial log-likelihood.

The original boosting techniques (Schapire 1990, Freund 1995) provably improved or "boosted" the performance of a single classifier by producing a "majority vote" of similar classifiers. These algorithms then evolved into more adaptive and practical versions such as AdaBoost, whose success was still explained in terms of boosting individual classifiers by a "weighted majority vote" or "weighted committee". We believe that this view, along with the appealing name "boosting" inherited by AdaBoost, may have led to some of the mystery about how and why the method works. As mentioned above, we instead view boosting as a technique for fitting an additive model.

The remainder of the paper is organized as follows. We first give a short history of the boosting idea and briefly review additive modeling. We then show how boosting can be viewed as an additive model estimator, and propose some new boosting methods for the two class case. The multiclass problem is studied next, followed by simulated and real data experiments. Our tree-growing implementation, using truncated best-first trees, is then described, as is weight trimming to speed up computation; we briefly describe generalizations of boosting, and end with a discussion.

A brief history of boosting

Schapire (1990) developed the first simple boosting procedure in the PAC-learning framework (Valiant 1984, Kearns & Vazirani 1994). Schapire showed that a weak learner could always improve its performance by training two additional classifiers on filtered versions of the input data stream. A weak learner is an algorithm for producing a two-class classifier with performance guaranteed (with high probability) to be significantly better than a coin-flip. After learning an initial classifier h_1 on the first N training points:

- h_2 is learned on a new sample of N points, half of which are misclassified by h_1;

- h_3 is learned on N points for which h_1 and h_2 disagree;

- the boosted classifier is h_B = Majority Vote(h_1, h_2, h_3).

Schapire's "Strength of Weak Learnability" theorem proves that h_B has improved performance over h_1.

Freund (1995) proposed a "boost by majority" variation which combined many weak learners simultaneously, and improved the performance of the simple boosting algorithm of Schapire. The theory supporting both of these algorithms requires the weak learner to produce a classifier with a fixed error rate. This led to the more adaptive and realistic AdaBoost (Freund & Schapire 1996b) and its offspring, where this assumption was dropped.

Freund & Schapire (1996b) and Schapire & Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory has evolved in the Computational Learning community, initially based on the concepts of PAC learning. Other theories attempting to explain boosting come from game theory (Freund & Schapire 1996a, Breiman 1997) and VC theory (Schapire, Freund, Bartlett & Lee 1998). The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice, boosting achieves results far more impressive than the bounds would imply.

Additive Models

We show in the next section that AdaBoost fits an additive model F(x) = \sum_{m=1}^{M} c_m f_m(x). We believe that viewing current boosting procedures as stagewise algorithms for fitting additive models goes a long way towards understanding their performance. Additive models have a long history in statistics, and so we first give some examples here.

Additive Regression Models

We initially focus on the regression problem, where the response y is quantitative, x and y have some joint distribution, and we are interested in modeling the mean E(y | x) = F(x). The additive model has the form

    F(x) = \sum_{j=1}^{p} f_j(x_j).

Here there is a separate function f_j(x_j) for each of the p input variables x_j. More generally, each component f_j is a function of a small, prespecified subset of the input variables. The backfitting algorithm (Friedman & Stuetzle 1981, Buja, Hastie & Tibshirani 1989) is a convenient modular "Gauss-Seidel" algorithm for fitting additive models. A backfitting update is

    f_j(x_j) <- E[ y - \sum_{k \ne j} f_k(x_k) | x_j ]    for j = 1, 2, ..., p, 1, 2, ....

Any method or algorithm for estimating a function of x_j can be used to obtain an estimate of the conditional expectation above; in particular, this can include nonparametric smoothing algorithms such as local regression or smoothing splines. In the right hand side, all the latest versions of the functions f_k are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of E(y - F(x))^2 (Buja et al. 1989).
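The backfitting update can be sketched in a few lines; here each conditional expectation E(. | x_j) is estimated with a univariate polynomial fit, purely for illustration, since the text allows any smoother to be plugged in.

    import numpy as np

    def backfit_additive(X, y, degree=3, n_cycles=20):
        """Backfitting for F(x) = alpha + sum_j f_j(x_j), with polynomial smoothers."""
        N, p = X.shape
        alpha = y.mean()
        coefs = [None] * p
        centers = np.zeros(p)
        fitted = np.zeros((N, p))                    # current f_j evaluated at the data
        for _ in range(n_cycles):                    # backfitting cycles
            for j in range(p):
                # partial residual: remove all components except f_j
                partial = y - alpha - fitted.sum(axis=1) + fitted[:, j]
                coefs[j] = np.polyfit(X[:, j], partial, degree)
                vals = np.polyval(coefs[j], X[:, j])
                centers[j] = vals.mean()             # center each f_j for identifiability
                fitted[:, j] = vals - centers[j]
        def predict(Xnew):
            out = np.full(Xnew.shape[0], alpha)
            for j in range(p):
                out += np.polyval(coefs[j], Xnew[:, j]) - centers[j]
            return out
        return predict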

Extended Additive Models

More generally, one can consider additive models whose elements {f_m(x)}_1^M are functions of potentially all of the input features x. Usually in this context the f_m(x) are taken to be simple functions characterized by a set of parameters \gamma and a multiplier \beta_m,

    f_m(x) = \beta_m b(x; \gamma_m).

The additive model then becomes

    F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m).

For example, in single hidden layer neural networks b(x; \gamma) = \sigma(\gamma^t x), where \sigma(.) is a sigmoid function and \gamma parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice, with \gamma parameterizing the location and scale shifts of a "mother" wavelet b(x). In these applications {b(x; \gamma_m)}_1^M are generally called "basis functions", since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm, with updates

    {\beta_m, \gamma_m} <- arg min_{\beta, \gamma} E[ y - \sum_{k \ne m} \beta_k b(x; \gamma_k) - \beta b(x; \gamma) ]^2

for m = 1, 2, ..., M in cycles until convergence. Alternatively, one can use a "greedy" forward stepwise approach

    {\beta_m, \gamma_m} <- arg min_{\beta, \gamma} E[ y - F_{m-1}(x) - \beta b(x; \gamma) ]^2

for m = 1, 2, ..., M, where {\beta_k, \gamma_k}_1^{m-1} are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat & Zhang (1993) in "matching pursuit", where the b(x; \gamma) are selected from an over-complete dictionary of wavelet bases. In the language of boosting, f(x) = \beta b(x; \gamma) would be called a "weak learner" and F_M(x) the "committee". If decision trees were used as the weak learner, the parameters \gamma would represent the splitting variables, split points, the constants in each terminal node, and the number of terminal nodes of each tree.

Note that the backfitting procedure, or its greedy cousin, only requires an algorithm for fitting a single weak learner to data. This base algorithm is simply applied repeatedly to modified versions of the original data

    y_m <- y - \sum_{k \ne m} f_k(x).

In the forward stepwise procedure, the modified output y_m at the mth iteration depends only on its value y_{m-1} and the solution f_{m-1}(x) at the previous iteration,

    y_m = y_{m-1} - f_{m-1}(x).

At each step m, the previous output values y_m are modified so that the previous model f_{m-1}(x) has no explanatory power on the new outputs y_m. One can therefore view this as a procedure for "boosting" a weak learner f(x) = \beta b(x; \gamma) to form a powerful committee F_M(x).
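The greedy forward stepwise recipe above is easy to state in code. The sketch below uses a regression stump (scikit-learn) as the weak learner b(x; \gamma) and least squares as the criterion; each stage simply fits the current residual, which is exactly the "fitting of residuals" view of boosting a weak learner into a committee.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def forward_stagewise(X, y, M=200):
        """Greedy forward stagewise additive fitting by least squares."""
        resid = y.astype(float).copy()               # y_1 = y
        learners = []
        for _ in range(M):
            b = DecisionTreeRegressor(max_depth=1)   # weak learner b(x; gamma_m)
            b.fit(X, resid)
            resid -= b.predict(X)                    # y_{m+1} = y_m - f_m(x)
            learners.append(b)
        def F(Xnew):                                 # the committee F_M(x)
            return sum(b.predict(Xnew) for b in learners)
        return F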

Classification problems

For the classification problem, we learn from Bayes theorem that all we need is P(y = j | x), the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that E(1_{[y=j]} | x) = P(y = j | x), where 1_{[y=j]} is the indicator variable representing class j. While this works fairly well in general, several problems have been noted (Hastie, Tibshirani & Buja 1994) for constrained regression methods. The estimates are typically not confined to [0, 1], and severe masking problems can occur when there are more than two classes. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman et al. (1984).

Logistic regression is a popular approach used in statistics for overcoming these problems. For a two class problem, an additive logistic model has the form

    log [ P(y = 1 | x) / P(y = -1 | x) ] = \sum_{m=1}^{M} f_m(x).

The monotone logit transformation on the left guarantees that for any values of F(x) = \sum_{m=1}^{M} f_m(x) \in R, the probability estimates lie in [0, 1]; inverting, we get

    p(x) = P(y = 1 | x) = e^{F(x)} / (1 + e^{F(x)}).

Here we have given a general additive form for F(x); special cases exist that are well known in statistics. In particular, linear logistic regression (McCullagh & Nelder 1989, for example) and additive logistic regression (Hastie & Tibshirani 1990) are popular. These models are usually fit by maximizing the binomial log-likelihood, and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.

A generalized version of backfitting, called "Local Scoring" in Hastie & Tibshirani (1990), can be used to fit the additive logistic model by maximum likelihood. Starting with guesses f_1(x_1), ..., f_p(x_p), F(x) = \sum_k f_k(x_k), and p(x) defined as above, we form the working response

    z = F(x) + (y* - p(x)) / ( p(x)(1 - p(x)) ),

where y* = (y + 1)/2 is the 0/1 coding of the response. We then apply backfitting to the response z, with observation weights p(x)(1 - p(x)), to obtain new f_k(x_k). This process is repeated until convergence. The forward stagewise version of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.

AdaBoost: an Additive Logistic Regression Model

In this section we show that the AdaBoost algorithms (Discrete and Real) can be interpreted as stagewise estimation procedures for fitting an additive logistic regression model. They optimize an exponential criterion which, to second order, is equivalent to the binomial log-likelihood criterion. We then propose a more standard likelihood-based boosting procedure.

An Exponential Criterion

Consider minimizing the criterion

    J(F) = E[ e^{-yF(x)} ]

for estimation of F(x). (Here E represents expectation; depending on the context, this may be a population expectation with respect to a probability distribution, or else a sample average. E_w indicates a weighted expectation.) Lemma 1 shows that the function F(x) that minimizes J(F) is the symmetric logistic transform of P(y = 1 | x).

Lemma 1. E[ e^{-yF(x)} ] is minimized at

    F(x) = (1/2) log [ P(y = 1 | x) / P(y = -1 | x) ].

Hence

    P(y = 1 | x) = e^{F(x)} / ( e^{-F(x)} + e^{F(x)} ),
    P(y = -1 | x) = e^{-F(x)} / ( e^{-F(x)} + e^{F(x)} ).

Proof. While E entails expectation over the joint distribution of y and x, it is sufficient to minimize the criterion conditional on x:

    E( e^{-yF(x)} | x ) = P(y = 1 | x) e^{-F(x)} + P(y = -1 | x) e^{F(x)},

    \partial E( e^{-yF(x)} | x ) / \partial F(x) = -P(y = 1 | x) e^{-F(x)} + P(y = -1 | x) e^{F(x)}.

The result follows by setting the derivative to zero.

This exponential criterion appeared in Schapire & Singer (1998), motivated as an upper bound on misclassification error. Breiman (1997) also used this criterion in his results on AdaBoost and prediction games. The usual logistic transform does not have the factor 1/2, as in Lemma 1; by multiplying the numerator and denominator there by e^{F(x)}, we get the usual logistic model

    p(x) = e^{2F(x)} / ( 1 + e^{2F(x)} ).

Hence the two models are equivalent up to a factor 2.

Corollary 1. If E is replaced by averages over regions of x where F(x) is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of y = 1 and y = -1.
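A quick numerical check of Lemma 1 (our own, not part of the paper): at a single x with P(y = 1 | x) = q, the conditional criterion q e^{-F} + (1 - q) e^{F} should be minimized at F = (1/2) log( q/(1 - q) ).

    import numpy as np

    q = 0.8                                          # P(y = 1 | x) at some fixed x
    F_grid = np.linspace(-3, 3, 2001)
    J = q * np.exp(-F_grid) + (1 - q) * np.exp(F_grid)
    print(F_grid[np.argmin(J)], 0.5 * np.log(q / (1 - q)))   # both about 0.693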

Results 1 and 2 show that both Discrete and Real AdaBoost, as well as the Generalized AdaBoost of Freund & Schapire (1996b), can be motivated as iterative algorithms for optimizing the (population based) exponential criterion. The results share the same format:

- given an imperfect F(x), an update F(x) + f(x) is proposed, based on the population version of the criterion;

- the update, which involves population conditional expectations, is imperfectly approximated for finite data sets by some restricted class of estimators, such as averages in terminal nodes of trees.

Hastie & Tibshirani (1990) use a similar derivation of the local scoring algorithm used in fitting generalized additive models. Many terms are typically required in practice, since at each stage the approximation to conditional expectation is rather crude. Because of Lemma 1, the resulting algorithms can be interpreted as a stagewise estimation procedure for fitting an additive logistic regression model. The derivations are sufficiently different to warrant separate treatment.

Result 1. The Discrete AdaBoost algorithm (population version) builds an additive logistic regression model via Newton-like updates for minimizing E[ e^{-yF(x)} ].

Derivation. Let J(F) = E[ e^{-yF(x)} ]. Suppose we have a current estimate F(x) and seek an improved estimate F(x) + cf(x). For fixed c (and x), we expand J(F(x) + cf(x)) to second order about f(x) = 0:

    J(F + cf) = E[ e^{-y(F(x) + cf(x))} ]
              \approx E[ e^{-yF(x)} ( 1 - ycf(x) + c^2 y^2 f(x)^2 / 2 ) ]
              = E[ e^{-yF(x)} ( 1 - ycf(x) + c^2 / 2 ) ],

since y^2 = 1 and f(x)^2 = 1. Minimizing pointwise with respect to f(x) \in {-1, 1}, we write

    f(x) = arg min_f E_w[ 1 - ycf(x) + c^2/2 | x ].

Here the notation E_w[ . | x ] refers to a weighted conditional expectation, where w = w(x, y) = e^{-yF(x)} and

    E_w[ g(x, y) | x ] := E[ w(x, y) g(x, y) | x ] / E[ w(x, y) | x ].

For c > 0, minimizing this expression is equivalent to maximizing

    E_w[ y f(x) ].

The solution is

    f(x) = 1   if E_w(y | x) = P_w(y = 1 | x) - P_w(y = -1 | x) > 0,
    f(x) = -1  otherwise.

Note that

    -E_w[ y f(x) ] = (1/2) E_w[ (y - f(x))^2 ] - 1,

again using f(x)^2 = y^2 = 1. Thus minimizing a quadratic approximation to the criterion leads to a weighted least-squares choice of f(x) \in {-1, 1}, and this constitutes the Newton-like step.

Given f(x) \in {-1, 1}, we can directly minimize J(F + cf) to determine c:

    c = arg min_c E_w[ e^{-cyf(x)} ] = (1/2) log [ (1 - err) / err ],

where err = E_w[ 1_{[y \ne f(x)]} ]. Note that c can be negative if the weak learner does worse than 50%, in which case it automatically reverses the polarity. Combining these steps, we get the update for F(x):

    F(x) <- F(x) + (1/2) log[ (1 - err)/err ] f(x).

In the next iteration the new contribution cf(x) to F(x) augments the weights:

    w(x, y) <- w(x, y) e^{-cf(x)y}.

Since -yf(x) = 2 * 1_{[y \ne f(x)]} - 1, we see that the update is equivalent to

    w(x, y) <- w(x, y) exp( log[ (1 - err)/err ] 1_{[y \ne f(x)]} )

(the constant factor e^{-c} multiplies every weight equally and is removed by the renormalization). Thus the function and weight updates are of an identical form to those used in Discrete AdaBoost.

This population version of AdaBoost translates naturally to a data version using trees. The weighted conditional expectation above is approximated by the terminal-node weighted averages in a tree. In particular, the weighted least-squares criterion is used to grow the tree-based classifier f(x), and given f(x), the constant c is based on the weighted training error.

Note that after each Newton step the weights change, and hence the tree configuration will change as well. This adds an adaptive twist to the data version of the Newton-like algorithm.

Parts of this derivation for AdaBoost can be found in Breiman (1997) and Schapire & Singer (1998), but without making the connection to additive logistic regression models.

Corollary 2. After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof. This follows by noting that the c that minimizes J(F + cf) satisfies

    \partial J(F + cf) / \partial c = -E[ e^{-y(F(x) + cf(x))} y f(x) ] = 0.

The result follows since yf(x) is 1 for a correct and -1 for an incorrect classification.

Schapire & Singer (1998) give the interpretation that the weights are updated to make the new weighted problem maximally difficult for the next weak learner.

The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier f(x) \in {-1, 1}. Result 1 requires minor modifications to accommodate f(x) \in R, as in the generalized AdaBoost algorithms (Freund & Schapire 1996b, Schapire & Singer 1998); the estimate for c_m differs. Fixing f, we see that the minimizer of J(F + cf) must satisfy

    E_w[ y f(x) e^{-cyf(x)} ] = 0.

If f is not discrete, this equation has no closed-form solution for c, and requires an iterative solution such as Newton-Raphson.

We now derive the Real AdaBoost algorithm, which uses weighted probability estimates to update the additive logistic model, rather than the classifications themselves. Again we derive the population updates, and then apply them to data by approximating conditional expectations by terminal-node averages in trees.

Result 2. The Real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of J(F) = E[ e^{-yF(x)} ].

Derivation. Suppose we have a current estimate F(x) and seek an improved estimate F(x) + f(x), by minimizing J(F(x) + f(x)) at each x:

    J(F(x) + f(x)) = E[ e^{-yF(x)} e^{-yf(x)} | x ]
                   = e^{-f(x)} E[ e^{-yF(x)} 1_{[y=1]} | x ] + e^{f(x)} E[ e^{-yF(x)} 1_{[y=-1]} | x ].

Dividing through by E[ e^{-yF(x)} | x ] and setting the derivative with respect to f(x) to zero, we get

    f(x) = (1/2) log [ E_w( 1_{[y=1]} | x ) / E_w( 1_{[y=-1]} | x ) ]
         = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ],

where w(x, y) = exp(-yF(x)). The weights get updated by

    w(x, y) <- w(x, y) e^{-yf(x)}.

The algorithm as presented would stop after one iteration. In practice we use crude approximations to conditional expectation, such as decision trees or other constrained models, and hence many steps are required.

Corollary 3. At the optimal F(x), the weighted conditional mean of y is 0.

Proof. If F(x) is optimal, we have

    \partial J(F(x)) / \partial F(x) = -E[ e^{-yF(x)} y ] = 0.

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function F, there is no further information about F in the weighted conditional distribution of y. If there is, we use it to update F.

At iteration M, in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

    F(x) = \sum_{m=1}^{M} f_m(x),

where each of the components is found in a greedy forward "stagewise" fashion, fixing the earlier components. Our term "stagewise" refers to a similar approach in statistics:

- variables are included sequentially in a stepwise regression;

- the coefficients of variables already included receive no further adjustment.

Why Ee^{-yF(x)}?

So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

- Schapire & Singer (1998) motivate e^{-yF(x)} as a differentiable upper-bound to misclassification error 1_{[yF < 0]} (see Figure 2);

- the AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let y* = (y + 1)/2, taking values 0, 1, and parametrize the binomial probabilities by

    p(x) = e^{F(x)} / ( e^{F(x)} + e^{-F(x)} ).

The binomial log-likelihood is

    l(y*, p(x)) = y* log p(x) + (1 - y*) log( 1 - p(x) )
                = -log( 1 + e^{-2yF(x)} ).

Figure 2: A variety of loss functions for estimating a function F(x) for classification. The horizontal axis is yF, which is negative for errors and positive for correct classifications. All the loss functions are monotone in yF, and are centered and scaled to match e^{-yF} at F = 0. The curve labeled "Log-likelihood" is the binomial log-likelihood or cross-entropy y* log p + (1 - y*) log(1 - p). The curve labeled "Squared Error(p)" is (y* - p)^2. The curve labeled "Squared Error(F)" is (y - F)^2, and increases once yF exceeds 1, thereby increasingly penalizing classifications that are "too correct".

Hence we see that:

- The population minimizers of -E l(y*, p(x)) and E e^{-yF(x)} coincide. This is easily seen because the expected log-likelihood is maximized at the true probabilities p(x) = P(y* = 1 | x), which define the logit F(x). By Lemma 1 we see that this is exactly the minimizer of E e^{-yF(x)}. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around F = 0:

      -l(y*, p) \approx exp(-yF) + log 2 - 1.

  Graphs of exp(-yF) and log( 1 + e^{-2yF} ) are shown in Figure 2, as a function of yF; positive values of yF imply correct classification. Note that exp(-yF) itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on {-1, 1}.

- There is another way to view the criterion J(F). It is easy to show that

      e^{-yF(x)} = |y* - p(x)| / sqrt( p(x)(1 - p(x)) ),

  with F(x) = (1/2) log[ p(x) / (1 - p(x)) ]. The right-hand side is known as the chi statistic in the statistical literature; chi-squared is a quadratic approximation to the log-likelihood, and so chi can be considered a "gentler" alternative.

One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).

Why not estimate the f_m by minimizing the squared error E(y - F(x))^2? If F_{m-1}(x) = \sum_{j=1}^{m-1} f_j(x) is the current prediction, this leads to a forward stagewise procedure that does an unweighted fit to the response y - F_{m-1}(x) at step m, as described earlier. Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the nonmonotonicity of squared error loss (Figure 2) is the reason: correct classifications, but with yF(x) > 1, incur increasing loss for increasing values of |F(x)|. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.
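The relations above are easy to verify numerically; the snippet below (our own check) confirms the second-order agreement of exp(-yF) with the negative binomial log-likelihood near F = 0, and the chi-statistic identity.

    import numpy as np

    # Near F = 0, log(1 + e^{-2yF}) agrees with e^{-yF} + log(2) - 1 to second order;
    # the differences below are O((yF)^3).
    yF = np.linspace(-0.3, 0.3, 7)
    print(np.log(1 + np.exp(-2 * yF)) - (np.exp(-yF) + np.log(2) - 1))

    # The chi identity e^{-yF} = |y* - p| / sqrt(p(1-p)), with F = 0.5*log(p/(1-p)).
    p = np.linspace(0.05, 0.95, 10)
    F = 0.5 * np.log(p / (1 - p))
    assert np.allclose(np.exp(-F), (1 - p) / np.sqrt(p * (1 - p)))   # y = +1, y* = 1
    assert np.allclose(np.exp(+F), p / np.sqrt(p * (1 - p)))         # y = -1, y* = 0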

Direct optimization of the binomial log-likelihood

In this section we explore algorithms for fitting additive logistic regression models by stagewise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case, and will use a 0/1 response y* to represent the outcome. We represent the probability of y* = 1 by p(x), where

    p(x) = e^{F(x)} / ( e^{F(x)} + e^{-F(x)} ).

Algorithm 3 gives the details.

Result 3. The LogitBoost algorithm (two classes, population version) uses Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Derivation. Consider the update F(x) + f(x), and the expected log-likelihood

    E l(F + f) = E[ 2y*(F(x) + f(x)) - log( 1 + e^{2(F(x) + f(x))} ) ].

Conditioning on x, we compute the first and second derivative at f(x) = 0:

    s(x) = \partial E l(F(x) + f(x)) / \partial f(x) |_{f(x)=0}
         = 2 E( y* - p(x) | x ),

    H(x) = \partial^2 E l(F(x) + f(x)) / \partial f(x)^2 |_{f(x)=0}
         = -4 E( p(x)(1 - p(x)) | x ),

LogitBoost (two classes)

1. Start with weights w_i = 1/N, i = 1, 2, ..., N, F(x) = 0 and probability estimates p(x_i) = 1/2.

2. Repeat for m = 1, 2, ..., M:

   (a) Compute the working response and weights

           z_i = ( y_i* - p(x_i) ) / ( p(x_i)(1 - p(x_i)) ),
           w_i = p(x_i)(1 - p(x_i)).

   (b) Fit the function f_m(x) by a weighted least-squares regression of z_i to x_i, using weights w_i.

   (c) Update F(x) <- F(x) + (1/2) f_m(x) and p(x) <- e^{F(x)} / ( e^{F(x)} + e^{-F(x)} ).

3. Output the classifier sign[ F(x) ] = sign[ \sum_{m=1}^{M} f_m(x) ].

Algorithm 3. An adaptive Newton algorithm for fitting an additive logistic regression model.

where p(x) is defined in terms of F(x) as above. The Newton update is then

    F(x) <- F(x) - H(x)^{-1} s(x)
          = F(x) + (1/2) E( y* - p(x) | x ) / E( p(x)(1 - p(x)) | x )
          = F(x) + (1/2) E_w[ ( y* - p(x) ) / ( p(x)(1 - p(x)) ) | x ],

where w(x) = p(x)(1 - p(x)). Equivalently, the Newton update f(x) solves the weighted least-squares approximation (about F(x)) to the log-likelihood

    min_{f(x)} E_{w(x)} [ F(x) + ( y* - p(x) ) / ( p(x)(1 - p(x)) ) - ( F(x) + f(x) ) ]^2.

The population algorithm described here translates immediately to an implementation on data, when E( . | x ) is replaced by a regression method such as regression trees (Breiman et al. 1984). While the role of the weights is somewhat artificial in the population case, it is not in any implementation; w(x) is constant when conditioned on x, but the w(x_i) in a terminal node of a tree, for example, depend on the current values F(x_i), and will typically not be constant.

Sometimes the w(x) get very small in regions of x perceived by F(x) to be pure, that is, when p(x) is close to 0 or 1. This can cause numerical problems in the construction of z, and led to the following crucial implementation protections:

- If y* = 1, then compute z = (y* - p)/(p(1 - p)) as 1/p. Since this number can get large if p is small, threshold this ratio at zmax. The particular value chosen for zmax is not crucial; we have found empirically that zmax in [2, 4] works well. Likewise, if y* = 0, compute z = -1/(1 - p), with a lower threshold of -zmax.

- Enforce a lower threshold on the weights: w = max(w, 2 * machine-zero).
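Putting Algorithm 3 and the two protections together, here is a minimal two-class LogitBoost sketch; scikit-learn regression trees stand in for the weighted least-squares learner, zmax = 4 is one value in the range suggested above, and the floor on the weights uses the smallest positive float as a stand-in for "machine zero".

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost(X, y, M=100, zmax=4.0):
        """Algorithm 3 (two-class LogitBoost); y coded -1/+1, y* = (y + 1)/2."""
        N = X.shape[0]
        ystar = (y + 1) / 2.0
        F = np.zeros(N)
        p = np.full(N, 0.5)
        learners = []
        tiny = np.finfo(float).tiny                  # stand-in for machine zero
        for _ in range(M):
            w = np.maximum(p * (1 - p), 2 * tiny)    # weights, bounded below
            z = np.clip((ystar - p) / w, -zmax, zmax)   # working response, thresholded
            h = DecisionTreeRegressor(max_depth=2)   # weighted least-squares learner
            h.fit(X, z, sample_weight=w)
            F = F + 0.5 * h.predict(X)               # F <- F + (1/2) f_m
            p = 1.0 / (1.0 + np.exp(-2 * F))         # p = e^F / (e^F + e^-F)
            learners.append(h)
        def score(Xnew):
            return 0.5 * sum(h.predict(Xnew) for h in learners)
        return lambda Xnew: np.sign(score(Xnew))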

Optimizing Ee^{-yF(x)} by Newton stepping

The population version of the Real AdaBoost procedure (Algorithm 2) optimizes E[ e^{-y(F(x) + f(x))} ] exactly with respect to f at each iteration. In Algorithm 4 we propose the "Gentle AdaBoost" procedure, which instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.

Gentle AdaBoost

1. Start with weights w_i = 1/N, i = 1, 2, ..., N, and F(x) = 0.

2. Repeat for m = 1, 2, ..., M:

   (a) Fit the regression function f_m(x) by weighted least-squares of y_i to x_i, with weights w_i.

   (b) Update F(x) <- F(x) + f_m(x).

   (c) Update w_i <- w_i e^{-y_i f_m(x_i)} and renormalize.

3. Output the classifier sign[ F(x) ] = sign[ \sum_{m=1}^{M} f_m(x) ].

Algorithm 4. A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
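A sketch of Algorithm 4, with small scikit-learn regression trees standing in for the weighted least-squares learner (the tree size here is an arbitrary illustrative choice):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gentle_adaboost(X, y, M=100):
        """Algorithm 4 (Gentle AdaBoost); y coded -1/+1."""
        N = X.shape[0]
        w = np.full(N, 1.0 / N)
        learners = []
        for _ in range(M):
            h = DecisionTreeRegressor(max_depth=2)   # weighted least-squares fit of y on x
            h.fit(X, y, sample_weight=w)
            f = h.predict(X)
            w = w * np.exp(-y * f)                   # step 2(c)
            w /= w.sum()
            learners.append(h)
        def F(Xnew):
            return sum(h.predict(Xnew) for h in learners)
        return lambda Xnew: np.sign(F(Xnew))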

Result 4. The Gentle AdaBoost algorithm (population version) uses Newton steps for minimizing E[ e^{-yF(x)} ].

Derivation.

    \partial J(F(x) + f(x)) / \partial f(x) |_{f(x)=0} = -E[ e^{-yF(x)} y | x ],

    \partial^2 J(F(x) + f(x)) / \partial f(x)^2 |_{f(x)=0} = E[ e^{-yF(x)} | x ]    (since y^2 = 1).

Hence the Newton update is

    F(x) <- F(x) + E[ e^{-yF(x)} y | x ] / E[ e^{-yF(x)} | x ]
          = F(x) + E_w( y | x ),

where w(x, y) = e^{-yF(x)}.

The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is

    f_m(x) = P_w(y = 1 | x) - P_w(y = -1 | x),

rather than half the log-ratio, as in Real AdaBoost:

    f_m(x) = (1/2) log [ P_w(y = 1 | x) / P_w(y = -1 | x) ].

Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range [-1, 1]. Empirical evidence suggests (see the simulation studies below) that this more conservative algorithm has similar performance to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.

There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let P = P(y = 1 | x), and p(x) = e^{F(x)} / ( e^{F(x)} + e^{-F(x)} ). Then

    E[ e^{-yF(x)} y | x ] / E[ e^{-yF(x)} | x ] = ( e^{-F(x)} P - e^{F(x)} (1 - P) ) / ( e^{-F(x)} P + e^{F(x)} (1 - P) )
                                                = ( P - p(x) ) / ( (1 - p(x)) P + p(x)(1 - P) ).

The analogous expression for LogitBoost, from the Newton update derived earlier, is

    (1/2) ( P - p(x) ) / ( p(x)(1 - p(x)) ).

At p(x) close to 1/2 these are nearly the same, but they differ as the p(x) become extreme. For example, if P is close to 1 and p(x) is close to 0, the LogitBoost update blows up, while the Gentle AdaBoost update is about 1 (and always falls in [-1, 1]).

Multiclass procedures

Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire & Singer (1998) define J responses y_j for a J class problem, each taking values in {-1, 1}. Similarly, the indicator response vector with elements y_j* is more standard in the statistics literature. Assume the classes are mutually exclusive.

Definition 1. For a J class problem, let p_j(x) = P(y_j = 1 | x). We define the symmetric multiple logistic transformation

    F_j(x) = log p_j(x) - (1/J) \sum_{k=1}^{J} log p_k(x).

Equivalently,

    p_j(x) = e^{F_j(x)} / \sum_{k=1}^{J} e^{F_k(x)},    with \sum_{k=1}^{J} F_k(x) = 0.

The centering condition \sum_k F_k(x) = 0 is for numerical stability only; it simply pins the F_j down, else we could add an arbitrary constant to each F_j and the probabilities would remain the same. The equivalence of these two definitions is easily established, as is the equivalence with the two-class case.
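The two directions of Definition 1 amount to a centered softmax; the following snippet (an illustration of the definition, nothing more) checks the equivalence on an arbitrary probability vector.

    import numpy as np

    def to_symmetric_logits(p):
        """F_j = log p_j - (1/J) sum_k log p_k, for rows of class probabilities p."""
        logp = np.log(p)
        return logp - logp.mean(axis=1, keepdims=True)

    def to_probabilities(F):
        """p_j = exp(F_j) / sum_k exp(F_k); adding a constant to every F_j changes nothing."""
        e = np.exp(F - F.max(axis=1, keepdims=True))   # shift for numerical stability
        return e / e.sum(axis=1, keepdims=True)

    p = np.array([[0.2, 0.3, 0.5]])
    F = to_symmetric_logits(p)
    assert np.allclose(F.sum(axis=1), 0.0)             # centering condition
    assert np.allclose(to_probabilities(F), p)         # the two definitions agree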

Schapire & Singer (1998) provide several generalizations of AdaBoost for the multiclass case, and also refer to other proposals (Freund & Schapire 1997, Schapire 1997); we describe their AdaBoost.MH algorithm (see the boxed Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable C. We make a few observations:

- The population version of this algorithm minimizes \sum_{j=1}^{J} E[ e^{-y_j F_j(x)} ], which is equivalent to running separate population boosting algorithms on each of the J problems of size N obtained by partitioning the N x J samples in the obvious fashion. This is seen trivially by first conditioning on C = j, and then on x | C = j, when computing conditional expectations.

- The same is almost true for a tree-based algorithm. We see this because:

AdaBoost.MH (Schapire & Singer 1998)

1. Expand the original N observations into N x J pairs ((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), ..., ((x_i, J), y_{iJ}), i = 1, ..., N. Here y_{ij} is the {-1, 1} response for class j and observation i.

2. Apply Real AdaBoost to the augmented dataset, producing a function F: X x (1, ..., J) -> R, F(x, j) = \sum_m f_m(x, j).

3. Output the classifier argmax_j F(x, j).

Algorithm 5. The AdaBoost.MH algorithm converts the J class problem into that of estimating a two class classifier on a training set J times as large, with an additional "feature" defined by the set of class labels.

  - If the first split is on C, either a J-nary split (if permitted) or else J - 1 binary splits, then the subtrees are identical to separate trees grown for each of the J groups. This will always be the case for the first tree.

  - If a tree does not split on C anywhere on the path to a terminal node, then that node returns a function f_m(x, j) = g_m(x) that contributes nothing to the classification decision. However, as long as a tree includes a split on C at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.
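For concreteness, step 1 of Algorithm 5 can be coded as follows (a sketch, with class labels assumed to be integers 0, ..., J - 1); Real AdaBoost, for instance the earlier sketch, would then be run on the augmented pair (X_aug, y_aug).

    import numpy as np

    def expand_for_adaboost_mh(X, labels, J):
        """Step 1 of Algorithm 5: build the N*J augmented training set.
        The appended column is the 'class' variable C; y_aug is +1 exactly
        when C matches the true label of the observation."""
        N = X.shape[0]
        blocks, responses = [], []
        for j in range(J):
            blocks.append(np.hstack([X, np.full((N, 1), j)]))
            responses.append(np.where(labels == j, 1, -1))
        return np.vstack(blocks), np.concatenate(responses)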

The advantage/disadvantage of building one large tree using the class label as an additional input feature is not clear; no motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building J separate trees to minimize \sum_{j=1}^{J} E[ e^{-y_j F_j(x)} ]. We have thus shown:

Result 5. The AdaBoost.MH algorithm for a J class problem fits J uncoupled additive logistic models, G_j(x) = (1/2) log[ p_j(x) / (1 - p_j(x)) ], each class against the rest.

In principle this parametrization is fine, since G_j(x) is monotone in p_j(x). However, we are estimating the G_j(x) in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.

Schapire and Singer's AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The "MH" represents "Multi-Label Hamming", Hamming loss being used to measure the errors in the space of 2^J possible class labels. In this context, fitting a separate classifier for each label is a reasonable strategy. However, Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.

Algorithm 6 is a natural generalization of Algorithm 3 for fitting the J class logistic regression model.

Result 6. The LogitBoost algorithm (J classes, population version) uses quasi-Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Derivation. We first give the score and Hessian for the population Newton algorithm corresponding to a standard multi-logit parametrization

    G_j(x) = log [ P(y_j* = 1 | x) / P(y_J* = 1 | x) ],    j = 1, ..., J - 1,

LogitBoost (J classes)

1. Start with weights w_{ij} = 1/N, i = 1, ..., N, j = 1, ..., J, F_j(x) = 0 and p_j(x) = 1/J for all j.

2. Repeat for m = 1, 2, ..., M:

   (a) Repeat for j = 1, ..., J:

       (i) Compute working responses and weights in the jth class:

               z_{ij} = ( y_{ij}* - p_j(x_i) ) / ( p_j(x_i)(1 - p_j(x_i)) ),
               w_{ij} = p_j(x_i)(1 - p_j(x_i)).

       (ii) Fit the function f_{mj}(x) by a weighted least-squares regression of z_{ij} to x_i, with weights w_{ij}.

   (b) Set f_{mj}(x) <- ((J - 1)/J) ( f_{mj}(x) - (1/J) \sum_{k=1}^{J} f_{mk}(x) ), and F_j(x) <- F_j(x) + f_{mj}(x).

   (c) Update p_j(x) via p_j(x) = e^{F_j(x)} / \sum_{k=1}^{J} e^{F_k(x)}.

3. Output the classifier argmax_j F_j(x).

Algorithm 6. An adaptive Newton algorithm for fitting an additive multiple logistic regression model.

with G_J(x) = 0; the choice of J for the base class is arbitrary. The expected conditional log-likelihood is

    E[ l(G + g) | x ] = \sum_{j=1}^{J-1} E(y_j* | x) ( G_j(x) + g_j(x) ) - log( 1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)} ),

    s_j(x) = E( y_j* - p_j(x) | x ),    j = 1, ..., J - 1,

    H_{j,k}(x) = -p_j(x) ( \delta_{jk} - p_k(x) ),    j, k = 1, ..., J - 1.

Our quasi-Newton update amounts to using a diagonal approximation to the Hessian, producing updates

    g_j(x) <- E( y_j* - p_j(x) | x ) / ( p_j(x)(1 - p_j(x)) ),    j = 1, ..., J - 1.

To convert to the symmetric parametrization, we would note that g_J = 0, and set f_j(x) = g_j(x) - (1/J) \sum_{k=1}^{J} g_k(x). However, this procedure could be applied using any class as the base, not just the Jth. By averaging over all choices for the base class, we get the update

    f_j(x) = ((J - 1)/J) ( E( y_j* - p_j(x) | x ) / ( p_j(x)(1 - p_j(x)) ) - (1/J) \sum_{k=1}^{J} E( y_k* - p_k(x) | x ) / ( p_k(x)(1 - p_k(x)) ) ).

For more rigid parametric models and full Newton stepping, this symmetrization would be redundant. With quasi-Newton steps and adaptive (tree based) models, the symmetrization removes the dependence on the choice of the base class.
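A compact data version of Algorithm 6 is sketched below, with scikit-learn regression trees as the per-class weighted least-squares learners and the same clipping protections used for the two-class algorithm; it is an illustration under those assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost_multiclass(X, labels, J, M=100, zmax=4.0):
        """Algorithm 6 (J-class LogitBoost); labels are integers 0, ..., J-1."""
        N = X.shape[0]
        Y = np.eye(J)[labels]                        # indicator responses y*_{ij}
        F = np.zeros((N, J))
        p = np.full((N, J), 1.0 / J)
        stages = []
        for _ in range(M):
            stage, G = [], np.zeros((N, J))
            for j in range(J):                       # step 2(a): one fit per class
                w = np.maximum(p[:, j] * (1 - p[:, j]), 1e-12)
                z = np.clip((Y[:, j] - p[:, j]) / w, -zmax, zmax)
                h = DecisionTreeRegressor(max_depth=2)
                h.fit(X, z, sample_weight=w)
                G[:, j] = h.predict(X)
                stage.append(h)
            # step 2(b): symmetrize f_j <- (J-1)/J * (f_j - mean_k f_k), then accumulate
            G = (J - 1.0) / J * (G - G.mean(axis=1, keepdims=True))
            F += G
            e = np.exp(F - F.max(axis=1, keepdims=True))
            p = e / e.sum(axis=1, keepdims=True)     # step 2(c)
            stages.append(stage)
        def predict(Xnew):
            Fn = np.zeros((Xnew.shape[0], J))
            for stage in stages:
                Gn = np.column_stack([h.predict(Xnew) for h in stage])
                Fn += (J - 1.0) / J * (Gn - Gn.mean(axis=1, keepdims=True))
            return Fn.argmax(axis=1)                 # step 3: arg max_j F_j(x)
        return predict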

Simulation studies

In this section the four flavors of boosting outlined above are applied to several artificially constructed problems. Comparisons based on real data are presented in the next section.

An advantage of comparisons made in a simulation setting is that all aspects of each example are known, including the Bayes error rate and the complexity of the decision boundary. In addition, the population expected error rates achieved by each of the respective methods can be estimated to arbitrary accuracy by averaging over a large number of different training and test data sets drawn from the population. The four boosting methods compared here are:

DAB: Discrete AdaBoost (Algorithm 1)
RAB: Real AdaBoost (Algorithm 2)
LB: LogitBoost (Algorithms 3 and 6)
GAB: Gentle AdaBoost (Algorithm 4)

DAB, RAB and GAB handle multiple classes using the AdaBoost.MH approach.

In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten input features for all examples are randomly drawn from a ten-dimensional standard normal distribution, x ~ N(0, I_{10}). For the first three examples, the decision boundaries separating successive classes are nested concentric ten-dimensional spheres, constructed by thresholding the squared radius from the origin

    r^2 = \sum_{j=1}^{10} x_j^2.

Each class C_k (1 <= k <= K) is defined as the subset of observations

    C_k = { x_i : t_{k-1} <= r_i^2 < t_k },

with t_0 = 0 and t_K = infinity. The thresholds {t_k}_1^{K-1} for each example were chosen so as to put approximately equal numbers of observations in each class. The training sample size is N = 1000 K, so that approximately 1000 training observations are in each class. An independently drawn test set of 10000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training/test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
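One way to generate data of this form (a sketch of our reading of the setup, using chi-squared quantiles of the squared radius to obtain approximately balanced classes, and assuming scipy for those quantiles):

    import numpy as np
    from scipy.stats import chi2

    def nested_spheres(K, n_per_class=1000, p=10, seed=None):
        """K-class nested spheres data: x ~ N(0, I_p); class k is the slab
        t_{k-1} <= r^2 < t_k, with thresholds at chi-squared(p) quantiles."""
        rng = np.random.default_rng(seed)
        N = K * n_per_class
        X = rng.standard_normal((N, p))
        r2 = (X ** 2).sum(axis=1)
        t = chi2.ppf(np.linspace(0.0, 1.0, K + 1), df=p)   # t_0 = 0, ..., t_K = inf
        labels = np.digitize(r2, t[1:-1])                  # class indices 0, ..., K-1
        return X, labels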

Figure 3 (top left) compares the four algorithms in the two-class (K = 2) case, using a two terminal-node decision tree ("stump") as the base classifier. Shown is error rate as a function of the number of boosting iterations. The upper (black) line represents DAB, and the other three nearly coincident lines are the other three methods (dotted red: RAB; short-dashed green: LB; long-dashed blue: GAB). Note that the somewhat erratic behavior of DAB, especially in the earlier iterations, is not due to statistical uncertainty. In the early iterations LB has a minuscule edge; after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here, with roughly twice the error rate at all iterations.

Figure 3: Test error curves for the simulation experiment with an additive decision boundary, as described in the text. In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.

Figure 3 (lower left) shows the corresponding results for three classes (K = 3), again with two terminal-node trees. Here the problem is more difficult, as represented by increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB, and the other three nearly coincident lines are the other three methods. The situation is somewhat different for larger numbers of classes. Figure 3 (lower right) shows results for K = 5, which are typical of what we see as K grows. As before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed: up to a moderate number of iterations they have roughly the same error rate, and for a while LB's error rates are slightly higher than the other two, but thereafter the error rate for LB continues to improve whereas those for RAB and GAB level off, decreasing much more slowly, so that by the end of the run LB's error rate is clearly the lowest. Speculation as to the reason for LB's performance gain in these situations is presented below.

In the above examples a stump was used as the base classifier. One might expect the use of larger trees to do better for these rather complex problems. Figure 3 (top right) shows results for the two-class problem, here boosting trees with eight terminal nodes. These results can be compared to those for stumps in Figure 3 (top left). Initially, error rates for boosting eight-node trees decrease much more rapidly than for stumps, with each successive iteration, for all methods. However, the error rates quickly level off, and improvement is very slow thereafter. The overall performance of DAB is much improved with the bigger trees, coming close to that of the other three methods. As before, RAB, GAB and LB exhibit nearly identical performance. Note that at each iteration the eight-node tree model consists of four times the number of additive terms as does the corresponding stump model. This is why the error rates decrease so much more rapidly in the early iterations. In terms of model complexity (and training time), an eight terminal-node tree model run for a given number of iterations is equivalent to a stump model run for four times as many iterations.

Comparing the top two panels in Figure 3, one sees that for RAB, GAB and LB the error rate using the bigger trees is in fact eventually higher than that for stumps, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. The Bayes decision boundary between two classes is the set

    { x : B(x) = log [ P(y = 1 | x) / P(y = -1 | x) ] = 0 },

or simply {x : B(x) = 0}. To approximate this set it is sufficient to estimate the logit B(x), or any monotone transformation of B(x), as closely as possible. As discussed above, boosting produces an additive logistic model whose component functions are represented by the base classifier. With stumps as the base classifier, each component function has the form

    f_m(x) = c_m^L 1_{[x_j <= t_m]} + c_m^R 1_{[x_j > t_m]} = f_m(x_j)

if the mth stump chose to split on coordinate j. Here t_m is the split-point, and c_m^L and c_m^R are the weighted means of the response in the left and right terminal nodes. Thus the model produced by boosting stumps is additive in the original features,

    F(x) = \sum_{j=1}^{p} g_j(x_j),

where g_j(x_j) adds together all those stumps involving x_j (and is 0 if none exist).

Examination of the nested spheres construction reveals that an optimal decision boundary for the above examples is also additive in the original features, with f_j(x_j) = x_j^2 + constant. Thus, in the context of decision trees, stumps are ideally matched to these problems; larger trees are not needed. However, boosting larger trees need not be counterproductive in this case if all of the splits in each individual tree are made on the same predictor variable. This would also produce an additive model in the original features. However, due to the forward greedy stagewise strategy used by boosting, this is not likely to happen if the decision boundary function involves more than one predictor; each individual tree will try to do its best to involve all of the important predictors. Owing to the nature of decision trees, this will produce models with interaction effects; most terms in the model will involve products in more than one variable. Such non-additive models are not as well suited for approximating truly additive decision boundaries, and this is reflected in increased error rate, as observed in Figure 3.

The above discussion also suggests that if the decision boundary separating pairs of classes were inherently non-additive in the predictors, then boosting stumps would be less advantageous than using larger trees. A tree with m terminal nodes can produce basis functions with a maximum interaction order of min(m - 1, p), where p is the number of predictor features. These higher order basis functions provide the possibility to more accurately estimate those decision boundaries B(x) with high order interactions. The purpose of the next example is to verify this intuition. There are two classes (K = 2) and 5000 training observations, with the {x_i} drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

    log [ Pr(y = 1 | x) / Pr(y = -1 | x) ] = 10 \sum_{j=1}^{6} x_j ( 1 + \sum_{l=1}^{6} (-1)^l x_l ).

Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables, involving all of them in second order interactions of equal strength. As in the above examples, test sets of 10000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.
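A sketch of this second generating process, as we read the reconstructed log-odds expression above (the sample size and coefficients are therefore assumptions carried over from that reading):

    import numpy as np

    def nonadditive_example(N=5000, p=10, seed=None):
        """Two-class data with log-odds 10 * sum_{j<=6} x_j * (1 + sum_{l<=6} (-1)^l x_l)."""
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((N, p))
        signs = np.array([(-1.0) ** l for l in range(1, 7)])   # (-1)^l for l = 1, ..., 6
        log_odds = 10.0 * X[:, :6].sum(axis=1) * (1.0 + X[:, :6] @ signs)
        prob = 1.0 / (1.0 + np.exp(-log_odds))                 # P(y = 1 | x)
        y = np.where(rng.random(N) < prob, 1, -1)
        return X, y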

Figure 4 (top left) shows test error rate as a function of iteration number for each of the four boosting methods, using stumps. As in the previous examples, RAB and GAB track each other very closely. DAB begins very slowly, being dominated by all of the others until late in the run, where it passes below RAB and GAB. LB mostly dominates, having the lowest error rate until the later iterations; at that point DAB catches up, and by the end of the run it may have a very slight edge. However, none of these boosting methods perform well with stumps on this problem, the best error rate being 0.35.

Figure 4 (top right) shows the corresponding plot when four terminal-node trees are boosted. Here there is a dramatic improvement with all of the four methods. For the first time there is some small differentiation between RAB and GAB. At nearly all iterations the performance ranking is LB best, followed by GAB, RAB and DAB, in order. Figure 4 (lower left) shows results when eight terminal-node trees are boosted. Here, error rates are generally further reduced, with LB improving the least but still dominating. The performance ranking among the other three methods changes with increasing iterations: DAB overtakes first RAB and then GAB, becoming fairly close to LB by the end of the run.

Figure 4: Test error curves for the simulation experiment with a non-additive decision boundary, as described in the text.

Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case: namely, when decision boundaries B(x) can be closely approximated by functions that are additive in the original predictor features. When higher order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, relative performance of these four methods depends on the problem at hand, in terms of the nature of the decision boundaries, the complexity of the base classifier, and the number of boosting iterations.

The superior performance of LB in Figure 3 (lower right) appears to be a consequence of the multi-class logistic model (Algorithm 6). All of the other methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building separate two-class models for each individual class against the pooled complement classes. Even if the decision boundaries separating all class pairs are relatively simple, pooling classes can produce complex decision boundaries that are difficult to approximate (Friedman 1996). By considering all of the classes simultaneously, the symmetric multi-class model is better able to take advantage of simple pairwise boundaries when they exist (Hastie & Tibshirani 1998). As noted above, the pairwise boundaries induced by the nested spheres construction are simple when viewed in the context of additive modeling, whereas the pooled boundaries are more complex; they cannot be well approximated by functions that are additive in the original predictor variables.

The decision boundaries associated with these examples were deliberately chosen to be geometrically complex, in an attempt to elicit performance differences among the methods being tested. Such complicated boundaries are not likely to often occur in practice. Many practical problems involve comparatively simple boundaries (Holte 1993); in such cases, performance differences will still be situation dependent, but correspondingly less pronounced.

Some experiments with Real World Data

In this section we show the results of running the four fitting methods, LogitBoost, Discrete AdaBoost, Real AdaBoost and Gentle AdaBoost, on a collection of datasets from the UC-Irvine machine learning archive, plus a popular simulated dataset. The base learner is a tree in each case, with either two or eight terminal nodes. For comparison, a single decision tree was also fit,* with the tree size determined by 5-fold cross-validation.

The datasets are summarized in Table 1. The test error rates are shown in Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel, sonar, satimage and letter datasets come with a pre-specified test set. The waveform data is simulated, as described in Breiman et al. (1984). For the others, 5-fold cross-validation was used to estimate the test error.

It is difficult to discern trends on the small data sets (Table 2), because all but quite large observed differences in performance could be attributed to sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar and waveform data, purely additive stump models seem to perform comparably to the larger (eight-node) trees. The glass data seems to benefit a little from larger trees. There is no clear differentiation in performance among the boosting methods.

On the larger data sets (Table 3), clearer trends are discernible. For the satimage data the eight-node tree models are only slightly, but significantly, more accurate than the purely additive models. For the letter data there is no contest: boosting stumps is clearly inadequate. There is no clear differentiation among the boosting methods for eight-node trees. For the stumps, LogitBoost, Real AdaBoost and Gentle AdaBoost have comparable performance, distinctly superior to Discrete AdaBoost. This is consistent with the results of the simulation study above.

-----
* Using the tree() function in S-PLUS.

Table 1: Datasets used in the experiments. The columns give, for each data set (vowel, breast cancer, ionosphere, glass, sonar, waveform, satimage and letter), the number of training cases, the test set (a pre-specified test set or 5-fold cross-validation), the number of input features, and the number of classes.

Except perhaps for Discrete AdaBoost, the real data examples fail to demonstrate performance differences between the various boosting methods. This is in contrast to the simulated data sets above, where LogitBoost generally dominated, although often by a small margin. The inability of the real data examples to discriminate may reflect statistical difficulties in estimating subtle differences with small samples. Alternatively, it may be that their underlying decision boundaries are all relatively simple (Holte 1993), so that all reasonable methods exhibit similar performance.

Additive Logistic Trees

In most applications of b o osting the base classier is considered to b e a primitive rep eatedly called

by the b o osting pro cedure as iterations pro ceed The op erations p erformed by the base classier are

the same as they would b e in anyothercontext given the same data and weights The fact that the

nal mo del is going to b e a linear combination of a large numb er of such classiers is not taken into

account In particular when using decision trees the same tree growing and pruning algorithms

are generally employed Sometimes alterations are made such as no pruning for programming

convenience and sp eed

When boosting is viewed in the light of additive modeling, however, this greedy approach can be seen to be far from optimal in many situations. As discussed in Section , the goal of the final classifier is to produce an accurate approximation to the decision boundary function B(x). In the context of boosting, this goal applies to the final additive model, not to the individual terms (base classifiers) at the time they were constructed. For example, it was seen in Section  that if B(x) was close to being additive in the original predictive features, then boosting stumps was optimal, since it produced an approximation with the same structure. Building larger trees increased the error rate of the final model, because the resulting approximation involved high order interactions among the features. The larger trees optimized error rates of the individual base classifiers, given the weights at that step, and even produced lower unweighted error rates in the early stages. But after a sufficient number of boosts, the stump-based model achieved superior performance.

More generally, one can consider an expansion of the decision boundary function in a functional ANOVA decomposition (Friedman)

B(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots

The first sum represents the closest function to B(x) that is additive in the original features.

Table : Test error rates on small real examples

Method               Terminal Nodes           Terminal Nodes
                     Iterations               Iterations
Vowel      CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Breast     CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Ion        CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Glass      CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Sonar      CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Waveform   CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost

Table : Test error rates on larger data examples

Method               Terminal Nodes    Iterations    Fraction
Satimage   CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
Letter     CART error
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost
  LogitBoost
  Real AdaBoost
  Gentle AdaBoost
  Discrete AdaBoost

The first two sums represent the closest approximation involving at most two-feature interactions, the first three represent three-feature interactions, and so on. If B(x) can be accurately approximated by such an expansion, truncated at low interaction order, then allowing the base classifier to produce higher order interactions can reduce the accuracy of the final boosted model. In the context of decision trees, higher order interactions are produced by deeper trees.

In situations where the true underlying decision boundary function admits a low order ANOVA decomposition, one can take advantage of this structure to improve accuracy by restricting the depth of the base decision trees to be not much larger than the actual interaction order of B(x). Since this is not likely to be known in advance for any particular problem, this maximum depth becomes a meta-parameter of the procedure, to be estimated by some model selection technique such as cross-validation.
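As a concrete illustration of treating the maximum tree size as a meta-parameter, the sketch below selects the number of terminal nodes by cross-validation. It uses scikit-learn's GradientBoostingClassifier as a stand-in for the boosting procedures discussed here, and the candidate sizes, simulated data and tuning values are illustrative assumptions rather than the settings used in the experiments.

    # Sketch: choosing the maximum tree size (interaction order) by cross-validation.
    # GradientBoostingClassifier stands in for the boosting procedures discussed
    # here; the candidate sizes and data are illustrative placeholders.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    candidate_leaves = [2, 4, 8]          # 2 = stumps, i.e. a purely additive fit
    scores = {}
    for J in candidate_leaves:
        clf = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=J,
                                         learning_rate=0.1, random_state=0)
        scores[J] = cross_val_score(clf, X, y, cv=5).mean()

    best_J = max(scores, key=scores.get)  # tree size acting as the meta-parameter
    print(scores, "-> selected terminal nodes:", best_J)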

One can restrict the depth of an induced decision tree by using its standard pruning procedure, starting from the largest possible tree, but requiring it to delete enough splits to achieve the desired maximum depth. This can be computationally wasteful when this depth is small. The time required to build the tree is proportional to the depth of the largest possible tree before pruning. Therefore, dramatic computational savings can be achieved by simply stopping the growing process at the maximum depth, or alternatively at a maximum number of terminal nodes. The standard heuristic arguments in favor of growing large trees and then pruning do not apply in the context of boosting: shortcomings in any individual tree can be compensated by trees grown later in the boosting sequence.

If a truncation strategy based on the number of terminal nodes is to be employed, it is necessary to define an order in which splitting takes place. We adopt a "best-first" strategy. An optimal split is computed for each currently terminal node. The node whose split would achieve the greatest reduction in the tree building criterion is then actually split, which increases the number of terminal nodes by one. This continues until a maximum number M of terminal nodes is induced. Standard computational tricks can be employed so that inducing trees in this order requires no more computation than other orderings commonly used in decision tree induction.
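A minimal sketch of this best-first growing strategy is given below, assuming a weighted squared-error criterion on a working response (as in a LogitBoost step). The helpers best_split and best_first_tree are hypothetical functions written for illustration; they are not the authors' implementation and omit the computational tricks mentioned above.

    # Sketch of best-first tree induction: repeatedly split the terminal node whose
    # best split most reduces the criterion, stopping at M terminal nodes.
    import heapq
    import numpy as np

    def best_split(X, z, w, idx):
        """Best single split of the observations in idx by weighted squared error."""
        def wss(rows):                     # weighted sum of squares about the mean
            ww = w[rows]
            return np.sum(ww * (z[rows] - np.average(z[rows], weights=ww)) ** 2)
        parent = wss(idx)
        best = (0.0, None, None, None, None)
        for j in range(X.shape[1]):
            for t in np.unique(X[idx, j])[:-1]:
                left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
                gain = parent - wss(left) - wss(right)
                if gain > best[0]:
                    best = (gain, j, t, left, right)
        return best

    def push(heap, count, X, z, w, node):
        split = best_split(X, z, w, node)
        heapq.heappush(heap, (-split[0], count, node, split))
        return count + 1

    def best_first_tree(X, z, w, M=8):
        """Return the terminal-node index sets of a best-first tree with <= M leaves."""
        heap, leaves = [], []
        count = push(heap, 0, X, z, w, np.arange(len(z)))
        n_leaves = 1
        while heap and n_leaves < M:
            _, _, node, (gain, j, t, left, right) = heapq.heappop(heap)
            if j is None:                  # no split improves the criterion: keep leaf
                leaves.append(node)
                continue
            count = push(heap, count, X, z, w, left)
            count = push(heap, count, X, z, w, right)
            n_leaves += 1
        leaves.extend(entry[2] for entry in heap)
        return leaves

    # Example: one best-first tree on random data; z and w play the role of the
    # working response and weights at a given boosting iteration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)); z = rng.normal(size=200); w = np.ones(200)
    print([len(leaf) for leaf in best_first_tree(X, z, w, M=8)])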

The truncation limit M is applied to all trees in the boosting sequence. It is thus a meta-parameter of the entire boosting procedure. An optimal value can be estimated through standard model selection techniques, such as minimizing the cross-validated error rate of the final boosted model. We refer to this combination of truncated best-first trees with boosting as additive logistic trees (ALT).

Figure : Coordinate functions for the additive logistic tree obtained by boosting (LogitBoost) with stumps, for the two-class nested sphere example from Section .

Best-first trees were used in all of the simulated and real examples. One can compare results on the latter (Tables  and ) to corresponding results reported by Dietterich on common data sets. Error rates achieved by ALT with very small truncation values are seen to compare quite favorably with other committee approaches using much larger trees at each boosting step. Even when error rates are the same, the computational savings associated with ALT can be quite important in data mining contexts, where large data sets cause computation time to become an issue.

Another advantage of low order approximations is model visualization. In particular, for models additive in the input features, the contribution of each feature x_j can be viewed as a graph of g_j(x_j) plotted against x_j. Figure  shows such plots for the ten features of the two-class nested spheres example of Figure . The functions are shown for the first class, concentrated near the origin; the corresponding functions for the other class are the negatives of these functions.

The plots in Figure  clearly show that the contribution to the log-odds of each individual feature is approximately quadratic, which matches the generating model.
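The sketch below illustrates how such coordinate functions can be assembled and plotted from a fitted stump model. The stumps list, representing each fitted stump by its splitting feature, threshold and the two fitted values, is an assumed toy representation, not the data structure used in the paper.

    # Sketch: visualizing the per-feature coordinate functions g_j(x_j) of a boosted
    # stump model.  `stumps` holds (feature, threshold, left_value, right_value)
    # tuples, one per boosting step (an assumed representation).
    import numpy as np
    import matplotlib.pyplot as plt

    def coordinate_function(stumps, j, grid):
        """Sum the contributions of all stumps that split on feature j."""
        g = np.zeros_like(grid)
        for feat, thresh, left_val, right_val in stumps:
            if feat == j:
                g += np.where(grid <= thresh, left_val, right_val)
        return g - g.mean()               # center each coordinate function

    # toy fitted model: two stumps on feature 0, one on feature 1 (placeholders)
    stumps = [(0, -0.5, -0.8, 0.3), (0, 0.6, 0.4, -0.7), (1, 0.0, -0.2, 0.2)]
    grid = np.linspace(-2, 2, 200)
    for j in range(2):
        plt.plot(grid, coordinate_function(stumps, j, grid), label=f"g_{j}(x_{j})")
    plt.legend(); plt.xlabel("x_j"); plt.ylabel("contribution to log-odds"); plt.show()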

When there are more than two classes, plots similar to Figure  can be made for each class and analogously interpreted. Higher order interaction models are more difficult to visualize. If there are at most two-feature interactions, the two-variable contributions can be visualized using contour or perspective mesh plots. Beyond two-feature interactions, visualization techniques are even less effective. Even when non-interaction (stump) models do not achieve the highest accuracy, they can be very useful as descriptive statistics, owing to the interpretability of the resulting model.

Weight trimming

In this section we propose a simple idea and show that it can dramatically reduce computation for boosted models without sacrificing accuracy. Despite its apparent simplicity, this approach does not appear to be in common use, although similar ideas have been proposed before (Schapire; Freund). At each boosting iteration there is a distribution of weights over the training sample. As iterations proceed, this distribution tends to become highly skewed towards smaller weight values. A larger fraction of the training sample becomes correctly classified with increasing confidence, thereby receiving smaller weights. Observations with very low relative weight have little impact on training of the base classifier; only those that carry the dominant proportion of the weight mass are influential. The fraction of such high-weight observations can become very small in later iterations. This suggests that at any iteration one can simply delete from the training sample the large fraction of observations with very low weight, without having much effect on the resulting induced classifier. However, computation is reduced, since it tends to be proportional to the size of the training sample, regardless of weights.

At each boosting iteration, training observations with weight w_i less than a threshold t are not used to train the classifier. We take t to be a low quantile of the weight distribution over the training data at the corresponding iteration, so that only those observations carrying the complementary, dominant fraction of the total weight mass are used for training; the retained data thus still carries nearly all of the total weight mass. Note that the weights for all training observations are recomputed at each iteration. Observations deleted at a particular iteration may therefore re-enter at later iterations if their weights subsequently increase relative to other observations.
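A minimal sketch of the trimming step at a single iteration is shown below. The helper trim_by_weight and the retained share keep_fraction are illustrative assumptions; keep_fraction plays the role of the dominant fraction of weight mass described above.

    # Sketch of weight trimming at one boosting iteration: keep only the observations
    # carrying a chosen share of the total weight mass.
    import numpy as np

    def trim_by_weight(w, keep_fraction=0.99):
        """Return indices of the observations carrying `keep_fraction` of the weight."""
        order = np.argsort(w)[::-1]                 # heaviest observations first
        cum = np.cumsum(w[order]) / w.sum()
        n_keep = np.searchsorted(cum, keep_fraction) + 1
        return order[:n_keep]

    rng = np.random.default_rng(0)
    w = rng.exponential(size=10_000) ** 3           # skewed weights, as in late iterations
    subset = trim_by_weight(w)
    print(f"training on {len(subset)} of {len(w)} observations "
          f"({w[subset].sum() / w.sum():.1%} of weight mass)")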

Figure  (left panel) shows the test error rate as a function of iteration number for the letter recognition problem described in Section , here using Gentle AdaBoost and eight-node trees as the base classifier. Two error rate curves are shown: the solid curve represents using the full training sample at each iteration, whereas the dashed curve represents the corresponding error rate with weight trimming. The two curves track each other very closely, especially at the later iterations. Figure  (right panel) shows the corresponding fraction of observations used to train the base classifier as a function of iteration number. Here the two curves are not similar: with trimming, the number of observations used for training drops very rapidly in the early iterations and then remains at a small fraction of the total throughout the rest of the boosting procedure. Thus computation is reduced by a large factor with no apparent loss in classification accuracy. The reason why the sample size decreases further after a number of iterations is that if all of the observations in a particular class are classified correctly with very high confidence (F_k large compared with log N), training for that class stops and continues only for the remaining classes; by the later iterations, only a subset of the original classes remained in training.

The last column, labeled "Fraction", in Table  for the letter recognition problem shows the average fraction of observations used in training the base classifiers over the iterations, for all boosting methods and tree sizes. For eight-node trees, all methods behave as shown in Figure . With stumps, LogitBoost uses considerably less data than the others and is thereby correspondingly faster.

This is a genuine property of LogitBoost that sometimes gives it an advantage with weight trimming. Unlike the other methods, the LogitBoost weights w_i = p_i(1 - p_i) do not in any way involve the class outputs y_i; they simply measure nearness to the currently estimated decision boundary F_M(x) = 0. Discarding small weights thus retains only those training observations that are estimated to be close to the boundary. For the other three procedures the weight is monotone increasing in -y_i F_M(x_i).

Figure : The left panel shows the test error for the letter recognition problem as a function of iteration number. The black solid curve uses all the training data; the red dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used for both approaches. The upper curve steps down because training can stop for an entire class if it is fit sufficiently well (see text).

This gives highest weight to currently misclassified training observations, especially those far from the boundary. If, after trimming, the fraction of observations remaining is less than the error rate, the subsample passed to the base learner will be highly unbalanced, containing very few correctly classified observations. This imbalance seems to inhibit learning. No such imbalance occurs with LogitBoost, since near the decision boundary correctly classified and misclassified observations appear in roughly equal numbers.
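The toy computation below contrasts the two weighting schemes on a handful of fitted values, to make the preceding point concrete: the LogitBoost weights p(1 - p) ignore the labels and peak at the boundary F = 0, while exponential-type weights grow with -yF and are dominated by misclassified points. The fitted values and flipped labels are arbitrary illustrative choices.

    # Toy contrast of the two weighting schemes discussed above.
    import numpy as np

    F = np.linspace(-3.5, 3.5, 8)                 # current fitted values F(x_i)
    flip = np.where(np.arange(8) % 3 == 0, -1, 1) # flip a few labels: misclassified points
    y = np.sign(F) * flip
    p = 1.0 / (1.0 + np.exp(-2.0 * F))            # p(x) on the symmetric logistic scale
    w_logit = p * (1 - p)                         # ignores y: peaks at the boundary F = 0
    w_exp = np.exp(-y * F)                        # monotone in -y*F: largest for errors
    for row in zip(F.round(2), y.astype(int), w_logit.round(4), w_exp.round(2)):
        print(row)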

As this example illustrates, very large reductions in computation for boosting can be achieved by this simple trick. A variety of other examples (not shown) exhibit similar behavior with all boosting methods. Note that other committee approaches to classification, such as bagging (Breiman) and randomized trees (Dietterich), while admitting parallel implementations, cannot take advantage of this approach to reduce computation.

Further generalizations of boosting

We have shown above that AdaBoost fits an additive model, optimizing a criterion similar to binomial log-likelihood, via an adaptive Newton method. This suggests ways in which the boosting paradigm may be generalized. First, the Newton step can be replaced by a gradient step, slowing down the fitting procedure. This can reduce susceptibility to overfitting and lead to improved performance. Second, any smooth loss function can be used: for regression, squared error is natural, leading to the "fitting of residuals" boosting algorithm mentioned in the introduction. But other loss functions might have benefits, for example tapered squared error based on Huber's robust influence function (Huber). The resulting procedure is a fast, convenient method for resistant fitting of additive models. Details of these generalizations may be found in Friedman.
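As a rough illustration of the gradient-step generalization with a robust loss, the sketch below fits a boosted regression model using scikit-learn's Huber loss option; this follows the same idea (each tree is fit to the negative gradient of the loss) but is not the specific procedure developed in Friedman. The simulated data and tuning values are placeholders.

    # Sketch: gradient boosting for regression with a robust (Huber) loss.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(500, 3))
    y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.3, size=500)
    y[::25] += 10.0                                    # a few gross outliers

    model = GradientBoostingRegressor(loss="huber", learning_rate=0.05,
                                      n_estimators=300, max_depth=2)
    model.fit(X, y)
    print("training R^2 with Huber loss:", round(model.score(X, y), 3))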

Concluding remarks

In order to understand a learning procedure statistically, it is necessary to identify two important aspects: its structural model and its error model. The former is most important, since it determines the function space of the approximator, thereby characterizing the class of functions or hypotheses that can be accurately approximated with it. The error model specifies the distribution of random departures of sampled data from the structural model. It thereby defines the criterion to be optimized in the estimation of the structural model.

We have shown that the structural model for boosting is additive on the logistic scale, with the base learner providing the additive components. This understanding alone explains many of the properties of boosting. It is no surprise that a large number of such (jointly optimized) components defines a much richer class of learners than one of them alone. It reveals that in the context of boosting all base learners are not equivalent, and there is no universally best choice over all situations. As illustrated in Section , the base learners need to be chosen so that the resulting additive expansion matches the particular decision boundary encountered. Even in the limited context of boosting decision trees, the interaction order, as characterized by the number of terminal nodes, needs to be chosen with care. Purely additive models induced by decision stumps are sometimes, but not always, the best. However, we conjecture that boundaries involving very high order interactions will rarely be encountered in practice. This motivates our additive logistic trees (ALT) procedure described in Section .

The error model for two-class boosting is the obvious one for binary variables, namely the Bernoulli distribution. We show that the AdaBoost procedures maximize a criterion that is closely related to expected log-Bernoulli likelihood, having the identical solution in the distributional limit of infinite data. We derived a more direct procedure for maximizing this log-likelihood (LogitBoost) and show that it exhibits properties nearly identical to those of Real AdaBoost.

In the multiclass case, the AdaBoost procedures maximize a separate Bernoulli likelihood for each class versus the others. This is a natural choice and is especially appropriate when observations can belong to more than one class (Schapire and Singer). In the more usual setting of a unique class label for each observation, the symmetric multinomial distribution is a more appropriate error model. We develop a multiclass LogitBoost procedure that maximizes the corresponding log-likelihood by quasi-Newton stepping. We show through simulated examples that there exist settings where this approach leads to superior performance, although none of these situations seems to have been encountered in the set of real data examples used for illustration, where both approaches had quite similar performance.

The concepts developed in this paper suggest that there is very little, if any, connection between deterministic weighted boosting and other (randomized) ensemble methods such as bagging (Breiman) and randomized trees (Dietterich). In the language of least squares regression, the latter are purely variance reducing procedures intended to mitigate instability, especially that associated with decision trees. Boosting, on the other hand, seems fundamentally different. It appears to be mainly a bias reducing procedure, intended to increase the flexibility of stable (highly biased) weak learners by incorporating them in a jointly fitted additive expansion.

The distinction becomes less clear (Breiman a) when boosting is implemented by finite weighted random sampling instead of weighted optimization. The advantages or disadvantages of introducing randomization into boosting by drawing finite samples are not clear. If there turns out to be an advantage with randomization in some situations, then the degree of randomization, as reflected by the sample size, is an open question. It is not obvious that the common choice of using the size of the original training sample is optimal in all (or any) situations.

One fascinating issue not covered in this paper is the fact that boosting, whatever flavor, seems resistant to overfitting. Some possible explanations are:

1. As the LogitBoost iterations proceed, the overall impact of changes introduced by f_m(x) reduces. Only observations with appreciable weight determine the new functions, namely those near the decision boundary. By definition these observations have F(x) near zero and can be affected by changes, while those in pure regions have large values of |F(x)| and are less likely to be modified.

2. The stagewise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus has far lower variance than the full parameterization might suggest. In the Computational Learning Theory literature this is explained in terms of the VC dimension of the ensemble compared to that of each weak learner.

3. Classifiers are hurt less by overfitting than other function estimators (e.g., the famous risk bound of the nearest-neighbor classifier; Cover and Hart).

Figure  shows a case where boosting does overfit. The data are generated from two-dimensional spherical Gaussians with the same mean, with the variances chosen so that the Bayes error is substantial and with a modest number of samples per class. We used Real AdaBoost and stumps; the results were similar for all the boosting algorithms. After about fifty iterations the test error slowly increases.
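A rough reconstruction of this kind of overfitting experiment is sketched below, using scikit-learn's AdaBoost with its default stump base learner as a stand-in for Real AdaBoost; the variance ratio, sample size and number of iterations are illustrative assumptions, not the values behind the figure.

    # Sketch: two spherical 2-d Gaussians with a common mean but different variances;
    # boosted stumps, with the test error tracked over iterations.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    n = 400                                             # observations per class (assumed)
    X = np.vstack([rng.normal(scale=1.0, size=(n, 2)),  # class -1
                   rng.normal(scale=1.8, size=(n, 2))]) # class +1: larger variance
    y = np.r_[-np.ones(n), np.ones(n)]
    Xte = np.vstack([rng.normal(scale=1.0, size=(n, 2)),
                     rng.normal(scale=1.8, size=(n, 2))])
    yte = np.r_[-np.ones(n), np.ones(n)]

    clf = AdaBoostClassifier(n_estimators=400).fit(X, y)   # stumps are the default base
    test_err = [np.mean(pred != yte) for pred in clf.staged_predict(Xte)]
    print("test error at 50 and 400 iterations:", test_err[49], test_err[-1])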

Schapire et al. suggest that the properties of AdaBoost, including its resistance to overfitting, can be understood in terms of classification margins. However, Breiman presents evidence counter to this explanation. Whatever the explanation, the empirical evidence is strong.

Figure : Real AdaBoost with stumps on a noisy concentric-sphere problem. The test error (upper curve) increases after about fifty iterations.

The introduction of boosting by Schapire, Freund and colleagues has brought an exciting and important set of new ideas to the table.

Acknowledgments

We thank Andreas Buja for alerting us to the recent work on text classification at AT&T laboratories, Bogdan Popescu for illuminating discussions on PAC learning theory, and Leo Breiman and Robert Schapire for useful comments on an earlier version of this paper. We also thank two anonymous referees and an associate editor for detailed and useful comments on an earlier draft of the paper. Jerome Friedman was partially supported by the Department of Energy under contract number DEACSF and by grant DMS of the National Science Foundation. Trevor Hastie was partially supported by grants DMS and DMS from the National Science Foundation and grant ROICA from the National Institutes of Health. Robert Tibshirani was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Breiman, L. Bagging predictors. Machine Learning.
Breiman, L. Prediction games and arcing algorithms. Technical report, Statistics Department, University of California, Berkeley. Submitted to Neural Computing.
Breiman, L. (a). Arcing classifiers (with discussion). Annals of Statistics.
Breiman, L. (b). Combining predictors. Technical report, Statistics Department, University of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. Classification and Regression Trees. Wadsworth, Belmont, California.
Buja, A., Hastie, T. and Tibshirani, R. Linear smoothers and additive models (with discussion). Annals of Statistics.
Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
Dietterich, T. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning.
Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation.
Freund, Y. and Schapire, R. (a). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.
Freund, Y. and Schapire, R. E. (b). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference. Morgan Kaufmann, San Francisco.
Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
Friedman, J. Multivariate adaptive regression splines (with discussion). Annals of Statistics.
Friedman, J. Another approach to polychotomous classification. Technical report, Stanford University.
Friedman, J. Greedy function approximation: the gradient boosting machine. Technical report, Stanford University.
Friedman, J. and Stuetzle, W. Projection pursuit regression. Journal of the American Statistical Association.
Hastie, T. and Tibshirani, R. Generalized Additive Models. Chapman and Hall.
Hastie, T. and Tibshirani, R. Classification by pairwise coupling. Annals of Statistics.
Hastie, T., Tibshirani, R. and Buja, A. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association.
Holte, R. Very simple classification rules perform well on most commonly used datasets. Machine Learning.
Huber, P. Robust estimation of a location parameter. Annals of Mathematical Statistics.
Kearns, M. and Vazirani, U. An Introduction to Computational Learning Theory. MIT Press.
Mallat, S. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing.
McCullagh, P. and Nelder, J. Generalized Linear Models. Chapman and Hall.
Schapire, R. Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco.
Schapire, R. E. The strength of weak learnability. Machine Learning.
Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
Schapire, R., Freund, Y., Bartlett, P. and Lee, W. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics.
Valiant, L. G. A theory of the learnable. Communications of the ACM.