Additive Logistic Regression: a Statistical View of Boosting

Jerome Friedman
Trevor Hastie
Robert Tibshirani

July 23, 1998
Abstract

Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms often can be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to those of boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
Department of Statistics, Sequoia Hall, Stanford University, Stanford, California 94305; {jhf,[email protected]}

Department of Public Health Sciences, and Department of Statistics, University of Toronto; [email protected]
1 Introduction

The starting point for this paper is an interesting procedure called "boosting", which is a way of combining or boosting the performance of many "weak" classifiers to produce a powerful "committee". Boosting was proposed in the machine learning literature (Freund & Schapire 1996) and has since received much attention.

While boosting has evolved somewhat over the years, we first describe the most commonly used version of the AdaBoost procedure (Freund & Schapire 1996), which we call "Discrete" AdaBoost. Here is a concise description of AdaBoost in the two-class classification setting. We have training data (x_1, y_1), ..., (x_N, y_N) with x_i a vector valued feature and y_i = -1 or 1. We define F(x) = \sum_{m=1}^{M} c_m f_m(x), where each f_m(x) is a classifier producing values ±1 and the c_m are constants; the corresponding prediction is sign(F(x)). The AdaBoost procedure trains the classifiers f_m(x) on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. We describe the procedure in more detail in Algorithm 1.
Discrete AdaBoost (Freund & Schapire 1996)

1. Start with weights w_i = 1/N, i = 1, ..., N.

2. Repeat for m = 1, 2, ..., M:

   (a) Estimate the classifier f_m(x) from the training data with weights w_i.

   (b) Compute e_m = E_w[1_{(y ≠ f_m(x))}], c_m = log((1 - e_m)/e_m).

   (c) Set w_i ← w_i exp[c_m 1_{(y_i ≠ f_m(x_i))}], i = 1, 2, ..., N, and renormalize so that \sum_i w_i = 1.

3. Output the classifier sign[\sum_{m=1}^{M} c_m f_m(x)].

Algorithm 1: E_w is the expectation with respect to the weights w = (w_1, w_2, ..., w_N). At each iteration AdaBoost increases the weights of the observations misclassified by f_m(x) by a factor that depends on the weighted training error.
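To make the steps concrete, here is a minimal sketch of Algorithm 1 in Python, assuming decision stumps fit with scikit-learn as the weak classifier f_m(x) and labels coded as ±1; the function name and defaults are illustrative, not part of the original procedure.

```python
# A minimal sketch of Algorithm 1 (Discrete AdaBoost), assuming decision
# stumps from scikit-learn as the weak learner f_m(x); labels are +/-1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, M=100):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # step 1: uniform weights
    learners, coefs = [], []
    for m in range(M):
        # step 2(a): fit the weak classifier on the weighted sample
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # step 2(b): weighted training error e_m and coefficient c_m
        err = np.sum(w * (pred != y))
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against log(0)
        c = np.log((1 - err) / err)
        # step 2(c): up-weight misclassified cases and renormalize
        w *= np.exp(c * (pred != y))
        w /= w.sum()
        learners.append(stump)
        coefs.append(c)
    def F(Xnew):
        votes = sum(c * l.predict(Xnew) for c, l in zip(coefs, learners))
        return np.sign(votes)                # step 3: sign of the weighted vote
    return F
```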
Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for f_m(x) and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (NIPS workshop, 1996) called AdaBoost with trees the "best off-the-shelf classifier in the world". Interestingly, the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, it seems that AdaBoost is immune to overfitting.

Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 7). Included in the figure is the bagged tree (Breiman 1996), which averages trees grown on bootstrap resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.
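For comparison with the boosting sketches, here is a minimal sketch of bagging under the same assumptions (scikit-learn trees, ±1 labels): trees are grown on bootstrap resamples of the training data and their predictions averaged, the sign giving the committee's vote. The tree size and number of resamples are arbitrary illustrative choices.

```python
# A minimal sketch of bagging (Breiman 1996) for the two-class setting used
# here: grow trees on bootstrap resamples and average their +/-1 predictions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, max_leaf_nodes=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)      # bootstrap resample of the data
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        trees.append(tree.fit(X[idx], y[idx]))
    def F(Xnew):
        avg = sum(t.predict(Xnew) for t in trees) / n_trees
        return np.sign(avg)                   # majority vote of the bagged trees
    return F
```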
Early versions of AdaBoost used a resampling scheme to implement step 2 of Algorithm 1, by weighted importance sampling from the training data. This suggested a connection with bagging, and that a major component of the success of boosting has to do with variance reduction. However, boosting performs comparably well when:

- a weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight w_i. This removes the randomization component essential in bagging.

- "stumps" are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias. Bagging performs very poorly with stumps (Fig. 1 [top-right panel]).

These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.
The base classifier in Discrete AdaBoost produces a classification rule f_m(x): X → {-1, 1}, where X is the domain of the predictive features x. If the implementation of the base classifier cannot deal with observation weights, weighted resampling is used instead. Freund & Schapire (1996) and Schapire & Singer (1998) have suggested various modifications to improve the boosting algorithms; here we focus on a version due to Schapire & Singer (1998), which we call "Real AdaBoost", that uses real-valued "confidence-rated" predictions rather than the {-1, 1} of Discrete AdaBoost. The base classifier for this generalized boosting produces a mapping f_m(x): X → R; the sign of f_m(x) gives the classification, and |f_m(x)| a measure of the "confidence" in the prediction. This real-valued boosting tends to perform the best in our simulated examples in Fig. 1, especially with stumps, although we see that with 100-node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.

[Figure 1 here: three panels (Stumps, 10-Node Trees, 100-Node Trees) plotting test error (0.0-0.4) against the number of terms (0-400) for Bagging, Discrete AdaBoost and Real AdaBoost.]

Figure 1: Test error for Bagging (BAG), Discrete AdaBoost (DAB) and Real AdaBoost (RAB) on a simulated two-class nested spheres problem (see Section 5). There are 2000 training data points in 10 dimensions, and the Bayes error rate is zero. All trees are grown "best-first" without pruning. The left-most iteration corresponds to a single tree.
Real AdaBoost (Schapire & Singer 1998)

1. Start with weights w_i = 1/N, i = 1, 2, ..., N.

2. Repeat for m = 1, 2, ..., M:

   (a) Estimate the "confidence rated" classifier f_m(x): X → R and the constant c_m from the training data with weights w_i.

   (b) Set w_i ← w_i exp[-c_m y_i f_m(x_i)], i = 1, 2, ..., N, and renormalize so that \sum_i w_i = 1.

3. Output the classifier sign[\sum_{m=1}^{M} c_m f_m(x)].

Algorithm 2: The Real AdaBoost algorithm allows the estimator f_m(x) to range over R. In the special case that f_m(x) ∈ {-1, 1} it reduces to AdaBoost, since y_i f_m(x_i) is 1 for a correct and -1 for an incorrect classification. In the general case the constant c_m is absorbed into f_m(x). We describe the Schapire-Singer estimate for f_m(x) in Section 3.
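A minimal sketch of Algorithm 2 follows, again assuming weighted stumps fit with scikit-learn. The confidence-rated value assigned in each terminal node is half the weighted log-odds of class +1, one form of the Schapire-Singer estimate discussed in Section 3; the constant c_m is absorbed into f_m(x), as noted in the caption above.

```python
# A minimal sketch of Algorithm 2 (Real AdaBoost). The weak learner is a
# weighted stump whose value in each leaf is 0.5 * log(P_w(y=1)/P_w(y=-1)),
# a confidence-rated estimate of the kind derived in Section 3.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, M=100, eps=1e-12):
    N = len(y)
    w = np.full(N, 1.0 / N)
    stages = []
    for m in range(M):
        # step 2(a): partition the feature space with a weighted stump
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        leaf = stump.apply(X)
        # confidence-rated value per leaf: half the weighted log-odds of y = +1
        fvals = {}
        for node in np.unique(leaf):
            idx = leaf == node
            p = np.sum(w[idx] * (y[idx] == 1)) / np.sum(w[idx])
            fvals[node] = 0.5 * np.log((p + eps) / (1 - p + eps))
        f = np.array([fvals[node] for node in leaf])
        # step 2(b): down-weight cases where y_i f_m(x_i) is large and positive
        w *= np.exp(-y * f)
        w /= w.sum()
        stages.append((stump, fvals))
    def F(Xnew):
        total = np.zeros(len(Xnew))
        for stump, fvals in stages:
            total += np.array([fvals[n] for n in stump.apply(Xnew)])
        return np.sign(total)                # step 3
    return F
```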
Freund & Schapire (1996) and Schapire & Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice boosting achieves results far more impressive than the bounds would imply.

In this paper we analyze the AdaBoost procedures from a statistical perspective. We show that the underlying model they are fitting is additive logistic regression. The AdaBoost algorithms are Newton methods for optimizing a particular exponential loss function, a criterion which behaves much like the log-likelihood on the logistic scale. We also derive new boosting-like procedures for classification.
In Section 2 we briefly review additive modeling. Section 3 shows how boosting can be viewed as an additive model estimator, and proposes some new boosting methods for the two-class case. The multiclass problem is studied in Section 4. Simulated and real data experiments are discussed in Sections 5 and 6. Our tree-growing implementation, using truncated best-first trees, is described in Section 7. Weight trimming to speed up computation is discussed in Section 8, and we end with a discussion in Section 9.
2 Additive Models

AdaBoost produces an additive model F(x) = \sum_{m=1}^{M} c_m f_m(x), although "weighted committee" or "ensemble" sound more glamorous. Additive models have a long history in statistics, and we give some examples here.
2.1 Additive Regression Models

We initially focus on the regression problem, where the response y is quantitative, and we are interested in modeling the mean E(Y|x) = F(x). The additive model has the form

    F(x) = \sum_{j=1}^{p} f_j(x_j).    (1)

Here there is a separate function f_j(x_j) for each of the p input variables x_j. More generally, each component f_j is a function of a small, pre-specified subset of the input variables. The backfitting algorithm (Friedman & Stuetzle 1981, Buja, Hastie & Tibshirani 1989) is a convenient modular algorithm for fitting additive models. A backfitting update is

    f_j(x_j) ← E[ y - \sum_{k ≠ j} f_k(x_k) | x_j ].    (2)

Any method or algorithm for estimating a function of x_j can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. In the right hand side, all the latest versions of the functions f_k are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of E(y - F(x))^2 (Buja et al. 1989).
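As an illustration, here is a minimal sketch of the backfitting cycle built around update (2), assuming a simple Gaussian-kernel smoother as the per-coordinate estimator of the conditional expectation; any univariate smoother (local regression, smoothing splines) could be substituted, and the helper names are illustrative.

```python
# A minimal sketch of backfitting for the additive model (1), using a
# Nadaraya-Watson kernel smoother as the estimator of E[. | x_j] in (2).
import numpy as np

def kernel_smooth(x, r, bandwidth=0.5):
    """Fitted values of a Gaussian-kernel smoother of r on x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2)
    return K @ r / K.sum(axis=1)

def backfit(X, y, n_cycles=20, bandwidth=0.5):
    N, p = X.shape
    alpha = y.mean()                          # intercept
    f = np.zeros((N, p))                      # f_j evaluated at the data x_ij
    for _ in range(n_cycles):                 # cycle until (approximate) convergence
        for j in range(p):
            partial = y - alpha - f.sum(axis=1) + f[:, j]   # partial residuals
            f[:, j] = kernel_smooth(X[:, j], partial, bandwidth)
            f[:, j] -= f[:, j].mean()         # center each f_j for identifiability
    fitted = alpha + f.sum(axis=1)            # F(x_i) as in (1)
    return alpha, f, fitted
```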
2.2 Extended Additive Models

More generally, one can consider additive models whose elements {f_m(x)}_{1}^{M} are functions of potentially all of the input features x. Usually, in this context, the f_m(x) are taken to be simple functions characterized by a set of parameters γ and a multiplier β_m,

    f_m(x) = β_m b(x; γ_m).    (3)

The additive model then becomes

    F_M(x) = \sum_{m=1}^{M} β_m b(x; γ_m).    (4)

For example, in single hidden layer neural networks b(x; γ) = σ(γ^t x), where σ(·) is a sigmoid function and γ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice with γ parameterizing the location and scale of a "mother" wavelet b(x; γ). In these applications {b(x; γ_m)}_{1}^{M} are generally called "basis functions" since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates

    {β_m, γ_m} ← arg min_{β, γ} E[ y - \sum_{k ≠ m} β_k b(x; γ_k) - β b(x; γ) ]^2.    (5)
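A minimal sketch of the generalized backfitting update (5) is given below, assuming regression stumps as the parameterized basis elements b(x; γ) with the multiplier β absorbed into the leaf constants; the M terms are revisited in cycles, each refit to the partial residuals that leave out the current term.

```python
# A minimal sketch of generalized backfitting (5) with regression stumps
# playing the role of b(x; gamma); beta is absorbed into the leaf values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def generalized_backfit(X, y, M=10, n_cycles=20):
    N = len(y)
    fitted = np.zeros((N, M))                 # current f_m(x_i) for each term
    terms = [None] * M
    for _ in range(n_cycles):
        for m in range(M):
            # partial residuals leaving out term m, as in (5)
            partial = y - fitted.sum(axis=1) + fitted[:, m]
            stump = DecisionTreeRegressor(max_depth=1).fit(X, partial)
            terms[m] = stump
            fitted[:, m] = stump.predict(X)
    def F(Xnew):                              # F_M(x) of (4)
        return sum(t.predict(Xnew) for t in terms)
    return F
```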
Alternatively, one can use a "greedy" forward stepwise approach

    {β_m, γ_m} ← arg min_{β, γ} E[ y - F_{m-1}(x) - β b(x; γ) ]^2    (6)

where {β_k, γ_k}_{1}^{m-1} are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat & Zhang (1993) in "matching pursuit". There the b(x; γ) represent an over-complete wavelet-like basis. In the language of boosting, f(x) = β b(x; γ) would be called a "weak learner" and F_M(x) (4) the "committee". If decision trees were used as the weak learner, the parameters γ would represent the splitting variables, split points, the constants in each terminal node, and the number of terminal nodes of each tree.

Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

    y ← y - \sum_{k ≠ m} f_k(x).

In the forward stepwise procedure (6) the modified output y_m at the mth iteration depends only on its value y_{m-1} and the solution f_{m-1}(x) at the previous iteration

    y_m = y_{m-1} - f_{m-1}(x).    (7)

At each step m, the previous output values y_{m-1} are modified (7) so that the previous model f_{m-1}(x) has no explanatory power on the new outputs y_m. One can therefore view this as a procedure for boosting a weak learner f_1(x) = β_1 b(x; γ_1) to form a powerful committee F_M(x) (4).
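A minimal sketch of the greedy forward stagewise fit (6)-(7) under the same stump assumption: each new term is fit to the current modified outputs (7), and the committee F_M(x) of (4) is the sum of the fitted terms.

```python
# A minimal sketch of forward stagewise fitting (6)-(7) with regression stumps
# as the weak learner b(x; gamma); beta_m is absorbed into the leaf constants.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=200):
    residual = y.astype(float).copy()         # y_1 = y
    learners = []
    for m in range(M):
        # fit the weak learner to the current modified outputs
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        learners.append(stump)
        # (7): y_{m+1} = y_m - f_m(x), removing f_m's explanatory power
        residual -= stump.predict(X)
    def F(Xnew):                              # the committee F_M(x) in (4)
        return sum(stump.predict(Xnew) for stump in learners)
    return F
```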
2.3 Classification problems

For the classification problem, we learn from Bayes theorem that all we need is P(y = j|x), the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that E(1_{(y=j)}|x) = P(y = j|x), where 1_{(y=j)} is the 0/1 indicator variable representing class j. While this works fairly well in general, several problems have been noted (Hastie, Tibshirani & Buja 1994) for constrained regression methods. The estimates are typically not confined to [0, 1], and severe masking problems can occur. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman et al. (1984).

Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, the model is

    log [ P(y = 1|x) / P(y = 0|x) ] = \sum_{m=1}^{M} f_m(x).    (8)

The monotone logit transformation on the left guarantees that for any values of F(x) = \sum_{m=1}^{M} f_m(x) ∈ R, the probability estimates lie in [0, 1]; inverting, we get

    p(x) = P(y = 1|x) = e^{F(x)} / (1 + e^{F(x)}).    (9)

Here we have given a general additive form for F(x); special cases exist that are well known in statistics. In particular, the linear logistic regression model (McCullagh & Nelder 1989, for example) and additive logistic regression model (Hastie & Tibshirani 1990) are popular. These models are usually fit by maximizing the binomial log-likelihood, and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.

A generalized version of backfitting (2), called "Local Scoring" in Hastie & Tibshirani (1990), is used to fit the additive logistic model. Starting with guesses f_1(x_1), ..., f_p(x_p), F(x) = \sum_k f_k(x_k) and p(x) defined in (9), we form the working response:

    z = F(x) + (y - p(x)) / (p(x)(1 - p(x))).    (10)

We then apply backfitting to the response z with observation weights p(x)(1 - p(x)) to obtain new f_k(x_k). This process is repeated until convergence. The forward stage-wise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
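Below is a minimal sketch of the Local Scoring loop for the additive logistic model (8)-(10), assuming a weighted Gaussian-kernel smoother inside the weighted backfitting pass and a 0/1 coding of y; the choice of smoother and the iteration counts are illustrative, not prescribed by the algorithm.

```python
# A minimal sketch of Local Scoring for the additive logistic model (8)-(10):
# each outer step forms the working response z in (10), then runs a weighted
# backfitting pass with observation weights p(x)(1 - p(x)).
import numpy as np

def wkernel_smooth(x, z, w, bandwidth=0.5):
    """Weighted Gaussian-kernel smoother fit of z on x with weights w."""
    d = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2) * w[None, :]
    return K @ z / K.sum(axis=1)

def local_scoring(X, y, n_outer=10, n_inner=5, bandwidth=0.5):
    N, p = X.shape
    f = np.zeros((N, p))                      # additive components f_j(x_ij)
    alpha = 0.0
    for _ in range(n_outer):
        F = alpha + f.sum(axis=1)
        prob = 1.0 / (1.0 + np.exp(-F))       # p(x) from (9)
        prob = np.clip(prob, 1e-6, 1 - 1e-6)
        w = prob * (1 - prob)                 # observation weights
        z = F + (y - prob) / w                # working response (10)
        alpha = np.average(z, weights=w)
        for _ in range(n_inner):              # weighted backfitting on z
            for j in range(p):
                partial = z - alpha - f.sum(axis=1) + f[:, j]
                f[:, j] = wkernel_smooth(X[:, j], partial, w, bandwidth)
                f[:, j] -= np.average(f[:, j], weights=w)
    return alpha, f
```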
3 Boosting: an Additive Logistic Regression Model

In this section we show that the boosting algorithms are stage-wise estimation procedures for fitting an additive logistic regression model. They optimize an exponential criterion which to second order is equivalent to the binomial log-likelihood criterion. We then propose a more standard likelihood-based procedure.

3.1 An Exponential Criterion

Consider minimizing the criterion