Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.

Game Theory, On-line Prediction and Boosting

Yoav Freund        Robert E. Schapire
AT&T Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974-0636
{yoav, schapire}@research.att.com
(Home page: "http://www.research.att.com/orgs/ssr/people/uid", expected to change to "http://www.research.att.com/~uid" some time in the near future, for uid in {yoav, schapire}.)

Abstract

We study the close connections between game theory, on-line prediction and boosting. After a brief review of game theory, we describe an algorithm for learning to play repeated games based on the on-line prediction methods of Littlestone and Warmuth. The analysis of this algorithm yields a simple proof of von Neumann's famous minmax theorem, as well as a provable method of approximately solving a game. We then show that the on-line prediction model is obtained by applying this game-playing algorithm to an appropriate choice of game and that boosting is obtained by applying the same algorithm to the "dual" of this game.

1 INTRODUCTION

The purpose of this paper is to bring out the close connections between game theory, on-line prediction and boosting. Briefly, game theory is the study of games and other interactions of various sorts. On-line prediction is a learning model in which an agent predicts the classification of a sequence of items and attempts to minimize the total number of prediction errors. Finally, boosting is a method of converting a "weak" learning algorithm which performs only slightly better than random guessing into one that performs extremely well. All three of these topics will be explained in more detail below. All have been studied extensively in the past. In this paper, the close relationship between these three seemingly unrelated topics will be brought out.

Here is an outline of the paper. We will begin with a review of game theory. Then we will describe an algorithm for learning to play repeated games based on the on-line prediction methods of Littlestone and Warmuth [15]. The analysis of this algorithm yields a new (as far as we know) and simple proof of von Neumann's famous minmax theorem, as well as a provable method of approximately solving a game. In the last part of the paper we show that the on-line prediction model is obtained by applying the game-playing algorithm to an appropriate choice of game and that boosting is obtained by applying the same algorithm to the "dual" of this game.

2 GAME THEORY

We begin with a review of basic game theory. Further background can be found in any introductory text on game theory; see for instance Fudenberg and Tirole [11]. We study two-person games in normal form. That is, each game is defined by a matrix M. There are two players called the row player and the column player. To play the game, the row player chooses a row i and, simultaneously, the column player chooses a column j. The selected entry M(i, j) is the loss suffered by the row player.

For instance, the loss matrix for the children's game "Rock, Paper, Scissors" is given by:

          R     P     S
     R   1/2    1     0
     P    0    1/2    1
     S    1     0    1/2

The row player's goal is to minimize its loss. Often, the goal of the column player is to maximize this loss, in which case the game is said to be "zero-sum." Most of our results are given in the context of a zero-sum game. However, our results also apply when no assumptions are made about the goal or strategy of the column player. We return to this point below.

For the sake of simplicity, we assume that all the losses M(i, j) are in the range [0, 1]. Simple scaling can be used to get more general results. Also, we restrict ourselves to the case where the number of choices available to each player is finite. However, most of the results translate with very mild additional assumptions to cases in which the number of choices is infinite. For a discussion of infinite matrix games see, for instance, Chapter 2 in Ferguson [3].

2.1 RANDOMIZED PLAY

As described above, the players choose a single row or column. Usually, this choice of play is allowed to be randomized. That is, the row player chooses a distribution P over the rows of M, and (simultaneously) the column player chooses a distribution Q over columns.

The row player's expected loss is easily computed as

  M(P, Q) = Σ_i Σ_j P(i) M(i, j) Q(j) = Pᵀ M Q.

For ease of notation, we will often denote this quantity by M(P, Q) and refer to it simply as the loss (rather than expected loss). In addition, if the row player chooses a distribution P but the column player chooses a single column j, then the (expected) loss is Σ_i P(i) M(i, j), which we denote by M(P, j). The notation M(i, Q) is defined analogously.

Individual (deterministically chosen) rows and columns are called pure strategies. Randomized plays defined by distributions P and Q over rows and columns are called mixed strategies. The number of rows of the matrix M will be denoted by n.
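To fix the notation, here is a minimal numpy sketch (our illustration, not part of the paper) evaluating M(P, Q), M(P, j) and M(i, Q) on the "Rock, Paper, Scissors" matrix above; the particular strategy vectors are arbitrary examples:

```python
import numpy as np

# Loss matrix for Rock, Paper, Scissors (the row player's loss).
M = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

P = np.array([1/3, 1/3, 1/3])    # mixed strategy over rows
Q = np.array([0.5, 0.25, 0.25])  # mixed strategy over columns

print(P @ M @ Q)    # M(P,Q): expected loss under both mixed strategies
print(P @ M[:, 1])  # M(P,j): column player commits to j = 1 (Paper)
print(M[0, :] @ Q)  # M(i,Q): row player commits to i = 0 (Rock)
```

(The uniform P happens to guarantee loss exactly 1/2 against every Q; as will become clear below, this is no accident, since it is a minmax strategy for this game.)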

2.2 SEQUENTIAL PLAY

Up until now, we have assumed that the players choose their (pure or mixed) strategies simultaneously. Suppose now that instead play is sequential. That is, suppose that the column player chooses its strategy Q after the row player has chosen and announced its strategy P. Assume further that the column player's goal is to maximize the row player's loss (i.e., that the game is zero-sum). Then given P, such a "worst-case" or "adversarial" column player will choose Q to maximize M(P, Q); that is, if the row player plays mixed strategy P, then its loss will be

  max_Q M(P, Q).                                   (1)

(It is understood here and throughout the paper that max_Q denotes maximum over all probability distributions over columns; similarly, min_P will always denote minimum over all probability distributions over rows. These extrema exist because the set of distributions over a finite space is compact.)

Knowing this, the row player should choose P to minimize Eq. (1), so the row player's loss will be

  min_P max_Q M(P, Q).

A mixed strategy P* realizing this minimum is called a minmax strategy.

Suppose now that the column player plays first and the row player can choose its play with the benefit of knowing the column player's chosen strategy Q. Then by a symmetric argument, the loss of the row player will be

  max_Q min_P M(P, Q)

and a Q* realizing the maximum is called a maxmin strategy.

2.3 THE MINMAX THEOREM

Intuitively, we expect the player who chooses its strategy last to have the advantage, since it plays knowing its opponent's strategy exactly. Thus, we expect

  max_Q min_P M(P, Q) ≤ min_P max_Q M(P, Q).       (2)

We might go on naively to conjecture that the advantage of playing last is strict for some games so that, at least in some cases, the inequality in Eq. (2) is strict. Surprisingly, it turns out not to matter which player plays first. Von Neumann's well-known minmax theorem states that the outcome is the same in either case, so that

  max_Q min_P M(P, Q) = min_P max_Q M(P, Q)        (3)

for every matrix M. The common value v of the two sides of the equality is called the value of the game M. A proof of the minmax theorem will be given in Section 2.5.

In words, Eq. (3) means that the row player has a (minmax) strategy P* such that, regardless of the strategy Q played by the column player, the loss suffered M(P*, Q) will be at most v. Symmetrically, it means that the column player has a (maxmin) strategy Q* such that, regardless of the strategy P played by the row player, the loss M(P, Q*) will be at least v. This means that the strategies P* and Q* are optimal in a strong sense.

Thus, classical game theory says that given a (zero-sum) game M, one should play using a minmax strategy. Such a strategy can be computed using linear programming. However, there are a number of problems with this approach. For instance,

- M may be unknown;

- M may be so large that computing a minmax strategy using linear programming is infeasible;

- the column player may not be truly adversarial and may behave in a manner that admits loss significantly smaller than the game value v.

Overcoming these difficulties in the one-shot game is hopeless. But suppose instead that we are playing the game repeatedly. Then it is natural to ask if one can learn to play well against the particular opponent that is being faced.

2.4 REPEATED PLAY

Such a model of repeated play can be formalized as described below. To emphasize the roles of the two players, we refer to the row player as the learner and the column player as the environment.

Let M be a matrix, possibly unknown to the learner. The game is played repeatedly in a sequence of rounds. On round t = 1, ..., T:

1. the learner chooses mixed strategy P_t;

2. the environment chooses mixed strategy Q_t (which may be chosen with knowledge of P_t);

3. the learner is permitted to observe the loss M(i, Q_t) for each row i; this is the loss it would have suffered had it played using pure strategy i;

4. the learner suffers loss M(P_t, Q_t).

The goal of the learner is to do almost as well as the best mixed strategy against the actual sequence of plays Q_1, ..., Q_T which were chosen by the environment. That is, the learner's goal is to suffer cumulative loss

  Σ_{t=1}^T M(P_t, Q_t)

which is "not much worse" than the loss of the best strategy in hindsight:

  min_P Σ_{t=1}^T M(P, Q_t).

An algorithm for solving this problem can be derived by a direct generalization of Littlestone and Warmuth's "weighted majority algorithm" [15], and is essentially equivalent to our earlier "Hedge" algorithm [9]. The algorithm, called LW, is quite simple. The learner maintains nonnegative weights on the rows of M; let w_t(i) denote the weight at time t on row i. Initially, all the weights are set to unity: w_1(i) = 1. On each round t, the learner computes mixed strategy P_t by normalizing the weights:

  P_t(i) = w_t(i) / Σ_{i'} w_t(i').

Then, given M(i, Q_t) for each i, the learner updates the weights by the simple multiplicative rule:

  w_{t+1}(i) = w_t(i) · β^{M(i, Q_t)}.

Here, β ∈ [0, 1) is a parameter of the algorithm.
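Written out in Python, LW is only a few lines. The following sketch is our own rendering (the function name and interface are assumptions, not from the paper); on each round it consumes the vector of losses M(i, Q_t) that step 3 of the repeated-play protocol makes observable:

```python
import numpy as np

def lw(loss_rounds, n, beta):
    """Run algorithm LW.

    loss_rounds: iterable of length-n arrays; the t-th array gives
    M(i, Q_t) in [0,1] for every row i.  Returns the cumulative loss
    suffered and the mixed strategies P_1, ..., P_T that were played.
    """
    w = np.ones(n)                      # w_1(i) = 1 for every row i
    total, strategies = 0.0, []
    for loss in loss_rounds:
        loss = np.asarray(loss, dtype=float)
        P = w / w.sum()                 # P_t(i) = w_t(i) / sum_i' w_t(i')
        strategies.append(P)
        total += P @ loss               # learner suffers M(P_t, Q_t)
        w = w * beta ** loss            # w_{t+1}(i) = w_t(i) * beta^{M(i,Q_t)}
    return total, strategies
```

Notice that the matrix M itself never appears: the update needs only the observed loss vector, which is exactly what the repeated-play protocol provides.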

The main theorem concerning this algorithm is the following:

Theorem 1  For any matrix M with n rows and entries in [0, 1], and for any sequence of mixed strategies Q_1, ..., Q_T played by the environment, the sequence of mixed strategies P_1, ..., P_T produced by algorithm LW with parameter β ∈ [0, 1) satisfies

  Σ_{t=1}^T M(P_t, Q_t) ≤ a_β min_P Σ_{t=1}^T M(P, Q_t) + c_β ln n

where

  a_β = ln(1/β) / (1 − β)   and   c_β = 1 / (1 − β).

Proof: The proof follows directly from Theorem 2 of Freund and Schapire [9], which in turn is a simple and direct generalization of Littlestone and Warmuth [15]. For completeness, we provide a short proof in the appendix.

As β approaches 1, a_β also approaches 1. In addition, for fixed β and as the number of rounds T becomes large, the second term c_β ln n becomes negligible (since it is fixed) relative to T. Thus, by choosing β close to 1, the learner can ensure that its loss will not be much worse than the loss of the best strategy. This is formalized in the following corollary:

Corollary 2  Under the conditions of Theorem 1 and with β set to

  β = 1 / (1 + √(2 ln n / T)),

the average per-trial loss suffered by the learner is

  (1/T) Σ_{t=1}^T M(P_t, Q_t) ≤ min_P (1/T) Σ_{t=1}^T M(P, Q_t) + Δ_T

where

  Δ_T = √(2 ln n / T) + (ln n) / T = O(√(ln n / T)).

Proof: See Section 2.2 in Freund and Schapire [9].

Since Δ_T → 0 as T → ∞, we see that the amount by which the average per-trial loss of the learner exceeds that of the best mixed strategy can be made arbitrarily small for large T.

For simplicity, the results in the remainder of the paper are based on Corollary 2 rather than Theorem 1. The details of the algorithm to which this corollary applies are largely unimportant; the corollary could, in principle, be applied to any algorithm with similar properties. Indeed, algorithms for this problem with similar properties were derived by Hannan [13], Blackwell [1] and Foster and Vohra [6, 5, 4].¹ Also, Fudenberg and Levine [10] independently proposed an algorithm equivalent to LW and proved a slightly weaker version of Corollary 2.

As a simple first corollary, we see that the loss of LW can never exceed the value of the game M by more than Δ_T:

Corollary 3  Under the conditions of Corollary 2,

  (1/T) Σ_{t=1}^T M(P_t, Q_t) ≤ v + Δ_T

where v is the value of the game M.

Proof: Let P* be a minmax strategy for M, so that for all column strategies Q, M(P*, Q) ≤ v. Then, by Corollary 2,

  (1/T) Σ_{t=1}^T M(P_t, Q_t) ≤ (1/T) Σ_{t=1}^T M(P*, Q_t) + Δ_T ≤ v + Δ_T.

Note that in the analysis we made no assumption about the strategy used by the environment. Theorem 1 guarantees that the learner's cumulative loss is not much larger than that of any fixed mixed strategy. As shown above, this implies, in particular, that the loss cannot be much larger than the game value. However, if the environment is non-adversarial, there might be a better fixed mixed strategy for the player, in which case the algorithm is guaranteed to be almost as good as this better strategy.

¹However, Hannan's algorithm requires prior knowledge of the entire game matrix.

2.5 PROOF OF THE MINMAX THEOREM

More interestingly, Corollary 2 can be used to derive a very simple proof of von Neumann's minmax theorem. To prove this theorem, we need to show that

  min_P max_Q M(P, Q) ≤ max_Q min_P M(P, Q).       (4)

(Proving that min_P max_Q M(P, Q) ≥ max_Q min_P M(P, Q) is relatively straightforward and so is omitted.)

Suppose that we run algorithm LW against a maximally adversarial environment which always chooses strategies which maximize the learner's loss. That is, on each round t, the environment chooses

  Q_t = arg max_Q M(P_t, Q).                       (5)

Let P̄ = (1/T) Σ_{t=1}^T P_t and Q̄ = (1/T) Σ_{t=1}^T Q_t. Clearly, P̄ and Q̄ are probability distributions. Then we have:

  min_P max_Q Pᵀ M Q
    ≤ max_Q P̄ᵀ M Q
    = max_Q (1/T) Σ_{t=1}^T P_tᵀ M Q              (by definition of P̄)
    ≤ (1/T) Σ_{t=1}^T max_Q P_tᵀ M Q
    = (1/T) Σ_{t=1}^T P_tᵀ M Q_t                  (by definition of Q_t)
    ≤ min_P (1/T) Σ_{t=1}^T Pᵀ M Q_t + Δ_T        (by Corollary 2)
    = min_P Pᵀ M Q̄ + Δ_T                          (by definition of Q̄)
    ≤ max_Q min_P Pᵀ M Q + Δ_T.

Since Δ_T can be made arbitrarily close to zero, this proves Eq. (4) and the minmax theorem.

2.6 APPROXIMATELY SOLVING A GAME

Aside from yielding a proof for a famous theorem that by now has many proofs, the preceding derivation shows that algorithm LW can be used to find an approximate minmax or maxmin strategy. Finding these "optimal" strategies is called solving the game M.

Skipping the first inequality of the sequence of equalities and inequalities above, we see that

  max_Q M(P̄, Q) ≤ max_Q min_P M(P, Q) + Δ_T = v + Δ_T.

Thus, the vector P̄ is an approximate minmax strategy in the sense that, for all column strategies Q, M(P̄, Q) does not exceed the game value v by more than Δ_T. Since Δ_T can be made arbitrarily small, this approximation can be made arbitrarily tight.

Similarly, ignoring the last inequality of this derivation, we have that

  min_P M(P, Q̄) ≥ v − Δ_T

so Q̄ also is an approximate maxmin strategy. Furthermore, it can be shown that Q_t satisfying Eq. (5) can always be chosen to be a pure strategy (i.e., a mixed strategy concentrated on a single column of M). Therefore, the approximate maxmin strategy Q̄ has the additional favorable property of being sparse in the sense that at most T of its entries will be nonzero.

Viewing LW as a method of approximately solving a game will be central to our derivation of a boosting algorithm (Section 4). Similar and closely related methods of approximately solving linear programming problems have previously appeared, for instance, in the work of Plotkin, Shmoys and Tardos [16].
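Here is a short sketch of this game-solving procedure (our own, with assumed names): LW is played against the best-response environment of Eq. (5), and the averaged strategies P̄ and Q̄ are returned.

```python
import numpy as np

def approx_solve(M, T):
    """Approximately solve the zero-sum game M (entries in [0,1]).

    Returns (P_bar, Q_bar); by the analysis above, each is within
    Delta_T of minmax / maxmin optimal.  Q_bar is sparse: the
    environment plays a pure best-response column each round, so at
    most T of its entries are nonzero.
    """
    n, m = M.shape
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / T))  # as in Corollary 2
    w = np.ones(n)
    P_bar, Q_bar = np.zeros(n), np.zeros(m)
    for _ in range(T):
        P = w / w.sum()
        j = int(np.argmax(P @ M))   # Eq. (5): pure column maximizing M(P_t, .)
        P_bar += P / T
        Q_bar[j] += 1.0 / T
        w = w * beta ** M[:, j]     # LW update with observed losses M(i, j)
    return P_bar, Q_bar
```

On the "Rock, Paper, Scissors" matrix of Section 2, for example, P̄ approaches the uniform minmax strategy as T grows.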

3 ON-LINE PREDICTION

Since the game-playing algorithm LW presented in Section 2.4 is a direct generalization of the on-line prediction algorithm of Littlestone and Warmuth [15], it is not surprising that an on-line prediction algorithm can be derived from the more general game-playing algorithm by an appropriate choice of game M. In this section, we make this connection explicit.

In the on-line prediction model, first introduced by Littlestone [14], the learner observes a sequence of examples and predicts their labels one at a time. The learner's goal is to minimize its prediction errors.

Formally, let X be a finite set of instances, and let H be a finite set of hypotheses h : X → {0, 1}. Let c : X → {0, 1} be an unknown target concept, not necessarily in H.²

In the on-line prediction model, learning takes place in a sequence of rounds. On round t = 1, ..., T:

1. the learner observes an example x_t ∈ X;

2. the learner makes a randomized prediction ŷ_t ∈ {0, 1} of the label associated with x_t;

3. the learner observes the correct label c(x_t).

The goal of the learner is to minimize the expected number of mistakes that it makes relative to the best hypothesis in the space H. (The expectation here is with respect to the learner's own randomization.) Thus, we ask that the learner perform well whenever the target c is "close" to one of the hypotheses in H.

It is straightforward now to reduce the on-line prediction problem to a special case of the repeated game problem. The environment's choice of a column corresponds to a choice of an instance x ∈ X that is presented to the learner on a given iteration. The learner's choice of a row corresponds to choosing a specific hypothesis h ∈ H and predicting the label h(x). A mixed strategy for the learner corresponds to making a random choice of a hypothesis with which to predict. In this reduction the environment uses only pure strategies. The game matrix M thus has |H| rows, indexed by h ∈ H, and |X| columns, indexed by x ∈ X. The matrix entry that is associated with hypothesis h and instance x is

  M(h, x) = 1 if h(x) ≠ c(x), and 0 otherwise.

Thus, M(h, x) is 1 if and only if h disagrees with the target c on instance x. We call this a mistake matrix.

The application of the algorithm LW described in Section 2.4 to the on-line prediction problem is as follows.³ We apply the algorithm to mistake matrix M. On round t, given instance x_t, LW provides us with a distribution P_t over rows of M (i.e., over hypothesis space H). We randomly select h ∈ H according to P_t and predict ŷ_t = h(x_t). Next, given c(x_t), we compute M(h, x_t) for each h ∈ H and update the weights maintained by LW. (Here, the strategy Q_t is simply the pure strategy concentrated on the x_t column of M.)

For the analysis, note that

  M(P_t, x_t) = Σ_h P_t(h) M(h, x_t) = Pr_{h∼P_t}[h(x_t) ≠ c(x_t)] = Pr[ŷ_t ≠ c(x_t)].

Therefore, the expected number of mistakes made by the learner equals

  Σ_{t=1}^T M(P_t, x_t) ≤ min_h Σ_{t=1}^T M(h, x_t) + O(√(T ln |H|))

by a direct application of Corollary 2 (for an appropriate choice of β). Thus, the expected number of mistakes made by the learner cannot exceed the number of mistakes made by the best hypothesis in H by more than O(√(T ln |H|)).

A more careful analysis (using Theorem 1 rather than Corollary 2) gives a better bound identical to that obtained by Littlestone and Warmuth [15] (not surprisingly). Still better bounds using more sophisticated methods were obtained by Cesa-Bianchi et al. [2] and Vovk [18].

This result can be straightforwardly generalized to any bounded loss function (such as square loss rather than zero-one mistake loss), and also to a setting in which the learner competes against a set of experts rather than a fixed set of hypotheses. (See, for instance, Cesa-Bianchi et al. [2] and Freund and Schapire [9].)

²As was said above, much of this analysis can be generalized to infinite sets. The cardinality of the set of examples is actually of no real consequence. Littlestone and Warmuth [15] generalize their results to countably infinite sets of hypotheses, and Freund and Schapire [9] and Freund [8] give generalizations to uncountably infinite sets of hypotheses.

³The reduction is not specific to the use of LW. Other algorithms for playing repeated games can be combined with this reduction to give on-line learning algorithms. However, these algorithms need to be capable of working without complete knowledge of the matrix. It should be sufficient for the algorithm to receive as input only the identity and contents of columns that have been chosen by the environment in the past.
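The reduction is easy to express in code; the sketch below (ours, with assumed names and stream interface) makes explicit that the learner needs to see only the one column x_t actually played, as footnote 3 requires:

```python
import numpy as np

def online_predict(hypotheses, stream, beta, seed=0):
    """LW run on the mistake matrix M(h,x) = 1 if h(x) != c(x), else 0.

    hypotheses: list of functions h(x) -> {0,1};
    stream: iterable of pairs (x_t, c(x_t)).
    """
    rng = np.random.default_rng(seed)
    w = np.ones(len(hypotheses))
    mistakes, expected_mistakes = 0, 0.0
    for x, label in stream:
        P = w / w.sum()                           # P_t over hypotheses
        i = rng.choice(len(hypotheses), p=P)      # pick h ~ P_t ...
        mistakes += int(hypotheses[i](x) != label)  # ... and predict h(x_t)
        err = np.array([h(x) != label for h in hypotheses], dtype=float)
        expected_mistakes += P @ err              # M(P_t, x_t) = Pr[y-hat wrong]
        w = w * beta ** err                       # update on pure column x_t
    return mistakes, expected_mistakes
```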

4 BOOSTING

The third topic of this paper is boosting. Boosting is the problem of converting a "weak" learning algorithm that performs just slightly better than random guessing into one that performs with arbitrarily good accuracy. The first provably effective boosting algorithm was discovered by Schapire [17]. Freund [7] subsequently presented a much improved boosting algorithm which is optimal in particular circumstances. The boosting algorithm derived in this section is closely related to Freund and Schapire's more recent "AdaBoost" boosting algorithm [9].

As in Section 3, let X be a space of instances, H a space of hypotheses, and c the target concept. For γ > 0, we say that algorithm WL is a γ-weak learning algorithm for (H, c) if, for any distribution D over the set X, the algorithm takes as input a set of labeled examples distributed according to D and outputs a hypothesis h ∈ H with error at most 1/2 − γ, i.e.,

  Pr_{x∼D}[h(x) ≠ c(x)] ≤ 1/2 − γ.

Given a weak learning algorithm, the goal of boosting is to run the weak learning algorithm many times on many distributions, and to combine the selected hypotheses into a final hypothesis with arbitrarily small error rate. For the purposes of this paper, we simplify the boosting model further to require that the final hypothesis have error zero, so that all instances are correctly classified. The algorithm presented can certainly be modified to fit the more standard (and practical) model in which the final error must be less than some positive parameter ε (see Freund and Schapire [9] for more details).⁴

Thus, boosting proceeds in rounds. On round t = 1, ..., T:

1. the booster constructs a distribution D_t on X which is passed to the weak learner;

2. the weak learner produces a hypothesis h_t ∈ H with error at most 1/2 − γ:

     Pr_{x∼D_t}[h_t(x) ≠ c(x)] ≤ 1/2 − γ.

After T rounds, the weak hypotheses h_1, ..., h_T are combined into a final hypothesis h_fin.

The important issues for designing a boosting algorithm are: (1) how to choose distributions D_t, and (2) how to combine the h_t's into a final hypothesis.

⁴The standard boosting model usually also includes a "confidence" parameter δ > 0 which bounds the probability of the boosting algorithm failing to find a final hypothesis with low error. This parameter is necessary if we assume that the weak learner only succeeds with high probability. However, because we here make the simplifying assumption that the weak learner always succeeds in finding a weak hypothesis with error at most 1/2 − γ, we have no need of a confidence parameter and instead require that the boosting algorithm succeed with absolute certainty.

4.1 BOOSTING AND THE MINMAX THEOREM

Before describing our boosting algorithm, let us step back for a moment to consider the relationship between the mistake matrix M used in Section 3 and the minmax theorem. This relationship will turn out to be highly relevant to the design and understanding of the boosting algorithm.

Recall that the mistake matrix M has rows and columns indexed by hypotheses and instances, respectively, and that M(h, x) = 1 if h(x) ≠ c(x) and is zero otherwise. Assuming (H, c) are γ-weakly learnable (so that there exists a γ-weak learning algorithm), what does the minmax theorem say about M? Suppose that the value of M is v. Then

  min_P max_x M(P, x) = min_P max_Q M(P, Q)
                      = v
                      = max_Q min_P M(P, Q)
                      = max_Q min_h M(h, Q).       (6)

(It is straightforward to show that, for any Q, min_P M(P, Q) is realized at a pure strategy h ∈ H; similarly for x and max.) Note that, by M's definition,

  M(h, Q) = Pr_{x∼Q}[h(x) ≠ c(x)].

Therefore, the right hand part of Eq. (6) says that there exists a distribution Q* on X such that for every hypothesis h,

  M(h, Q*) = Pr_{x∼Q*}[h(x) ≠ c(x)] ≥ v.

However, because we assume γ-weak learnability, there must exist a hypothesis h such that

  Pr_{x∼Q*}[h(x) ≠ c(x)] ≤ 1/2 − γ.

Combining these facts gives that v ≤ 1/2 − γ.

On the other hand, the left part of Eq. (6) implies that there exists a distribution P* over the hypothesis space H such that for every x ∈ X:

  M(P*, x) = Pr_{h∼P*}[h(x) ≠ c(x)] ≤ v ≤ 1/2 − γ < 1/2.

That is, every instance x is misclassified by less than 1/2 of the hypotheses (as weighted by P*). Therefore, the target concept c is functionally equivalent to a weighted majority of hypotheses in H.

To summarize this discussion, we have argued that if (H, c) are γ-weakly learnable, then c can be computed exactly as a weighted majority of hypotheses in H. Moreover, the weights used in this function (defined by distribution P* above) are not just any old weights, but rather are a minmax strategy for the game M.

A similar proof technique was previously used by Goldmann, Håstad and Razborov [12] to prove a result about the representation power of circuits of weighted threshold gates.

4.2 IDEA FOR BOOSTING

The idea of our boosting algorithm then is to approximate c by approximating the weights of this function. Since these weights are a minmax strategy of the game M, we might hope to apply the method described in Section 2.4 for approximately solving a game.

The problem is that the resulting algorithm does not fit the boosting model. Recall that on each round, algorithm LW computes a distribution over the rows of the game matrix (hypotheses, in the case of matrix M). However, in the boosting model, we want to compute on each round a distribution over instances (columns of M).

Since we have an algorithm which computes distributions over rows, but need one that computes distributions over columns, the obvious solution is to reverse the roles of rows and columns. This is exactly the approach that we follow. That is, rather than using game M directly, we construct the dual M′ of M, which is the identical game except that the roles of the row and column players have been reversed.

Constructing the dual M′ of a game M is straightforward. First, we need to reverse row and column, so we take the transpose Mᵀ. This, however, is not enough since the column player of M wants to maximize the outcome, but the row player of M′ wants to minimize the outcome (loss). Therefore, we also need to reverse the meaning of minimum and maximum, which is easily done by negating the matrix, yielding −Mᵀ. Finally, to adhere to our convention of losses being in the range [0, 1], we add the constant 1 to every outcome, which has no effect on the game. Thus, the dual M′ of M is simply

  M′ = 1 − Mᵀ

where 1 is an all-1's matrix of the appropriate dimensions. In the case of the mistake matrix M, the dual M′ now has rows and columns indexed by instances and hypotheses, respectively, and each entry is

  M′(x, h) = 1 if h(x) = c(x), and 0 otherwise.

Note that any minmax strategy of the game M becomes a maxmin strategy of the game M′. Therefore, whereas before we were interested in finding an approximate minmax strategy of M, we are now interested in finding an approximate maxmin strategy of M′.

of the row and column players have been reversed. hypothesis £ which is the (simple) majority of 1 .

¢ ¦ Constructing the dual ¦ of a game is straightfor- 4.3 ANALYSIS ward. First, we need to reverse row and column so we take T

the transpose ¦ . This, however, is not enough since the Indeed, the resulting boosting procedure will compute a ®nal



£  

column player of ¦ wants to maximize the outcome, but hypothesis identical to for suf®ciently large . We ¢

the row player of ¦ wants to minimize the outcome (loss). show in this section how this follows from Corollary 2.

Therefore, we also need to reverse the meaning of minimum As noted earlier, for all  ,

¦

¢

  

and maximum which is easily done by negating the matrix 1

§

£ ¤  ¨ ¦¤    £  ¨

T M P Pr¤ ¦

 2  yielding . Finally, to adhere to our convention of losses P ©

6  Therefore, by Corollary 2, Input: instance space ¡ and target

 -weak learning algorithm

¡ ¢

¢ 4

¢ ¢

  

∆   ¨ § ¡ § ¢

1 1 1 Set ¡ 2 ln (so that ).

  ¦ £  ¦ ¤ £  ∆

 min

¦

 

2 

 

¨   § ¡ §   

1 1 Set 1 1 2 ln .

¤ ¨  § ¡©§ ¤ ¡

Let  1 1 for . 

Therefore, for all ¤ ,

¨    

For 1 :





¢



¢ 

Pass distribution to weak learner.

1 1 1

¦¤ £      

¦ ∆ 7 

2 2 





1 Get back hypothesis £ such that ¦

 1

  £ ¤  ¨  ¤ §  

where the last inequality holds for suf®ciently large (specif- ¤Pr ¢

 2

©

¡

¦

ically, when ∆  ). Note that, by de®nition of ,

¢

 









¦¤ £  £ ¦ is exactly the number of hypotheses which

1 Update  :

 ¤

agree with on instance . Therefore, in words, Eq. (7) says



¦





 ¢

 ¦¤ 

£ ¦¤  ¨ ¦¤  ¤

that more than half the hypotheses £ are correct on . There- if

 ¤

¦¤  ¨ £

 1

 £  ¤  ¨  ¤ 

fore, by de®nition of £ , we have that for 1 otherwise

¤  all . £

¡ where is a normalization constant (chosen so that

 ¢  For this to hold, we need only that ∆ , which will

Ω 2  1 will be a distribution). ¨ § ¡ § ¢  be the case for  ln .

The resulting boosting algorithm, in which the game-



¨ £    £  Output ®nal hypothesis £ MAJ 1 . playing subroutine LW has been ªcompiled outº is shown in

Fig. 1. The algorithm is actually quite intuitive in this form: Figure 1: The boosting algorithm. 

after each hypothesis £ is observed, the weight associated
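The figure translates almost line for line into the following Python sketch (our rendering, not code from the paper; weak_learner is an assumed callable satisfying the γ-weak learning guarantee for whatever distribution it is given):

```python
import numpy as np

def boost(X, c, weak_learner, gamma):
    """Boosting by majority vote, following Fig. 1.

    X: list of instances; c: target function X -> {0,1};
    weak_learner(D): assumed to return h with error <= 1/2 - gamma
    when examples are distributed according to D.
    """
    N = len(X)
    T = int(np.ceil(4.0 * np.log(N) / gamma ** 2))   # so that Delta_T < gamma
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(N) / T))
    D = np.full(N, 1.0 / N)                          # D_1 is uniform on X
    hs = []
    for _ in range(T):
        h = weak_learner(D)
        hs.append(h)
        correct = np.array([h(x) == c(x) for x in X])
        D = np.where(correct, beta * D, D)           # shrink weights h got right
        D = D / D.sum()                              # normalize (the Z_t step)
    def h_fin(x):                                    # simple majority vote
        return int(sum(h(x) for h in hs) > T / 2)
    return h_fin
```

Only the instances' weights are touched, so the same code runs unchanged when a finite training set stands in for X, as discussed next.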

In practice, of course, the booster would not have access to the labels associated with the entire domain X. Rather, the booster would be given a labeled training set, and all distributions would be computed over the training set. The generalization error of the final hypothesis can then be bounded using, for instance, standard "VC theory" (see Freund and Schapire [9] for more details).

A more sophisticated version of this algorithm, called AdaBoost, is given by Freund and Schapire [9]. The advantage of this version is that the learner does not need to know a priori the minimum accuracy rate of each weak hypothesis.

5 SUMMARY

In sum, we have shown how the two well-studied learning problems of on-line prediction and boosting can be cast in a single game-theoretic framework in which the two seemingly very different problems can be viewed as "duals" of one another.

We hope that the insight offered by this connection will help in the development and understanding of such learning algorithms, since an algorithm for one problem may, in principle, be translated into an algorithm for the other. As a concrete example, the boosting algorithm described in this paper was derived from Littlestone and Warmuth's weighted majority algorithm by following this dual connection.

Acknowledgments

Thanks to Manfred Warmuth for many helpful discussions, and to Dean Foster and Rakesh Vohra for pointing us to relevant literature.

References

[1] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1-8, Spring 1956.

[2] Nicolò Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 382-391, 1993.

[3] Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, 1967.

[4] Dean Foster and Rakesh Vohra. Regret in on-line decision making. Unpublished manuscript, 1996.

[5] Dean Foster and Rakesh V. Vohra. Asymptotic calibration. Unpublished manuscript, 1995.

[6] Dean P. Foster and Rakesh V. Vohra. A randomization rule for selecting forecasts. Operations Research, 41(4):704-709, July-August 1993.

[7] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.

[8] Yoav Freund. Predicting a binary sequence almost as well as the optimal biased coin. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.

[9] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23-37. Springer-Verlag, 1995. A draft of the journal version is available electronically (on our web pages, or by email request).

[10] Drew Fudenberg and David K. Levine. Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19:1065-1089, 1995.

[11] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.

[12] Mikael Goldmann, Johan Håstad, and Alexander Razborov. Majority gates vs. general weighted threshold gates. Computational Complexity, 2:277-300, 1992.

[13] James Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97-139. Princeton University Press, 1957.

[14] Nick Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.

[15] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

[16] Serge A. Plotkin, David B. Shmoys, and Éva Tardos. Fast approximation algorithms for fractional packing and covering problems. Mathematics of Operations Research, 20(2):257-301, May 1995.

[17] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

[18] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 371-383, 1990.

A PROOF OF THEOREM 1

For t = 1, ..., T, we have that

  Σ_i w_{t+1}(i) = Σ_i w_t(i) β^{M(i, Q_t)}
                 ≤ Σ_i w_t(i) (1 − (1 − β) M(i, Q_t))
                 = (Σ_i w_t(i)) (1 − (1 − β) M(P_t, Q_t)).

The first line uses the definition of w_{t+1}. The second line follows from the fact that β^x ≤ 1 − (1 − β)x for β ≥ 0 and x ∈ [0, 1]. The last line uses the definition of P_t.

Unwrapping this simple recurrence gives

  Σ_i w_{T+1}(i) ≤ n ∏_{t=1}^T (1 − (1 − β) M(P_t, Q_t)).      (8)

(Recall that w_1(i) = 1 for all i, so Σ_i w_1(i) = n.)

Next, note that, for any row i,

  Σ_{i'} w_{T+1}(i') ≥ w_{T+1}(i) = β^{Σ_{t=1}^T M(i, Q_t)}.

Combining with Eq. (8) and taking logs gives

  (Σ_{t=1}^T M(i, Q_t)) ln β ≤ ln n + Σ_{t=1}^T ln(1 − (1 − β) M(P_t, Q_t))
                             ≤ ln n − (1 − β) Σ_{t=1}^T M(P_t, Q_t)

since ln(1 − x) ≤ −x for x < 1. Rearranging terms, and noting that this expression holds for any i, gives

  Σ_{t=1}^T M(P_t, Q_t) ≤ a_β min_i Σ_{t=1}^T M(i, Q_t) + c_β ln n.

Since the minimum (over mixed strategies P) in the bound of the theorem must be achieved by a pure strategy i, this implies the theorem.