Multi-agent learning: Fictitious Play

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.


Fictitious play: motivation

• Rather than considering your own payoffs, monitor the behaviour of your opponent(s), and respond optimally.
• The behaviour of an opponent is projected on a mixed strategy (the empirical distribution of its past actions).
• Brown (1951): proposed fictitious play as an explanation for equilibrium play. In terms of current use, the name is a bit of a misnomer, since play actually occurs (Berger, 2005).
• One of the most important, if not the most important, representatives of model-based learning.


Plan for today

Part I. Best reply
1. Pure fictitious play.
2. Results that connect pure fictitious play to Nash equilibria.

Part II. Extensions and approximations of fictitious play
1. Smoothed fictitious play.
2. Exponentiated regret matching.
3. No-regret property of smoothed fictitious play (Fudenberg & Levine, 1995).
4. Convergence of better-reply strategies when players have limited memory and are inert [tend to stick to their current strategy] (Young, 1998).

Shoham et al. (2009): Multi-agent Systems. Ch. 7: “Learning and Teaching”.

H. Young (2004): Strategic Learning and its Limits, Oxford UP.

D. Fudenberg and D.K. Levine (1998), The Theory of Learning in Games, MIT Press.


Part I: Pure fictitious play


Repeated coordination game

Players receive payoff p > 0 iff they coordinate. This game possesses three Nash equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1).

    Round   A's action   B's action   A's beliefs   B's beliefs
    0.      –            –            (0.0, 0.0)    (0.0, 0.0)
    1.      L*           R*           (0.0, 1.0)    (1.0, 0.0)
    2.      R            L            (1.0, 1.0)    (1.0, 1.0)
    3.      L*           R*           (1.0, 2.0)    (2.0, 1.0)
    4.      R            L            (2.0, 2.0)    (2.0, 2.0)
    5.      R*           R*           (2.0, 3.0)    (2.0, 3.0)
    6.      R            R            (2.0, 4.0)    (2.0, 4.0)
    7.      R            R            (2.0, 5.0)    (2.0, 5.0)
    ...
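To make the update mechanics concrete, here is a minimal Python sketch of pure fictitious play in this coordination game. It is not part of the slides: the payoff p = 1, the zero initial belief counts, and the deterministic tie-breaking are assumptions, so the run it prints settles on (L, L) immediately; the asterisked rounds in the table above correspond to ties being broken differently.

```python
# Minimal sketch of pure two-player fictitious play in the repeated coordination game.
# Assumptions not in the slides: p = 1, zero initial belief counts, ties broken in favour of L.

ACTIONS = ["L", "R"]

def payoff(own, other):
    # Coordination: payoff 1 iff both players choose the same action.
    return 1.0 if own == other else 0.0

def best_reply(counts):
    # Best reply against the opponent's empirical distribution of past play.
    total = sum(counts.values()) or 1.0
    def expected(a):
        return sum(c / total * payoff(a, b) for b, c in counts.items())
    return max(ACTIONS, key=expected)   # max() keeps the first action on ties

beliefs_A = {a: 0.0 for a in ACTIONS}   # A's counts of B's past actions
beliefs_B = {a: 0.0 for a in ACTIONS}   # B's counts of A's past actions

for t in range(1, 9):
    act_A, act_B = best_reply(beliefs_A), best_reply(beliefs_B)
    beliefs_A[act_B] += 1               # update the opponent models
    beliefs_B[act_A] += 1
    print(f"round {t}: A plays {act_A}, B plays {act_B}, "
          f"A's beliefs {beliefs_A}, B's beliefs {beliefs_B}")
```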


Steady states are pure (but possibly weak) Nash equilibria

Definition (Steady state). An action profile a is a steady state (or absorbing state) of fictitious play if, whenever a is played at round t, it is inevitably also played at round t + 1.

Theorem. If a pure strategy profile is a steady state of fictitious play, then it is a (possibly weak) Nash equilibrium in the stage game.

Proof. Suppose a = (a_1, ..., a_n) is a steady state. Consequently, i's opponent model converges to a_{-i}, for all i. By definition of fictitious play, i plays best responses to a_{-i}, i.e.,

    ∀i : a_i ∈ BR(a_{-i}).

The latter is precisely the definition of a Nash equilibrium. □

Still, the resulting Nash equilibrium is often strict, because for weak equilibria the process is likely to drift due to alternative best responses.


Pure strict Nash equilibria are steady states

Theorem. If a pure strategy profile is a strict Nash equilibrium of a stage game, then it is a steady state of fictitious play in the repeated game.

Notice the use of terminology: "pure strategy profile" for Nash equilibria; "action profile" for steady states.

Proof. Suppose a is a pure Nash equilibrium and a_i is played at round t, for all i. Because a is strict, a_i is the unique best response to a_{-i}. Because this argument holds for each i, action profile a will be played in round t + 1 again. □

Summary of the two theorems: Pure strict Nash ⇒ Steady state ⇒ Pure Nash. But what if pure Nash equilibria do not exist?


Repeated game of Matching Pennies

Zero-sum game. A's goal is to have the pennies matched; B's goal is the opposite.

    Round   A's action   B's action   A's beliefs   B's beliefs
    0.      –            –            (1.5, 2.0)    (2.0, 1.5)
    1.      T            T            (1.5, 3.0)    (2.0, 2.5)
    2.      T            H            (2.5, 3.0)    (2.0, 3.5)
    3.      T            H            (3.5, 3.0)    (2.0, 4.5)
    4.      H            H            (4.5, 3.0)    (3.0, 4.5)
    5.      H            H            (5.5, 3.0)    (4.0, 4.5)
    6.      H            H            (6.5, 3.0)    (5.0, 4.5)
    7.      H            T            (6.5, 4.0)    (6.0, 4.5)
    8.      H            T            (6.5, 5.0)    (7.0, 4.5)
    ...


Convergent empirical distribution of strategies

Theorem. If the empirical distribution of each player's strategies converges in fictitious play, then it converges to a Nash equilibrium.

Proof. Same as before. If the empirical distributions converge to q, then i's opponent model converges to q_{-i}, for all i. By definition of fictitious play, q_i ∈ BR(q_{-i}). Because of convergence, all such (mixed) best replies remain the same. By definition we have a Nash equilibrium. □

Remarks:
1. The q_i may be mixed.
2. It actually suffices that the q_{-i} converge asymptotically to the actual distribution (Fudenberg & Levine, 1998).
3. If empirical distributions converge (hence, converge to a Nash equilibrium), the actually played responses per stage need not be Nash equilibria of the stage game.


Empirical distributions converge to Nash ⇏ stage play is Nash

Repeated Coordination Game. Players receive payoff p > 0 iff they coordinate.

    Round   A's action   B's action   A's beliefs   B's beliefs
    0.      –            –            (0.5, 1.0)    (1.0, 0.5)
    1.      B            A            (1.5, 1.0)    (1.0, 1.5)
    2.      A            B            (1.5, 2.0)    (2.0, 1.5)
    3.      B            A            (2.5, 2.0)    (2.0, 2.5)
    4.      A            B            (2.5, 3.0)    (3.0, 2.5)
    ...

• This game possesses three equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1), with expected payoffs p, p/2, and p, respectively.
• The empirical distribution of play converges to (0.5, 0.5), but with realised payoff 0 rather than p/2.


Empirical distribution of play does not need to converge

Rock-paper-scissors. Winner receives payoff p > 0; else, payoff zero.

• Rock-paper-scissors with these payoffs is known as the Shapley game.
• The Shapley game possesses one equilibrium, viz. (1/3, 1/3, 1/3), with expected payoff p/3.

    Round   A's action   B's action   A's beliefs        B's beliefs
    0.      –            –            (0.0, 0.0, 0.5)    (0.0, 0.5, 0.0)
    1.      Rock         Scissors     (0.0, 0.0, 1.5)    (1.0, 0.5, 0.0)
    2.      Rock         Paper        (0.0, 1.0, 1.5)    (2.0, 0.5, 0.0)
    3.      Rock         Paper        (0.0, 2.0, 1.5)    (3.0, 0.5, 0.0)
    4.      Scissors     Paper        (0.0, 3.0, 1.5)    (3.0, 0.5, 1.0)
    5.      Scissors     Paper        (0.0, 4.0, 1.5)    (3.0, 0.5, 2.0)
    ...


Repeated Shapley Game: Phase Diagram

[Figure: phase diagram of fictitious play in the repeated Shapley game, drawn on the simplex with vertices Rock, Paper, and Scissors.]


Part II: Extensions and approximations of fictitious play


Proposed extensions to fictitious play

Build forecasts not on the complete history, but on:
• recent data, say the m most recent rounds;
• discounted data, say with discount factor γ;
• perturbed data, say with error ǫ on individual observations;
• random samples of historical data, say random samples of size m.

Give not necessarily best responses, but respond:
• ǫ-greedy;
• perturbed throughout, with small random shocks;
• randomly, and proportional to expected payoff.


Framework for predictive learning (like fictitious play)

A forecasting rule for player i is a function that maps a history to a probability distribution over the opponents' actions in the next round:

    f_i : H → ∆(X_{-i}).

A response rule for player i is a function that maps a history to a probability distribution over i's own actions in the next round:

    g_i : H → ∆(X_i).

A predictive learning rule for player i is the combination of a forecasting rule and a response rule. This is typically written as (f_i, g_i).
• This framework can be attributed to J.S. Jordan (1993).
• Forecasting and response functions are deterministic.
• Reinforcement and regret learning do not fit: they are not involved with prediction.


Forecasting and response rules for fictitious play

Let h^t ∈ H^t be a history of play up to and including round t, and let

    φ^{jt} =_Def the empirical distribution of j's actions up to and including round t.

Then the fictitious play forecasting rule is given by

    f_i(h^t) =_Def ∏_{j≠i} φ^{jt}.

Let f_i be a fictitious play forecasting rule. Then g_i is said to be a fictitious play response rule if all its values are best responses to the corresponding values of f_i.

Remarks:
1. Player i attributes a mixed strategy φ^{jt} to player j. This strategy reflects the number of times each action is played by j.
2. The mixed strategies are assumed to be independent.
3. Both (1) and (2) are simplifying assumptions.
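As a small sketch of the forecasting rule above (not from the slides; the three-player history is invented), the following fragment computes the empirical distributions φ^{jt} and combines them into the product forecast over the opponents' joint actions:

```python
from collections import Counter
from itertools import product

# Hypothetical 3-player history: each entry is one round's action profile (x1, x2, x3).
history = [("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d"), ("a", "c", "d")]

def empirical(history, j):
    """phi^{jt}: empirical distribution of player j's actions so far."""
    counts = Counter(profile[j] for profile in history)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

def forecast(history, i):
    """f_i(h^t): product of the opponents' empirical distributions,
    i.e. a distribution over the opponents' joint actions."""
    opponents = [j for j in range(len(history[0])) if j != i]
    dists = [empirical(history, j) for j in opponents]
    joint = {}
    for combo in product(*(d.items() for d in dists)):
        actions = tuple(a for a, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[actions] = prob
    return joint

print(forecast(history, i=0))   # player 1's forecast of (player 2, player 3)'s joint play
```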


Smoothed fictitious play

Notation:
    p^{-i}            : strategy profile of the opponents as predicted by f_i in round t.
    u_i(x_i, p^{-i})  : expected utility of action x_i, given p^{-i}.
    q^i               : strategy of player i in round t + 1, i.e., g_i(h).

Task: define q^i, given p^{-i} and u_i(x_i, p^{-i}).
Idea: respond randomly, but (somehow) proportional to expected payoff.

Elaborations of this idea:

a) Strictly proportional:

    q^i(x_i | p^{-i}) =_Def u_i(x_i, p^{-i}) / ∑_{x_i' ∈ X_i} u_i(x_i', p^{-i}).

b) Through what is called mixed logit:

    q^i(x_i | p^{-i}) =_Def e^{u_i(x_i, p^{-i})/γ_i} / ∑_{x_i' ∈ X_i} e^{u_i(x_i', p^{-i})/γ_i}.
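A small sketch of the two response rules (not from the slides; the payoff vector is invented), illustrating how the logit rule moves from a nearly uniform to a nearly best-reply response as γ_i shrinks:

```python
import math

def proportional(utilities):
    """Strictly proportional response: q(x) = u(x) / sum_x' u(x') (assumes u >= 0)."""
    total = sum(utilities.values())
    return {x: u / total for x, u in utilities.items()}

def logit(utilities, gamma):
    """Mixed-logit (smoothed) response: q(x) proportional to exp(u(x) / gamma)."""
    m = max(utilities.values())                       # subtract max for numerical stability
    weights = {x: math.exp((u - m) / gamma) for x, u in utilities.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical expected payoffs u_i(x_i, p^{-i}) against the forecast p^{-i}.
u = {"Rock": 0.2, "Paper": 0.5, "Scissors": 0.3}

print("proportional:", proportional(u))
print("logit, gamma = 100 :", logit(u, 100))   # nearly uniform
print("logit, gamma = 1   :", logit(u, 1))
print("logit, gamma = 0.01:", logit(u, 0.01))  # nearly the pure best reply (Paper)
```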


Mixed logit, or quantal response function

• Let d_1 + ··· + d_n = 1 and d_j ≥ 0. Then

    logit(d_i) =_Def e^{d_i/γ} / ∑_j e^{d_j/γ},   where γ > 0.

• The logit function can be seen as a smoothed maximum on n variables:
    γ ↓ 0 : logit "shares" 1 among all maximal d_i;
    γ = 1 : logit is strictly proportional;
    γ → ∞ : logit "spreads" 1 among all d_i evenly.
Mixed logit can be justified in different ways:

a) On the basis of information-theoretic (entropy) arguments.

b) By assuming the d_j are subject to i.i.d. random perturbations (a.k.a. the random utility model).

Anderson et al. (1992): Discrete Choice Theory of Product Differentiation. Sec. 2.6.1: “Derivation of the Logit”.


Spectrum: evenly mixed (γ → ∞)  ⟷  mixed logit  ⟷  best response only (γ ↓ 0).

As you see, mixed logit respects best replies, but leaves room for experimentation.


Digression: Coding theory and entropy

This digression tries to answer the following question:

    Why does play according to a diversified strategy yield more information than play according to a strategy where only a few options are played?

• To send 8 different binary encoded messages would cost 3 bits. Encoded messages are 000, 001, ..., 111.
• To encode 16 different messages, we would need log_2 16 = 4 bits.
• To encode 20 different messages, we would need ⌈log_2 20⌉ = ⌈4.32⌉ = 5 bits.
• If some messages are sent more frequently than others, it pays off to search for a code such that messages that occur more frequently are represented by short code words (at the expense of messages that are sent less frequently, which must then be represented by the remaining longer code words).


Coding theory and entropy (continued)

Example. Suppose persons A and B work on a dark terrain. They are separated, and can only communicate in Morse code through a flashlight.

A and B have agreed to send only the following messages. A possible encoding (Code 1) could be:

    m1  Yes                   m1 ↔ 00
    m2  No                    m2 ↔ 01
    m3  All well?             m3 ↔ 10
    m4  Shall I come over?    m4 ↔ 11


Coding theory and entropy (continued)

Another encoding (Code 2) could be:

    m1 ↔ 0
    m2 ↔ 10
    m3 ↔ 110
    m4 ↔ 111

To prevent ambiguity, no code word may be a prefix of some other code word. A useless coding (Code 3) would be:

    m1 ↔ 0
    m2 ↔ 1
    m3 ↔ 00
    m4 ↔ 01

• Under Code 3, the sequence 0101 may mean different things, such as m1, m2, m1, m2, or m1, m2, m4. (There are still other possibilities.)
• The objective is to search for an efficient encoding, i.e., an encoding that minimises the expected number of bits per message.
• If the probability distribution of messages is known, we can compute for every code the expected number of bits per message, hence its efficiency.


Coding theory and entropy [end of digression]

The following would be a plausible probability distribution:

    m1  Yes                   1/2
    m2  No                    1/4
    m3  All well?             1/8
    m4  Shall I come over?    1/8

For Code 2,

    E[number of bits] = 1 · 1/2 + 2 · 1/4 + 3 · 1/8 + 3 · 1/8 = 1.75.

For Code 1, the expected number of bits is 2.0. Therefore, Code 2 is more efficient than Code 1.

Theorem (Noiseless Coding Theorem, Shannon).

    p_1 log_2(1/p_1) + ··· + p_n log_2(1/p_n)

is a lower bound for the expected number of bits in an encoding of n messages with expected occurrence (p_1, ..., p_n). This number is called the entropy of (p_1, ..., p_n). Alternatively, entropy is

    −[p_1 log_2(p_1) + ··· + p_n log_2(p_n)].

The entropy of (1/2, 1/4, 1/8, 1/8) equals 1.75, which Code 2 attains. Therefore, Code 2 is optimal.
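The arithmetic of the digression can be checked in a few lines of Python (a sketch, not part of the slides):

```python
import math

# Message probabilities and two of the codes from the digression.
probs = {"m1": 1/2, "m2": 1/4, "m3": 1/8, "m4": 1/8}
code1 = {"m1": "00", "m2": "01", "m3": "10", "m4": "11"}
code2 = {"m1": "0", "m2": "10", "m3": "110", "m4": "111"}

def expected_length(code, probs):
    """Expected number of bits per message under the given code."""
    return sum(probs[m] * len(word) for m, word in code.items())

def entropy(probs):
    """Shannon entropy: lower bound on the expected number of bits per message."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

print(expected_length(code1, probs))   # 2.0
print(expected_length(code2, probs))   # 1.75
print(entropy(probs))                  # 1.75 -> Code 2 attains the bound, hence optimal
```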


Smoothed fictitious play (Fudenberg & Levine, 1995)

Smoothed fictitious play is a generalisation of mixed logit. Let

    w_i : ∆_i → R

be a function that "grades" i's probability distributions (over actions) under the following conditions.

1. Grading is smooth (w_i is infinitely often differentiable).
2. Grading is strictly concave (bump), in such a manner that ‖∇w_i(q^i)‖ → ∞ (steep) whenever q^i approaches the boundary of ∆_i (whenever distributions become extremely uneven).

Let

    U_i(q^i, p^{-i}) =_Def u_i(q^i, p^{-i}) + γ_i · w_i(q^i).

Let f_i be fictitious play forecasting and let g_i correspond to a best response based on U_i. Then (f_i, g_i) is called smoothed fictitious play with smoothing function w_i and smoothing parameter γ_i.
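The slide leaves the smoothing function abstract. A standard choice (an assumption here, not stated above) is the entropy w_i(q) = −∑_x q(x) ln q(x), which satisfies the two grading conditions on the interior of ∆_i, and for which the maximiser of U_i is exactly the mixed-logit rule of the previous slides. A crude numeric check for two actions, using a grid search rather than a solver:

```python
import math

u = {"L": 1.0, "R": 0.4}     # hypothetical expected payoffs u_i(x_i, p^{-i})
gamma = 0.3                  # smoothing parameter gamma_i

def smoothed_objective(qL):
    """U_i(q, p^{-i}) = u_i(q, p^{-i}) + gamma * entropy(q), for q = (qL, 1 - qL)."""
    q = {"L": qL, "R": 1.0 - qL}
    utility = sum(q[x] * u[x] for x in q)
    entropy = -sum(p * math.log(p) for p in q.values() if p > 0)
    return utility + gamma * entropy

# Grid search for the maximiser of U_i over the simplex.
best_qL = max((i / 10000 for i in range(1, 10000)), key=smoothed_objective)

# Closed-form mixed-logit response for comparison.
zL, zR = math.exp(u["L"] / gamma), math.exp(u["R"] / gamma)
print(best_qL, zL / (zL + zR))   # both roughly 0.881
```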


Smoothed fictitious play limits regret

Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ǫ > 0. If a given player uses smoothed fictitious play with a sufficiently small smoothing parameter, then with probability one his regrets are bounded above by ǫ.

– Young does not reproduce the proof of Fudenberg & Levine, but shows that in this case ǫ-regret can be derived from a later and more general result of Hart and Mas-Colell in 2001.
– This later result identifies a large family of rules that eliminate regret, based on an extension of Blackwell's approachability theorem. (Roughly, Blackwell's approachability theorem generalises maxmin reasoning to vector-valued payoffs.)

Fudenberg & Levine, 1995. “Consistency and cautious fictitious play,” Journal of Economic Dynamics and Control, Vol. 19 (5-7), pp. 1065-1089.

Hart & Mas-Colell, 2001. “A General Class of Adaptive Strategies,” Journal of Economic Theory, Vol. 98(1), pp. 26-54.


Smoothed fictitious play converges to ǫ-CCE

Definition. A coarse correlated equilibrium (CCE) is a probability distribution on strategy profiles, q ∈ ∆(X), such that no player can opt out (to gain expected utility) before q is made known.

In a coarse correlated ǫ-equilibrium (ǫ-CCE), no player can opt out to gain more in expectation than ǫ.

Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ǫ > 0. If all players use smoothed fictitious play with sufficiently small smoothing parameters, then with probability one empirical play will converge to the set of coarse correlated ǫ-equilibria. Summary of the two theorems: smoothed fictitious play limits regret and converges to ǫ-CCE. There is another learning method with no regret and convergence to zero-CCE . . .


There are more Coarse Correlated Equilibria than Correlated Equilibria than Nash Equilibria

Simple coordination game:

              Other:
              Left     Right
    Left      (1, 1)   (0, 0)
    Right     (0, 0)   (1, 1)

[Figure: the sets of Nash equilibria, correlated equilibria, and coarse correlated equilibria of this game; in this picture, CCE = CE.]


Exponentiated regret matching

Let
    j                : an action, where 1 ≤ j ≤ k
    ū_i^t            : i's realised average payoff up to and including round t
    φ^{-it}          : the realised joint empirical distribution of i's opponents
    ū_i(j, φ^{-it})  : i's hypothetical average payoff for playing action j against φ^{-it}
    r̄^{it}           : player i's regret vector in round t, with components r̄_j^{it} = ū_i(j, φ^{-it}) − ū_i^t

Exponentiated regret matching (PY, p. 59) is defined as

    q_j^{i(t+1)} ∝ [r̄_j^{it}]_+^a,   where a > 0.

(For a = 1 we have ordinary regret matching.) An extended theorem on regret matching (Hart & Mas-Colell, 2001) ensures that individual players have no regret with probability one, and that the empirical distribution of play converges to the set of coarse correlated equilibria (PY, p. 60).
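A minimal sketch of this response rule (not from the slides; the regret vector and the uniform fallback when no regret is positive are assumptions):

```python
def exponentiated_regret_matching(regrets, a=2.0):
    """q_j proportional to [r_j]_+^a; with a = 1 this is ordinary regret matching."""
    weights = {j: max(r, 0.0) ** a for j, r in regrets.items()}
    total = sum(weights.values())
    if total == 0.0:                       # no positive regret: play uniformly
        n = len(weights)                   # (a convention, not specified on the slide)
        return {j: 1.0 / n for j in weights}
    return {j: w / total for j, w in weights.items()}

# Hypothetical average regret vector of player i at round t.
regrets = {"Rock": 0.3, "Paper": -0.1, "Scissors": 0.1}

print(exponentiated_regret_matching(regrets, a=1.0))  # ordinary regret matching: 0.75 / 0 / 0.25
print(exponentiated_regret_matching(regrets, a=2.0))  # exponent 2: 0.9 / 0 / 0.1
print(exponentiated_regret_matching(regrets, a=8.0))  # large a: close to the best reply (Rock)
```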


FP vs. Smoothed FP vs. Exponentiated RM

Fictitious play: plays best responses.
• Does depend on past play of opponent(s).
• Puts zero probabilities on sub-optimal responses.

Smoothed fictitious play: plays sub-optimal responses, e.g., softmax-proportional to their estimated payoffs.
• Does depend on past play of opponent(s).
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when γ_i ↓ 0 (PY, p. 84).

Exponentiated regret matching: plays sub-optimal responses, proportional to a power of positive regret.
• Does depend on own past payoffs.
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when the exponent a → ∞ (PY, p. 84).


FP vs. Smoothed FP vs. Exponentiated RM

                                       FP         Smoothed FP                 Exponentiated RM
Depends on past play of opponents      √          √                           −
Depends on own past payoffs            −          −                           √
Puts zero probabilities on
sub-optimal responses                  √          −                           −
Best response                          √          when smoothing              when exponent
                                                  parameter γ_i ↓ 0           a → ∞
Individual no-regret                   −          within ǫ > 0, almost        exact, almost
                                                  always (PY, p. 82)          always (PY, p. 60)
Collective convergence to coarse       −          within ǫ > 0, almost        exact, almost
correlated equilibria                             always (PY, p. 83)          always (PY, p. 60)


Part III: Finite memory and inertia


Finite memory: motivation

• In their basic version, most learning rules rely on the entire history of play.
• People, as well as computers, have a finite memory. (On the other hand, for average or discounted payoffs this is no real problem.)
• Nevertheless: experiences in the distant past are apt to be less relevant than more recent ones.

Proposal: let players have a finite memory of m rounds.


Inertia: motivation

• When players' strategies are constantly re-evaluated, discontinuities in behaviour are likely to occur. Example: the asymmetric coordination game.
• Discontinuities in behaviour are less likely to lead to equilibria of any sort.

Proposal: let players play the same action as in the previous round with probability λ.


Weakly acyclic games

• Game G with action space X.
• The better-reply graph is G′ = (V, E), where V = X and

    E = { (x, y) | for some i : y_{-i} = x_{-i} and u_i(y_i, y_{-i}) > u_i(x_i, y_{-i}) }.

• For all x ∈ X: x is a sink iff x is a Nash equilibrium.

Example game:

    (4, 2)  (2, 4)  (1, 1)
    (2, 4)  (4, 2)  (1, 1)
    (1, 1)  (1, 1)  (3, 3)

• G is said to be weakly acyclic if every node is connected (by a directed path in the better-reply graph) to a sink.

• Weakly acyclic ⇒ ∃ a pure Nash equilibrium. (A sketch in code of these definitions follows below.)
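The better-reply graph and sink definitions above translate directly into code. A small sketch (not from the slides): it builds the better-reply graph of a two-player game given as a payoff dictionary, lists the sinks (pure Nash equilibria), and checks weak acyclicity; here it is run on the simple 2×2 coordination game from the CCE slide above.

```python
from itertools import product

def better_reply_graph(payoffs, action_sets):
    """Directed edges x -> y where exactly one player i switches to a strictly better reply."""
    nodes = list(product(*action_sets))
    edges = {x: [] for x in nodes}
    for x in nodes:
        for i, actions in enumerate(action_sets):
            for yi in actions:
                if yi == x[i]:
                    continue
                y = tuple(yi if j == i else x[j] for j in range(len(x)))
                if payoffs[y][i] > payoffs[x][i]:
                    edges[x].append(y)
    return edges

def sinks(edges):
    """Sinks (no outgoing better reply) = pure Nash equilibria."""
    return [x for x, succ in edges.items() if not succ]

def weakly_acyclic(edges):
    """Every node can reach some sink along better-reply edges."""
    target = set(sinks(edges))
    def reaches_sink(x, seen=()):
        if x in target:
            return True
        return any(reaches_sink(y, seen + (x,)) for y in edges[x] if y not in seen)
    return all(reaches_sink(x) for x in edges)

# Simple 2x2 coordination game (see the CCE slide above): both players want to match.
coord = {("L", "L"): (1, 1), ("L", "R"): (0, 0),
         ("R", "L"): (0, 0), ("R", "R"): (1, 1)}

edges = better_reply_graph(coord, [("L", "R"), ("L", "R")])
print(sinks(edges))           # [('L', 'L'), ('R', 'R')]
print(weakly_acyclic(edges))  # True
```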


Examples of weakly acyclic games

Coordination games: two-person games with identical actions for all players, where best responses are formed by the diagonal of the joint action space.

Potential games (Monderer and Shapley, 1996). There is a function ρ : X → R, called the potential function, such that for every player i and all action profiles x, y ∈ X:

    y_{-i} = x_{-i}  ⇒  u_i(y_i, y_{-i}) − u_i(x_i, x_{-i}) = ρ(y) − ρ(x).

(A mechanical check of this condition is sketched below.)

Example: congestion games. Why such games are weakly acyclic: the potential function increases along every better-reply path ⇒ better-reply paths cannot cycle ⇒ in finite graphs, every better-reply path must end, i.e., reach a sink.
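A sketch (not from the slides) of the mechanical check referred to above: it verifies that taking ρ equal to the common payoff is an exact potential for the 2×2 coordination game; the game and the candidate ρ are illustrative choices, not taken from the slides.

```python
from itertools import product

# 2x2 coordination game: both players receive 1 iff they coordinate.
payoffs = {("L", "L"): (1, 1), ("L", "R"): (0, 0),
           ("R", "L"): (0, 0), ("R", "R"): (1, 1)}
action_sets = [("L", "R"), ("L", "R")]

# Candidate potential: the common payoff (1 on the diagonal, 0 elsewhere).
rho = {x: 1 if x[0] == x[1] else 0 for x in product(*action_sets)}

def is_exact_potential(rho, payoffs, action_sets):
    """Check: for every unilateral deviation x -> y of player i,
    u_i(y) - u_i(x) equals rho(y) - rho(x)."""
    for x in product(*action_sets):
        for i, actions in enumerate(action_sets):
            for yi in actions:
                y = tuple(yi if j == i else x[j] for j in range(len(x)))
                if payoffs[y][i] - payoffs[x][i] != rho[y] - rho[x]:
                    return False
    return True

print(is_exact_potential(rho, payoffs, action_sets))   # True
```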


Weakly acyclic games under finite memory and inertia

Theorem. Let G be a finite weakly acyclic n-person game. Every better-reply process with finite memory and inertia converges to a pure Nash equilibrium of G.

Proof. (Outline.)
1. Let the state space Z be X^m.
2. A state x̄ ∈ X^m is called homogeneous if it consists of m identical action profiles x. Such a state is denoted by ⟨x⟩. Z∗ =_Def { homogeneous states }.
3. In a moment, it will be shown that the process will hit Z∗ infinitely often.
4. In a moment, it will be shown that the overall probability to play any action is bounded away from zero.
5. It can easily be seen that Absorbing = Z∗ ∩ Pure Nash.
6. In a moment, it will be shown that, due to weak acyclicity, inertia, and (4), the process eventually lands in an absorbing state which, due to (5), is a repeated pure Nash equilibrium. □


First claim: the process hits Z∗ infinitely often

Let inertia be determined by λ > 0.

Pr(all players play their previous action) = λ^n.

Hence,

Pr(all players play their previous action during m subsequent rounds) = λ^{nm}.

If all players play their previous action during m subsequent rounds, then the process arrives at a homogeneous state. But also conversely. Hence, for all t,

Pr(the process arrives at a homogeneous state in round t + m) = λ^{nm}.

Infinitely many disjoint histories of length m occur, hence infinitely many independent events “homogeneous at t + m” occur.

Apply the (second) Borel–Cantelli lemma: if {E_n}_n are independent events and ∑_{n=1}^∞ Pr(E_n) is unbounded, then Pr(an infinite number of the E_n occur) = 1. □


Second claim: every action is played with probability at least γ > 0

A better-reply learning method maps states (finite histories) to strategies (probability distributions on actions),

    γ_i : Z → ∆(X_i),

and possesses the following important properties: i) it is deterministic, and ii) every action is played with positive probability.

1. Hence, define γ_i = inf { γ_i(z)(x_i) | z ∈ Z, x_i ∈ X_i }. Since Z and X_i are finite, the "inf" is a "min", and γ_i > 0.
2. Similarly, define γ = inf { γ_i | 1 ≤ i ≤ n }. Since there are finitely many players, the "inf" is a "min", and γ > 0. □


Final claim: the overall probability to reach a sink from Z∗ is > 0

Suppose the process is in ⟨x⟩.

1. If x is pure Nash, we are done, because response functions are deterministic better replies.
2. If x is not pure Nash, there must be an edge x → y in the better-reply graph. Suppose this edge concerns action x_i of player i. We now know that the better reply y_i is played with probability at least γ, irrespective of player and state.

Further probabilities:
• All other players j ≠ i keep playing the same action: λ^{n−1}.
• Edge x → y is actually traversed: γλ^{n−1}.
• Profile y is maintained for another m − 1 rounds, so as to arrive at state ⟨y⟩: λ^{n(m−1)}.
• To traverse from ⟨x⟩ to ⟨y⟩: γλ^{n−1} · λ^{n(m−1)} = γλ^{nm−1}.
• The image ⟨x(1)⟩, ..., ⟨x(l)⟩ of a better-reply path x(1), ..., x(l) is followed to a sink: ≥ (γλ^{nm−1})^L, where L is the length of a longest better-reply path.

Since Z∗ is encountered infinitely often, the result follows. □


Summary

• With fictitious play, the behaviour of the opponent(s) is modelled by (or represented by, or projected on) a mixed strategy.
• Fictitious play ignores sub-optimal actions.
• There is a family of so-called approximations of fictitious play (such as smoothed fictitious play and exponentiated regret matching), that i) play sub-optimal actions, and ii) can be brought arbitrarily close to fictitious play.
• In weakly acyclic n-person games, every better-reply process with finite memory and inertia converges to a pure Nash equilibrium.


What next?

Bayesian play:
• With fictitious play, the behaviour of opponents is modelled by a mixed strategy.
• With Bayesian play, opponents are modelled by a probability distribution over strategies (a prior belief).

Gradient dynamics:
• Like fictitious play, players model (or assess) each other through mixed strategies.
• Strategies are not played, only maintained.
• Due to CKR (common knowledge of rationality, cf. Hargreaves Heap & Varoufakis, 2004), all models of mixed strategies are correct. (I.e., q^{-i} = s^{-i}, for all i.)
• Players gradually adapt their mixed strategies through hill-climbing in the payoff space.
