Multi-agent learning: Repeated games


Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Last modified on February 9th, 2012 at 17:15.

Repeated games: motivation

1. Much interaction in multi-agent systems can be modelled through games.
2. Much learning in multi-agent systems can therefore be modelled through learning in games.
3. Learning in games usually takes place through the (gradual) adaptation of strategies (hence, behaviour) in a repeated game.
4. In most repeated games, one game (a.k.a. stage game) is played repeatedly. Possibilities:
   • A finite number of times.
   • An indefinite (same: indeterminate) number of times.
   • An infinite number of times.
5. Therefore, familiarity with the basic concepts and results from the theory of repeated games is essential to understand multi-agent learning.


Plan for today

• NE in normal form games that are repeated a finite number of times.
  – Principle of backward induction.

• NE in normal form games that are repeated an indefinite number of times.
  – Discount factor. Models the probability of continuation.
  – Folk theorem. (Actually many Folk Theorems.) Repeated games generally do have infinitely many Nash equilibria.
  – Trigger strategies, on-path vs. off-path play, the threat to "minmax" an opponent.

This presentation draws heavily on (Peters, 2008).*

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.


Example 1: Nash equilibria in playing the PD twice

Prisoners’ Dilemma         Other: Cooperate    Defect
You: Cooperate             (3, 3)              (0, 5)
     Defect                (5, 0)              (1, 1)

• Even if mixed strategies are allowed, the PD possesses one Nash equilibrium, viz. (D, D) with payoffs (1, 1).
• This equilibrium is Pareto sub-optimal. (Because (3, 3) makes both players better off.)
• Does the situation change if two parties get to play the Prisoners’ Dilemma two times in succession?
• The following diagram (hopefully) shows that playing the PD two times in succession does not yield an essentially new NE.


Example 1: Nash equilibria in playing the PD twice (2)

[Figure: game tree of the twice-repeated PD. The root (0, 0) branches into the four first-round action profiles CC, CD, DC, DD with payoffs (3, 3), (0, 5), (5, 0), (1, 1); each of these branches again into CC, CD, DC, DD, giving the sixteen cumulative payoffs listed in the normal form on the next slide.]


Example 1: Nash equilibria in playing the PD twice (3)

In normal form:

                Other: CC    CD        DC        DD
    You: CC     (6, 6)       (3, 8)    (3, 8)    (0, 10)
         CD     (8, 3)       (4, 4)    (5, 5)    (1, 6)
         DC     (8, 3)       (5, 5)    (4, 4)    (1, 6)
         DD     (10, 0)      (6, 1)    (6, 1)    (2, 2)

• The action profile (DD, DD) is the only Nash equilibrium.
• With 3 successive games, we obtain a 2^3 × 2^3 matrix, where the action profile (DDD, DDD) still would be the only Nash equilibrium.
• Generalise to N repetitions: (D^N, D^N) still is the only Nash equilibrium in a repeated game where the PD is played N times in succession. (See the sketch below.)
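The claim generalises mechanically. Below is a minimal sketch (not from the slides; stage-game payoffs as above) that enumerates the history-independent strategies (fixed action sequences) of the N-times repeated PD — exactly the rows and columns of the matrices above — and checks which profiles are Nash equilibria.

```python
# Minimal sketch (not from the slides): build the normal form of the N-times
# repeated PD over fixed action sequences and list its pure Nash equilibria.
from itertools import product

STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def repeated_payoff(seq_row, seq_col):
    """Total payoffs when both players commit to fixed action sequences."""
    row = sum(STAGE[(a, b)][0] for a, b in zip(seq_row, seq_col))
    col = sum(STAGE[(a, b)][1] for a, b in zip(seq_row, seq_col))
    return row, col

def pure_nash(n):
    seqs = list(product('CD', repeat=n))
    equilibria = []
    for s_row, s_col in product(seqs, repeat=2):
        u_row, u_col = repeated_payoff(s_row, s_col)
        # no profitable unilateral deviation for either player?
        row_ok = all(repeated_payoff(dev, s_col)[0] <= u_row for dev in seqs)
        col_ok = all(repeated_payoff(s_row, dev)[1] <= u_col for dev in seqs)
        if row_ok and col_ok:
            equilibria.append((''.join(s_row), ''.join(s_col)))
    return equilibria

print(pure_nash(2))   # [('DD', 'DD')]
print(pure_nash(3))   # [('DDD', 'DDD')]
```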


Backward induction (version for repeated games)

• Suppose G is a game in normal form for p players, where all players possess the same arsenal of possible actions A = {a1,..., am}.

• The game G^n arises by playing the stage game G n times in succession.
• A history h of length k is an element of (A^p)^k, e.g., for p = 3 and k = 10,

  Player 1:  a7 a5 a3 a6 a1 a9 a2 a7 a7 a3
  Player 2:  a6 a9 a2 a4 a2 a9 a9 a1 a1 a4
  Player 3:  a1 a2 a7 a9 a6 a1 a1 a8 a2 a4

  is a history of length ten in a game with three players. The set of all possible histories is denoted by H. (Hence, the number of histories of length k is |Hk| = m^(kp).)
• A (possibly mixed) strategy for one player is a function H → Pr(A).


Backward induction (version for repeated games) (2)

• For some repeated games of length n, the dominating (read: "clearly best") strategy for all players in round n (the last round) does not depend on the history of play. E.g., for the Prisoners’ Dilemma in the last round: "No matter what happened in rounds 1 . . . n − 1, I am better off playing D."
• Fixed strategies (D, D) in round n determine play after round n − 1.
• Independence of history, plus a determined future, leads to the following justification for playing D in round n − 1: "No matter what happened in rounds 1 . . . n − 2 (the past), and given that I will receive a payoff of 1 in round n (the future), I am better off playing D now."
• By induction, in round k, where k ≥ 1: "No matter what happened in rounds 1 . . . k − 1 (the past), and given that I will receive a payoff of (n − k) · 1 in rounds (k + 1) . . . n (the future), I am better off playing D in round k."
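To make the induction concrete, here is a minimal sketch (not from the slides; stage-game payoffs as before) that walks backward through the rounds: in every round D beats C against either opponent action, so the history is irrelevant and the equilibrium payoff-to-go grows by 1 per remaining round.

```python
# Minimal sketch (not from the slides): backward induction in the N-times repeated PD.
STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def backward_induction(n_rounds):
    payoff_to_go = 0
    for k in range(n_rounds, 0, -1):                    # last round first
        # D strictly beats C in the stage game, whatever the opponent plays,
        # so the choice in round k does not depend on the history:
        assert all(STAGE[('D', b)][0] > STAGE[('C', b)][0] for b in 'CD')
        payoff_to_go += STAGE[('D', 'D')][0]            # both defect in equilibrium
        print(f"round {k}: play D; equilibrium payoff-to-go = {payoff_to_go}")

backward_induction(3)
# round 3: play D; equilibrium payoff-to-go = 1
# round 2: play D; equilibrium payoff-to-go = 2
# round 1: play D; equilibrium payoff-to-go = 3
```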


Indefinite number of repetitions

• A Pareto-suboptimal outcome can be avoided in case the following three conditions are met.
  1. The Prisoners’ Dilemma is repeated an indefinite number of times (rounds).
  2. A so-called discount factor δ ∈ [0, 1] determines the probability of continuing the game after each round.
  3. The probability to continue, δ, must be large enough.
• Under these conditions suddenly infinitely many Nash equilibria exist. This is sometimes called an embarrassment of richness (Peters, 2008).
• Various Folk Theorems state the existence of multiple equilibria in infinitely repeated games.*
• We now informally discuss one version of "the" Folk Theorem.

* Folk Theorems are named such because their exact origin cannot be traced.


Example 2: Prisoners’ Dilemma repeated indefinitely

• Consider the game G∗(δ) where the PD is played a number of times in succession. We write G∗(δ): G0, G1, G2, . . . .
• The number of times the stage game is played is determined by a parameter 0 ≤ δ ≤ 1. The probability that the next stage (and the stages thereafter) will be played is δ. Thus, the probability that stage game Gt will be played is δ^t. (What if t = 0?)

• The PD (of which every Gt is an incarnation) is called the stage game, as opposed to the overall game G∗(δ).
• A history h of length t of a repeated game is a sequence of action profiles of length t.
• A realisation h is a countably infinite sequence of action profiles.
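A minimal simulation sketch (not from the slides) of this continuation rule: stage 0 is always played and each further stage is reached with probability δ, so the expected number of stages played is ∑_{t=0}^{∞} δ^t = 1/(1 − δ).

```python
# Minimal sketch (not from the slides): simulate the length of G*(delta).
import random

def simulate_length(delta, rng):
    t = 1                          # stage 0 is always played
    while rng.random() < delta:    # continue to the next stage with probability delta
        t += 1
    return t

delta, runs = 0.9, 100_000
rng = random.Random(42)
avg = sum(simulate_length(delta, rng) for _ in range(runs)) / runs
print(avg, 1 / (1 - delta))        # both approximately 10
```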


Example 2: Prisoners’ Dilemma repeated indefinitely (2)

• Example of a history of length t = 10:

  Round:           0 1 2 3 4 5 6 7 8 9
  Row player:      C D D D C C D D D D
  Column player:   C D D D D D D C D D

• The set of all possible histories (of any length) is denoted by H.

• A (mixed) strategy for Player i is a function si : H → Pr({C, D}) such that

Pr( Player i plays C in round |h| + 1 | h )= si(h)(C).

• A strategy profile s is a combination of strategies, one for each player.
• The expected payoff for player i given s can be computed. It is

  Expected payoff_i(s) = ∑_{t=0}^{∞} δ^t · Expected payoff_{i,t}(s).
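A minimal sketch (not from the slides) that evaluates this discounted sum numerically (truncated at a long horizon), e.g. for "both always cooperate" (expected stage payoff 3 every round, so 3/(1 − δ) in total) and "both always defect" (1/(1 − δ)).

```python
# Minimal sketch (not from the slides): Expected payoff_i(s) = sum_t delta**t * u_t,
# where u_t is the expected stage-game payoff in round t, truncated at a horizon
# where delta**t is negligible.
def discounted_payoff(stage_payoff, delta, horizon=10_000):
    """stage_payoff: function t -> expected stage-game payoff in round t."""
    return sum(delta**t * stage_payoff(t) for t in range(horizon))

delta = 0.9
print(discounted_payoff(lambda t: 3, delta))   # ~ 3 / (1 - delta) = 30  (always C vs C)
print(discounted_payoff(lambda t: 1, delta))   # ~ 1 / (1 - delta) = 10  (always D vs D)
```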


Example: The expected payoff of a stage game

Prisoners’ Dilemma         Other: Cooperate    Defect
You: Cooperate             (3, 3)              (0, 5)
     Defect                (5, 0)              (1, 1)

• Suppose the following strategy profile for one game:
  – Row player (you) plays with mixed strategy 0.8 on C (hence, 0.2 on D).
  – Column player (other) plays with mixed strategy 0.7 on C.
• Your expected payoff is

  0.8 · (0.7 · 3 + 0.3 · 0) + 0.2 · (0.7 · 5 + 0.3 · 1) = 2.44.

• General formula (cf., e.g., Leyton-Brown et al., 2008):

  Expected payoff_{i,t}(s) = ∑_{(i_1,...,i_n) ∈ A^n} ( ∏_{k=1}^{n} s_{k,i_k} ) · payoff_i(i_1, . . . , i_n),

  where s_{k,i_k} is the probability that player k plays action i_k.
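A minimal sketch (not from the slides) of this general formula — sum over all pure action profiles of the profile's probability under the mixed strategies times the player's payoff there — reproducing the 2.44 computed above.

```python
# Minimal sketch (not from the slides): expected stage-game payoff under mixed strategies.
from itertools import product

PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def expected_stage_payoff(player, mixes):
    """mixes: one dict per player, mapping action -> probability."""
    total = 0.0
    for profile in product('CD', repeat=len(mixes)):
        prob = 1.0
        for k, action in enumerate(profile):
            prob *= mixes[k][action]            # probability of this pure profile
        total += prob * PAYOFF[profile][player]
    return total

row = {'C': 0.8, 'D': 0.2}   # you
col = {'C': 0.7, 'D': 0.3}   # other
print(expected_stage_payoff(0, [row, col]))     # 2.44
```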


Expected payoffs for P1 and P2 in stage PD with mixed strategies

[Figure: payoff surfaces of P1 and P2 over the two mixed strategies. Player 1 may only move "back – front"; Player 2 may only move "left – right".]


Subgame perfect equilibria of G∗(δ): D∗

Recall: a subgame perfect Nash equilibrium of an extensive (in this case: repeated) game is a Nash equilibrium of this extensive game whose restriction to every subgame (read: tailgame) is also a Nash equilibrium of that subgame.

Consider the strategy of iterated defection D∗: "always defect, no matter what".*

Claim. The strategy profile (D∗, D∗) is a subgame perfect equilibrium in G∗(δ).

Proof. Consider any tailgame starting at round t. We are done if we can show that (D∗, D∗) is a NE for this subgame. This is true: given that one player always defects, it never pays off for the other player to play C at any time. Hence, everyone plays D∗.

* A notation like D∗ or (worse) D^∞ is suggestive. Mathematically it makes no sense, but intuitively it does.


Part II: Trigger strategies


Example 3: Cost of deviating in Round 4 of the repeated PD

Consider the so-called trigger strategy T: "always play C unless D has been played at least once. In that case play D forever".

Claim. The strategy profile (T, T) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large.

Proof. Consider a typical play:

  Round:           0 1 2 3 4 5 6 7 8 9 ...
  Row player:      C C C C C D D D D D ...
  Column player:   C C C C D D D D D D ...

The column player defects in Round 4 (so the row player, following T, defects from Round 5 onward). By doing so the column player expects a payoff of

  ∑_{t=0}^{3} δ^t · 3  +  δ^4 · 5  +  ∑_{t=5}^{∞} δ^t · 1


Example 3: Cost of deviating in Round 4 (2)

By cooperating throughout, the column player could have expected

  ∑_{t=0}^{∞} δ^t · 3

which means he forfeited

  ∑_{t=0}^{∞} δ^t · 3 − ( ∑_{t=0}^{3} δ^t · 3 + δ^4 · 5 + ∑_{t=5}^{∞} δ^t · 1 ) = −2δ^4 + 2 ∑_{t=5}^{∞} δ^t

by deviating from T. If δ ≠ 0,

  −2δ^4 + 2 ∑_{t=5}^{∞} δ^t > 0  ⇔  ∑_{t=1}^{∞} δ^t > 1  ⇔  ∑_{t=0}^{∞} δ^t > 2  ⇔  1/(1 − δ) > 2  ⇔  δ > 1/2.

Thus, if δ > 1/2, the column player forfeits payoff by deviating from T.


Analysis of trigger strat. generalised to deviation in round N

The player starts to defect at Round N. By doing so he expects a payoff of

  ∑_{t=0}^{N−1} δ^t · 3  +  δ^N · 5  +  ∑_{t=N+1}^{∞} δ^t · 1

By playing C throughout, the column player could have expected

  ∑_{t=0}^{∞} δ^t · 3

which means he forfeited

  δ^N · (3 − 5) + ∑_{t=N+1}^{∞} δ^t · (3 − 1) = −2δ^N + 2 ∑_{t=N+1}^{∞} δ^t

by deviating in round N (and thereafter).


Analysis of trigger strat. generalised to deviation in round N (2)

If δ ≠ 0, and

  −2δ^N + 2 ∑_{t=N+1}^{∞} δ^t > 0,

then deviating forfeits payoff. This is the case when

  −2δ^N + 2 ∑_{t=N+1}^{∞} δ^t > 0
  ⇔ ∑_{t=N+1}^{∞} δ^{t−N} > 1          (divide by 2δ^N)
  ⇔ ∑_{t=1}^{∞} δ^t > 1
  ⇔ δ / (1 − δ) > 1
  ⇔ δ > 1/2.

Thus, if δ > 1/2, every player forfeits payoff by deviating from T.
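A minimal sketch (not from the slides) that evaluates the forfeited amount in closed form, −2δ^N + 2∑_{t=N+1}^{∞} δ^t = 2δ^N(2δ − 1)/(1 − δ), and shows that its sign flips at δ = 1/2 regardless of N.

```python
# Minimal sketch (not from the slides): payoff forfeited by deviating from the
# trigger strategy T in round N (geometric series summed in closed form).
def forfeit(N, delta):
    return 2 * delta**N * (2 * delta - 1) / (1 - delta)

for delta in (0.3, 0.5, 0.6, 0.9):
    print(delta, [round(forfeit(N, delta), 4) for N in (0, 4, 10)])
# delta = 0.3: negative (deviating pays), delta = 0.5: exactly 0,
# delta = 0.6 and 0.9: positive (deviating forfeits payoff, so T is self-enforcing)
```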


Example 4: An alternating trigger strategy for the repeated PD

Yet another subgame perfect equilibrium:

Informal definition of strategies. A and B tacitly agree to alternate actions, i.e. (C, D), (D, C), (C, D), . . . . If one of them deviates, the other party plays D forever. (Consequently, the party who originally deviated plays D forever thereafter as well.) Notice the CKR aspect!

Let A be the strategy that plays C in Round 1. Let B be the other strategy.

Claim. The strategy profile (A, B) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large.

An analysis of this situation and a proof of this claim can be found in (Peters, 2008), pp. 104–105.*

*H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4.


Generalisation of trigger strategies

The idea of trigger strategies can be generalised.

• Both parties A, B reside in a strategy pair that consists of patterns of repeated action profiles of the stage game PD.
• Every convex combination* of payoffs

  α1 · (3, 3) + α2 · (0, 5) + α3 · (5, 0) + α4 · (1, 1)

  can be established by smartly picking appropriate strategy patterns. E.g.: "We play 4 times (C, C). Then we play 7 times (C, D), (D, C), . . . ".
• As long as these limiting average payoffs exceed payoff(D, D) for each player (which is 1), associated trigger strategies can be formulated that lead to these payoffs and trigger eternal play of (D, D) after a deviation. (See the sketch below.)
• For δ high enough, such strategies again form a SGP Nash equilibrium.

* Meaning αi ≥ 0 and α1 + α2 + α3 + α4 = 1.
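A minimal sketch (not from the slides) of the limiting-average computation for the example pattern above — 4 times (C, C) followed by 7 alternating profiles (C, D), (D, C), . . . — using the stage-game payoffs of these slides. Both averages exceed 1, so a trigger strategy can enforce the pattern.

```python
# Minimal sketch (not from the slides): limiting average payoff of a periodic pattern
# of stage-game action profiles = the average over one period.
STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def limiting_average(pattern):
    n = len(pattern)
    return (sum(STAGE[p][0] for p in pattern) / n,
            sum(STAGE[p][1] for p in pattern) / n)

pattern = [('C', 'C')] * 4 + [('C', 'D'), ('D', 'C')] * 3 + [('C', 'D')]   # period of 11
print(limiting_average(pattern))   # (2.4545..., 2.9090...): both exceed the (D, D) payoff 1
```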


Folk theorem for SGP Nash equilibria in the repeated PD

1. Feasible payoffs (striped in the figure): payoff combinations that can be obtained by jointly repeating patterns of actions (more accurately: patterns of action profiles).
2. Enforceable payoffs (shaded in the figure): everyone resides above minmax.

For every payoff pair (x, y) in (1) ∩ (2), there is a δ(x, y) ∈ (0, 1), such that for all δ ≥ δ(x, y) the payoff (x, y) can be obtained as the limiting average in a subgame perfect equilibrium of G∗(δ).

[Figure: the feasible and enforceable payoff regions of the PD, with the stage-game payoff points (1, 1), (0, 5), (5, 0) and (3, 3) marked on axes running from 0 to 5.]


Family of Folk Theorems

There actually exist many Folk Theorems.

• Horizon. May the game be repeated infinitely (as in our case), or is there an upper bound to the number of plays?
• Information. Do players act on the basis of CKR (present case), or are certain parts of the history hidden?
• Reward. Do players collect their payoff through a discount factor (present case) or through average rewards?
• Equilibrium. Do we consider Nash equilibria, or other forms of equilibria, such as so-called ε-Nash equilibria or so-called correlated equilibria?
• Subgame perfectness. Do we consider subgame perfect equilibria (present case) or just Nash equilibria?


General theorems

For the Prisoners’ Dilemma game PD we have established that each player always playing D is a subgame perfect equilibrium of the repeated game based on PD. This is the one-stage deviation principle in action. The following result follows from exactly the same logic.

Theorem. Let G be an arbitrary (not necessarily finite) n-person game, and let the strategy combination s be a Nash equilibrium of the stage game G. Let δ ∈ (0, 1). Then each player i playing si at every moment t is a subgame perfect equilibrium in G∗(δ).

Theorem (Folk theorem for subgame perfect equilibrium). Let (p, q) be a Nash equilibrium of the stage game G, and let (x, y) ∈ ⟨G⟩ such that x > pAq and y > pBq. Then there is a δ(x, y) ∈ (0, 1) such that for every δ ≥ δ(x, y) there is a subgame perfect equilibrium of the repeated game G∗(δ) with limiting average payoffs (x, y).


Existence of non-SGP Nash equilibria in repeated games

• We have seen that many subgame perfect equilibria exist for repeated games (at least for repeated games where both players have all information, the horizon is infinite, and further assumptions hold).
• What about the existence of non-SGP Nash equilibria in repeated games, i.e., equilibria that are not necessarily subgame perfect?
• Without the requirement of subgame perfection, deviations can be punished more severely: the equilibrium does not have to induce a Nash equilibrium in the punishment subgame.
• We will now consider the consequences of relaxing the subgame perfection requirement for a Nash equilibrium in an infinitely repeated game.


Example 5: A repeated game with a non-SGP NE

Some game             Other: Left (L)    Right (R)
You: Up (U)           (1, 1)             (0, 0)
     Down (D)         (0, 0)             (−1, 4)

1. For you, U is a dominating strategy.
2. The pure profile (U, L) is the only mixed strategy profile that is a Nash equilibrium. (Hence, (U, L), (U, L), . . . is a SGP-NE in the repeated game.)
3. Define trigger strategies (T1, T2) such that the pattern [(D, R), (U, L)^3]∗ is played indefinitely. (So we have periods of length 4.) If this pattern is violated:
   • The fallback strategy of the row player (you) is the mixed strategy (0.8, 0.2)∗.
   • The fallback strategy of the column player is the pure strategy R∗.
   This combination of fallback strategies is not a Nash equilibrium. (Cf. 2.)


Example 5: A repeated game with a non-SGP NE (2)

Some game             Other: Left (L)    Right (R)
You: Up (U)           (1, 1)             (0, 0)
     Down (D)         (0, 0)             (−1, 4)

Claim. The combination of trigger strategies (T1, T2) is a Nash equilibrium for some δ ∈ (0, 1).

• T1 ⇒ T2. If you play (the non-degenerate part of) T1, then the column player cannot do much different than T2, for T2 is a best response to T1.
• T2 ⇒ T1. If at all, the best moment for you to deviate is at D, for that would give you an incidental advantage of 1. After that your opponent falls back to R∗. Total payoff for you: 0 (for cheating) + 0 + · · · + 0 (for being punished by your opponent).


Example 5: A repeated game with a non-SGP NE (3)

Some game             Other: Left (L)    Right (R)
You: Up (U)           (1, 1)             (0, 0)
     Down (D)         (0, 0)             (−1, 4)

• T2 ⇒ T1 (continued). Total payoff for the row player: 0 (for cheating) + 0 + · · · + 0 (for being punished by the column player). Payoff for the row player had he remained loyal:

  (−1 + 1·δ + 1·δ^2 + 1·δ^3) + (−1·δ^4 + 1·δ^5 + 1·δ^6 + 1·δ^7) + · · ·

  = ∑_{k=0}^{∞} δ^k − 2 ∑_{k=0}^{∞} δ^{4k} = 1/(1 − δ) − 2/(1 − δ^4)

This expression is positive only if δ exceeds approximately 0.54. (Solve a 3rd-degree equation; see the sketch below.)
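A minimal sketch (not from the slides) that recovers the ≈ 0.54 threshold numerically: the expression above vanishes where 1/(1 − δ) = 2/(1 − δ^4), i.e. at the root of δ^3 + δ^2 + δ − 1 = 0.

```python
# Minimal sketch (not from the slides): find the delta above which staying loyal to
# the pattern [(D,R),(U,L)^3]* beats deviating, by bisection on
#   f(delta) = 1/(1 - delta) - 2/(1 - delta**4).
def f(delta):
    return 1 / (1 - delta) - 2 / (1 - delta**4)

lo, hi = 0.1, 0.9          # f(lo) < 0 < f(hi), so the root lies in between
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid
print((lo + hi) / 2)       # ~ 0.5437
```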


Retaliation in repeated games: playing the minmax value

In the previous game, your "Plan B" was to play a mixed strategy (0.8, 0.2). Questions:

• Why may a mixed strategy (0.8, 0.2) be considered a punishment?
• Is the mixed strategy (0.8, 0.2) the most severe punishment?
• If so, why?


Retaliation in repeated games: playing the minmax value (2)

Payoff bimatrix (second entries = payoffs of the opponent):

               Other: 0    1         2
    You: 0     (4, 9)      (7, 9)    (5, 8)
         1     (6, 7)      (8, 7)    (4, 8)
         2     (7, 9)      (5, 7)    (6, 7)

First for pure strategies.

• Which action of yours (the row player) minimises the maximum payoff of your opponent?
• This is the pure minmax: minimise the maximum payoff, which can be found by "scanning blue rows". It turns out that Action 1 keeps the payoff of your opponent below 9.
• Similarly, if your opponent wishes to punish you, he scans "green columns" to minimise your payoff ⇒ Action 2.
• Alert 1: minmax may be ≠ maxmin (= security level strategy).
• Alert 2: mixed minmax may be < pure minmax.
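The "row scanning" and "column scanning" above can be written out directly. A minimal sketch (not from the slides) for this bimatrix:

```python
# Minimal sketch (not from the slides): pure minmax in the 3x3 bimatrix above.
# PAYOFF[r][c] = (your payoff, opponent's payoff) for your row r and his column c.
PAYOFF = [[(4, 9), (7, 9), (5, 8)],
          [(6, 7), (8, 7), (4, 8)],
          [(7, 9), (5, 7), (6, 7)]]

# You punish the opponent: choose the row minimising his maximum payoff.
r_star = min(range(3), key=lambda r: max(PAYOFF[r][c][1] for c in range(3)))
print(r_star, max(PAYOFF[r_star][c][1] for c in range(3)))   # 1 8: Action 1 keeps him below 9

# He punishes you: choose the column minimising your maximum payoff.
c_star = min(range(3), key=lambda c: max(PAYOFF[r][c][0] for r in range(3)))
print(c_star, max(PAYOFF[r][c_star][0] for r in range(3)))   # 2 6: his Action 2 keeps you at most 6
```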


Minmax: payoff surface of the opponent (mixed strategies)

[Figure: the opponent's payoff surface as a function of your mix and his mix; by choosing your mix you push the range of his attainable payoffs down to its minimum.]


The minmax value as a threat

• Your (possibly mixed) minmax strategy can be used as a threat to deter your opponent from deviating from the tacitly agreed path. Cf. the threat:

If you do not comply with our normal pattern of actions, I am going to minmax you.

• By actually executing this threat you might harm yourself as well ⇒ non-SGP.

• For finite two-person strictly competitive games (such as zero-sum games), minmax = maxmin. (Finite in the sense that the arsenal of playable actions, hence the payoff matrix, is finite.)


Example 5: A repeated game with a non-SGP NE (4)

            Other: L    R
  You: U    (1, 1)      (0, 0)
       D    (0, 0)      (−1, 4)

• Your opponent can punish you maximally by playing R∗. How you can punish your opponent is less obvious.
• If you play D∗, your opponent will earn 4; if you play U∗, your opponent will earn 1.
• It is possible to punish your opponent even more by becoming unpredictable (within CKR!!).

Given your mixed strategy (u, d), your opponent maximises his payoff by choosing the right mix (l, r):

  max_{l,r}  u·l·1 + d·r·4
  = max_l  u·l + 4(1 − u)(1 − l)
  = max_l  (5u − 4)·l + 4 − 4u

• If 5u − 4 = 0, it does not matter what your opponent chooses for l — his expected payoff always equals 4 − 4·(4/5) = 4/5.


Example 5: A repeated game with a non-SGP NE (5)

Draw a picture of the payoff surface of the opponent.

[Figure: the opponent's expected payoff as a function of your mix u (on U) and his mix l (on L). Along u = 4/5 the surface is flat at height 4/5.]

• If 5u − 4 = 0, it does not matter what your opponent chooses for l. He expects 4 − 4·(4/5) = 4/5.
• If 5u − 4 > 0, your opponent will play l = 1, and expects to earn u, which is > 4/5.
• If 5u − 4 < 0, your opponent will play l = 0, and expects to earn 4 − 4u, which, again, is > 4/5.

These calculations are done by hand, and do not easily generalise to higher dimensions. (See the sketch below for a numerical cross-check.)
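As a cross-check on the by-hand calculation (and a hint at how it could be automated), a minimal sketch (not from the slides) that recovers the same mixed minmax by grid search over your mix u:

```python
# Minimal sketch (not from the slides): mixed minmax against the column player in the
# 2x2 game above. His expected payoff, given your probability u on U and his l on L,
# is (5u - 4)*l + 4 - 4u, which is linear in l, so an optimal l is 0 or 1.
def best_response_payoff(u):
    return max((5 * u - 4) * l + 4 - 4 * u for l in (0.0, 1.0))

grid = [i / 1000 for i in range(1001)]
u_star = min(grid, key=best_response_payoff)
print(u_star, best_response_payoff(u_star))   # 0.8 0.8: your mix (0.8, 0.2) holds him to 4/5
```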


Literature

Literature on game theory is vast. For me, the following sources were helpful and (no less important) offered different perspectives on repeated games.

* H. Gintis (2009): Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Interaction. Second Edition. Princeton University Press, Princeton. Ch. 9: Repeated Games.

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.

* K. Leyton-Brown & Y. Shoham (2008): Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan & Claypool Publishers. Ch. 6: Repeated and Stochastic Games.

* S.P. Hargreaves Heap & Y. Varoufakis (2004): Game Theory: A Critical Text. Routledge. Ch. 5: The Prisoners’ Dilemma – the riddle of co-operation and its implications for collective agency.

* J. Ratliff (1997): Graduate-Level Course in Game Theory (a.k.a. "Jim Ratliff's Graduate-Level Course in Game Theory"). Lecture notes, Dept. of Economics, University of Arizona. (Available through the web but not officially published.) Sec. 5.3: A Folk Theorem Sampler.

* M.J. Osborne & A. Rubinstein (1994): A Course in Game Theory. MIT Press. Ch. 8: Repeated games.


What next?

Now that we know that infinitely many equilibria exist in repeated games (an embarrassment of richness), there are a number of ways in which we may proceed.

• Gradient Dynamics. Approximate NE of single-shot games (stage games) through gradient ascent (hill-climbing).

• Reinforcement Learning. Agents simply execute the action(s) with maximal rewards in the past.

• No-regret learning. Agents execute the action(s) with maximal virtual rewards in the past.

• Fictitious Play. Sample the actions of opponent(s) and play a best response.
