Repeated Games
Multi-agent learning: Repeated games
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.
Last modified on February 9th, 2012 at 17:15.

Repeated games: motivation

1. Much interaction in multi-agent systems can be modelled through games.
2. Much learning in multi-agent systems can therefore be modelled through learning in games.
3. Learning in games usually takes place through the (gradual) adaptation of strategies (hence, behaviour) in a repeated game.
4. In most repeated games, one game (a.k.a. stage game) is played repeatedly. Possibilities:
   • A finite number of times.
   • An indefinite (same: indeterminate) number of times.
   • An infinite number of times.
5. Therefore, familiarity with the basic concepts and results from the theory of repeated games is essential to understand multi-agent learning.

Plan for today

• NE in normal form games that are repeated a finite number of times.
  – Principle of backward induction.
• NE in normal form games that are repeated an indefinite number of times.
  – Discount factor. Models the probability of continuation.
  – Folk theorem. (Actually many FTs.) Repeated games generally do have infinitely many Nash equilibria.
  – Trigger strategy, on-path vs. off-path play, the threat to "minmax" an opponent.

This presentation draws heavily on (Peters, 2008).

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.

Example 1: Nash equilibria in playing the PD twice

Prisoners' Dilemma            Other:
                        Cooperate    Defect
You:   Cooperate        (3, 3)       (0, 5)
       Defect           (5, 0)       (1, 1)

• Even if mixed strategies are allowed, the PD possesses only one Nash equilibrium, viz. (D, D) with payoffs (1, 1).
• This equilibrium is Pareto sub-optimal. (Because (3, 3) makes both players better off.)
• Does the situation change if two parties get to play the Prisoners' Dilemma two times in succession?
• The following diagram (hopefully) shows that playing the PD two times in succession does not yield an essentially new NE.

Example 1: Nash equilibria in playing the PD twice (2)

[Figure: game tree of the twice-repeated PD. The root (0, 0) branches over the first-round profiles CC, CD, DC, DD with stage payoffs (3, 3), (0, 5), (5, 0), (1, 1); each of these branches again over the second-round profiles, yielding the sixteen cumulative payoffs listed in the normal form below.]

Example 1: Nash equilibria in playing the PD twice (3)

In normal form:

              Other:
              CC         CD         DC         DD
You:   CC     (6, 6)     (3, 8)     (3, 8)     (0, 10)
       CD     (8, 3)     (4, 4)     (5, 5)     (1, 6)
       DC     (8, 3)     (5, 5)     (4, 4)     (1, 6)
       DD     (10, 0)    (6, 1)     (6, 1)     (2, 2)

• The action profile (DD, DD) is the only Nash equilibrium.
• With 3 successive games, we obtain a 2^3 × 2^3 matrix, where the action profile (DDD, DDD) still would be the only Nash equilibrium.
• Generalise to N repetitions: (D^N, D^N) still is the only Nash equilibrium in a repeated game where the PD is played N times in succession.
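The claim that all-out defection remains the only equilibrium can be checked mechanically. The sketch below is not part of the slides; the payoff dictionary PD and the function names are illustrative. It rebuilds the normal form of the N-fold PD for history-independent action sequences, exactly as in the 4 × 4 table above, and searches it for pure Nash equilibria.

```python
import itertools

# Stage-game payoffs of the Prisoners' Dilemma used on the slides
# (row player's payoff first, column player's second).
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def repeated_payoff(row_seq, col_seq):
    """Cumulative payoffs when both players commit to fixed action sequences."""
    total_row = total_col = 0
    for a, b in zip(row_seq, col_seq):
        u_row, u_col = PD[(a, b)]
        total_row += u_row
        total_col += u_col
    return total_row, total_col

def pure_nash_equilibria(n):
    """Pure NE of the n-fold PD, restricted to history-independent action sequences."""
    seqs = [''.join(s) for s in itertools.product('CD', repeat=n)]
    equilibria = []
    for r in seqs:
        for c in seqs:
            u_r, u_c = repeated_payoff(r, c)
            # A profile is a NE if neither player gains by a unilateral deviation.
            row_best = all(repeated_payoff(r2, c)[0] <= u_r for r2 in seqs)
            col_best = all(repeated_payoff(r, c2)[1] <= u_c for c2 in seqs)
            if row_best and col_best:
                equilibria.append((r, c))
    return equilibria

print(pure_nash_equilibria(2))   # [('DD', 'DD')]
print(pure_nash_equilibria(3))   # [('DDD', 'DDD')]
```

Note that, like the 4 × 4 matrix on the slide, this deliberately restricts attention to fixed action sequences rather than to all history-dependent strategies of the repeated game.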
Backward induction (version for repeated games)

• Suppose G is a game in normal form for p players, where all players possess the same arsenal of possible actions A = {a1, ..., am}.
• The game G^n arises from playing the stage game G n times in succession.
• A history h of length k is an element of (A^p)^k. E.g., for p = 3 and k = 10,

    a7 a5 a3 a6 a1 a9 a2 a7 a7 a3
    a6 a9 a2 a4 a2 a9 a9 a1 a1 a4
    a1 a2 a7 a9 a6 a1 a1 a8 a2 a4

  (one row per player) is a history of length ten in a game with three players. The set of all possible histories is denoted by H. (Hence, |H_k| = m^(kp), where H_k is the set of histories of length k.)
• A (possibly mixed) strategy for one player is a function H → Pr(A).

Backward induction (version for repeated games) (2)

• For some repeated games of length n, the dominating (read: "clearly best") strategy for all players in round n (the last round) does not depend on the history of play. E.g., for the Prisoners' Dilemma in the last round: "No matter what happened in rounds 1, ..., n − 1, I am better off playing D."
• Fixed strategies (D, D) in round n determine play after round n − 1.
• Independence of history, plus a determined future, leads to the following justification for playing D in round n − 1: "No matter what happened in rounds 1, ..., n − 2 (the past), and given that I will receive a payoff of 1 in round n (the future), I am better off playing D now."
• Per induction, in round k, where k ≥ 1: "No matter what happened in rounds 1, ..., k − 1 (the past), and given that I will receive a payoff of (n − k) · 1 in rounds k + 1, ..., n (the future), I am better off playing D in round k."

Indefinite number of repetitions

• A Pareto-suboptimal outcome can be avoided if the following three conditions are met.
  1. The Prisoners' Dilemma is repeated an indefinite number of times (rounds).
  2. A so-called discount factor δ ∈ [0, 1] determines the probability of continuing the game after each round.
  3. The probability to continue, δ, must be large enough.
• Under these conditions, suddenly infinitely many Nash equilibria exist. This is sometimes called an embarrassment of richness (Peters, 2008).
• Various Folk theorems state the existence of multiple equilibria in infinitely repeated games.*
• We now informally discuss one version of "the" Folk Theorem.

* Folk Theorems are named such because their exact origin cannot be traced.

Example 2: Prisoners' Dilemma repeated indefinitely

• Consider the game G∗(δ) where the PD is played a number of times in succession. We write G∗(δ): G0, G1, G2, ...
• The number of times the stage game is played is determined by a parameter 0 ≤ δ ≤ 1. The probability that the next stage (and the stages thereafter) will be played is δ. Thus, the probability that stage game Gt will be played is δ^t. (What if t = 0?)
• The PD (of which every Gt is an incarnation) is called the stage game, as opposed to the overall game G∗(δ).
• A history h of length t of a repeated game is a sequence of action profiles of length t.
• A realisation h is a countably infinite sequence of action profiles.
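To make the interpretation of δ as a continuation probability concrete, here is a small simulation sketch (not from the slides; the function name is illustrative): round 0 is always played and every further round is reached with probability δ, so the number of stage games is geometrically distributed with expectation 1 / (1 − δ).

```python
import random

def rounds_played(delta, rng):
    """Number of stage games played when, after each round, the game
    continues with probability delta (round 0 is always played)."""
    t = 1
    while rng.random() < delta:
        t += 1
    return t

rng = random.Random(0)          # fixed seed, purely for reproducibility
delta = 0.9
samples = [rounds_played(delta, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))   # close to 1 / (1 - delta) = 10
```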
Example 2: Prisoners' Dilemma repeated indefinitely (2)

• Example of a history of length t = 10:

    Round:          0 1 2 3 4 5 6 7 8 9
    Row player:     C D D D C C D D D D
    Column player:  C D D D D D D C D D

• The set of all possible histories (of any length) is denoted by H.
• A (mixed) strategy for Player i is a function si : H → Pr({C, D}) such that
    Pr( Player i plays C in round |h| + 1 | h ) = si(h)(C).
• A strategy profile s is a combination of strategies, one for each player.
• The expected payoff for player i given s can be computed. It is
    Expected payoff_i(s) = ∑_{t=0}^{∞} δ^t · Expected payoff_{i,t}(s).

Example: The expected payoff of a stage game

Prisoners' Dilemma            Other:
                        Cooperate    Defect
You:   Cooperate        (3, 3)       (0, 5)
       Defect           (5, 0)       (1, 1)

• Suppose the following strategy profile for one game:
  – Row player (you) plays with mixed strategy 0.8 on C (hence, 0.2 on D).
  – Column player (other) plays with mixed strategy 0.7 on C.
• Your expected payoff is 0.8 · (0.7 · 3 + 0.3 · 0) + 0.2 · (0.7 · 5 + 0.3 · 1) = 2.44.
• General formula (cf., e.g., Leyton-Brown et al., 2008):
    Expected payoff_{i,t}(s) = ∑_{(i1,...,in) ∈ A^n} ( ∏_{k=1}^{n} s_{k,i_k} ) · payoff_i(i1, ..., in).

Expected payoffs for P1 and P2 in stage PD with mixed strategies

[Figure: surface plots of the expected payoffs for P1 and P2 in the stage PD under mixed strategies. Player 1 may only move "back – front"; Player 2 may only move "left – right".]

Subgame perfect equilibria of G∗(δ): D∗

Recall: a subgame perfect Nash equilibrium of an extensive (in this case: repeated) game is a Nash equilibrium of this extensive game whose restriction to every subgame (read: tailgame) is also a Nash equilibrium of that subgame.

Consider the strategy of iterated defection D∗: "always defect, no matter what".

Claim. The strategy profile (D∗, D∗) is a subgame perfect equilibrium in G∗(δ).

Proof. Consider any tailgame starting at round t. We are done if we can show that (D∗, D∗) is a NE for this subgame.
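The two payoff formulas above, the per-stage expectation and the discounted total, can be reproduced with a few lines of code. This is a minimal sketch under assumed names (PD, expected_stage_payoffs, discounted_total), not part of the slides; the column player's expectation of roughly 2.94 is computed the same way as the 2.44 on the slide.

```python
import itertools

# Stage-game payoffs of the Prisoners' Dilemma (row payoff, column payoff).
PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def expected_stage_payoffs(row_mix, col_mix):
    """Expected payoffs of one stage game under independent mixed strategies:
    the sum over action profiles of (product of the players' action
    probabilities) times the payoff of that profile."""
    total_row = total_col = 0.0
    for a, b in itertools.product('CD', repeat=2):
        prob = row_mix[a] * col_mix[b]
        u_row, u_col = PD[(a, b)]
        total_row += prob * u_row
        total_col += prob * u_col
    return total_row, total_col

def discounted_total(stage_values, delta):
    """Discounted sum  sum_t delta**t * v_t  over a finite list of stage values."""
    return sum(delta**t * v for t, v in enumerate(stage_values))

# The worked example from the slides: you put 0.8 on C, the other puts 0.7 on C.
print(expected_stage_payoffs({'C': 0.8, 'D': 0.2}, {'C': 0.7, 'D': 0.3}))
# -> approximately (2.44, 2.94): your expected payoff is 2.44, as on the slide.

# Discounted total if that expectation were repeated for 50 rounds with delta = 0.9.
print(discounted_total([2.44] * 50, 0.9))   # close to 2.44 / (1 - 0.9) = 24.4
```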
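The claim can also be probed numerically; this is an illustration, not a substitute for the proof. Against D∗, deviating to C in any round merely replaces a stage payoff of 1 by 0 while leaving the opponent's play unchanged, so conforming to D∗ is weakly better for every δ. A finite-horizon sketch with assumed names:

```python
def discounted_payoff(stage_payoffs, delta):
    """Discounted sum  sum_t delta**t * u_t  of a finite stream of stage payoffs."""
    return sum(delta**t * u for t, u in enumerate(stage_payoffs))

def row_payoffs_against_D_star(cooperation_rounds, horizon=500):
    """Row player's stage payoffs against "always defect": the row player plays C
    exactly in the given rounds and D otherwise (D vs D gives 1, C vs D gives 0)."""
    return [0 if t in cooperation_rounds else 1 for t in range(horizon)]

delta = 0.9
conform = discounted_payoff(row_payoffs_against_D_star(set()), delta)      # stick to D*
deviate = discounted_payoff(row_payoffs_against_D_star({0, 3, 7}), delta)  # cooperate in rounds 0, 3, 7
print(conform, deviate, conform >= deviate)   # True: conforming to D* is weakly better
```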