Memory-Two Strategies Forming Symmetric Mutual Reinforcement Learning Equilibrium in Repeated Prisoner's Dilemma Game

Memory-two strategies forming symmetric mutual reinforcement learning equilibrium in repeated prisoner’s dilemma game Masahiko Ueda Graduate School of Sciences and Technology for Innovation, Yamaguchi University, Yamaguchi 753-8511, Japan Abstract We investigate symmetric equilibria of mutual reinforcement learning when both players alternately learn the optimal memory-two strategies against the opponent in the repeated prisoner’s dilemma game. We provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. We then provide two examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. We also prove that mutual reinforcement learning equilibria formed by memory-two strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-n strategies with n> 2. Keywords: Repeated prisoner’s dilemma game; Reinforcement learning; Memory-two strategies 1. Introduction Learning in game theory has been attracted much attention [1, 2, 3, 4, 5]. Because rationality of human beings is bounded [6], modeling of human beings as learning agents has been used in game theory and theoretical economics. One of the most popular learning methods is reinforcement learning [7]. In reinforcement learning, an agent gradually learns the optimal policy against a stationary environment. Mutual reinforcement learning in game theory is a more difficult problem since the existence of multiple agents makes an environment nonsta- arXiv:2108.03258v1 [physics.soc-ph] 5 Aug 2021 tionary [8, 9, 10, 11, 12]. Several methods have been proposed for reinforcement learning with multiple agents [13]. Here, we investigate mutual reinforcement learning in the repeated prisoner’s dilemma game [14]. More explicitly, we investigate properties of equilibria formed by learning agents when the two agents alternately learn their optimal strategies against the opponent. In the previous study [15], it was Email address: [email protected] (Masahiko Ueda) Preprint submitted to Elsevier August 10, 2021 found that, among all deterministic memory-one strategies, only the Grim trig- ger strategy, the Win-Stay Lose-Shift strategy, and the All-D strategy can form symmetric equilibrium of mutual reinforcement learning. Recently, memory-n strategies with n > 1 attract much attention in game theory because longer memory enables more complicated behavior [16, 17, 18, 19, 20, 21]. However, even whether the above equilibria formed by memory-one strategies are still equilibria in memory-n settings or not has not been known. In this paper, we extend the analysis of Ref. [15] to memory-two strategies. First, we provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. Then we provide two non-trivial examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. Furthermore, we also prove that mutual reinforcement learning equilibria formed by memory-n′ strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-n strategies with n>n′. This paper is organized as follows. In Section 2, we introduce the repeated prisoner’s dilemma game with memory-n strategies, and players using reinforcement learning. In Section 3, we show that the structure of the optimal strategies is constrained by the Bellman optimality equation. In Section 4, we introduce the concepts of mutual reinforcement learning equilibrium and symmetric equilibrium. We then provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. In Section 5, we provide two examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. In Section 6, we show that mutual reinforcement learning equilibria formed by memory-n′ strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-n strategies with n>n′. Section 7 is devoted to conclusion. 2. Model We introduce the repeated prisoner’s dilemma game [8]. There are two players (1 and 2) in the game. Each player chooses cooperation (C) or defection (D) on every round. The action of player a is written as σa ∈ {C,D}. We collectively write σ := (σ1, σ2), and call σ a state. We also write the space of all possible states as Ω := {C,D}2. The payoff of player a ∈ {1, 2} when the state is σ is described as ra (σ). The payoffs in the prisoner’s dilemma game are given by (r1 (C, C) , r1 (C,D) , r1 (D, C) , r1 (D,D)) = (R,S,T,P ) (1) (r2 (C, C) , r2 (C,D) , r2 (D, C) , r2 (D,D)) = (R,T,S,P ) (2) with T>R>P>S and 2R>T + S. The memory-n strategy (n ≥ 1) n a T σ σ(−m) of player is described as the conditional probability a a| m=1 of n σ n σ(−m) taking action a when the states in the previous rounds are m=1, n where we have introduced the notation σ(−m) := σ(−1), ··· , σ(−n) [21]. m=1 2 We write the length of memory of player a as na and define n := max {n1,n2}. In this paper, we assume that n is finite. Below we introduce the notation −a := {1, 2}\a. We consider the situation that both players learn their optimal strategies against the strategy of the opponent by reinforcement learning [7]. In reinforcement learning, each player learns mapping (called policy) from the states n σ(−m) n σ m=1 in the previous rounds to his/her action so as to maximize his/her expected future reward. We write the action of player a at round t as σa(t). In addition, we write ra(t) := ra (σ(t)). We define the action-value function of player a as ∞ n n σ(−m) E k σ t−n σ(−m) Qa σa, := γ ra(t + k) σa(t)= σa, [ (s)]s=t−1 = , m=1 " m=1# h i k=0 h i X (3) where γ is a discounting factor satisfying 0 ≤ γ < 1. The action-value function n Q σ , σ(−m) a a a m=1 represents the expected future payoffs of player by tak- n σ n σ(−m) ing action a when states in the previous rounds are m=1. It should be noted that the right-hand side does not depend on t. Due to the property of memory-n strategies, the action-value function Qa obeys the Bellman equation against a fixed strategy T−a of the opponent: n (−m) Qa σa, σ m=1 n h i (−m) = ra (σ) T−a σ−a| σ m=1 σ−a X h i n−1 n n−1 ′ σ σ(−m) σ(−m) ′ σ σ(−m) +γ Ta σa| , T−a σ−a| Qa σa, , . ′ m=1 m=1 m=1 σ σ−a Xa X h i h i h i (4) See Appendix A for the derivation of Eq. (4). It has been known that the ∗ ∗ optimal policy Ta and the optimal action-value function Qa obeys the following Bellman optimality equation: n ∗ σ(−m) Qa σa, m=1 n h i (−m) = ra (σ) T−a σ−a| σ m=1 σ−a X h i n n−1 σ(−m) ∗ σ σ(−m) +γ T−a σ−a| max Qa σ,ˆ , , (5) m=1 σˆ m=1 σ−a X h i h i with the support n n ∗ σ(−m) ∗ σ(−m) suppTa ·| = argmax Qa σ, . (6) m=1 σ m=1 h i h i 3 See Appendix B for the derivation of Eqs. (5) and (6). In other words, in the optimal policy against T−a, player a takes the action σa which maximizes n Q∗ , σ(−m) n the value of a · m=1 when the states at the previous rounds are n σ(−m) m=1. We investigate the situation that players infinitely repeat the infinitely- repeated games and players alternately learn their optimal strategies in each game, as in Ref. [15]. We write the optimal strategy and the corresponding ∗(d) ∗(d) optimal action-value function of player a at d-th game as Ta and Qa , re- ∗(0) spectively. Given an initial strategy T2 of player 2, in the (2l − 1)-th game N ∗(2l−1) ∗(2l−2) ∗(2l−1) (l ∈ ), player 1 learns T1 against T2 by calculating Q1 . In the ∗(2l) ∗(2l−1) ∗(2l) 2l-th game, player 2 learns T2 against T1 by calculating Q2 . We are ∗(∞) ∗(∞) interested in the fixed points of the dynamics, that is, Ta and Qa . In this paper, we mainly investigate situations that the support (6) contains only one action, that is, strategies are deterministic. The number of determinis- 2n tic memory-n strategies in the repeated prisoner’s dilemma game is 22 , which increases rapidly as n increases. 3. Structure of optimal strategies Below we consider only the case n = 2. The Bellman optimality equation (5) for n = 2 is ∗ σ(−1) σ(−2) Qa σa, , (−1) (−2) = ra (σ) T−a σ−a| σ , σ σ−a X σ(−1) σ(−2) ∗ σ σ(−1) +γ T−a σ−a| , max Qa σ,ˆ , (7) σˆ σ−a X with ∗ σ(−1) σ(−2) ∗ σ(−1) σ(−2) suppTa ·| , = argmax Qa σ, , . (8) σ The number of memory-two deterministic strategies is 216, which is quite large, and therefore we cannot investigate all memory-two deterministic strategies as in the case of memory-one deterministic strategies [15]. Instead, we first investigate general properties of optimal strategies. We introduce the matrix representation of a strategy: Ta (σ) Ta (σ| (C, C), (C, C)) Ta ( σ| (C, C), (C,D)) Ta ( σ| (C, C), (D, C)) Ta (σ| (C, C), (D,D)) Ta ( σ| (C,D), (C, C)) Ta (σ| (C,D), (C,D)) Ta (σ| (C,D), (D, C)) Ta (σ| (C,D), (D,D)) := .

Load more