arXiv:2108.03258v1 [physics.soc-ph] 5 Aug 2021

Memory-two strategies forming symmetric mutual reinforcement learning equilibrium in repeated prisoner's dilemma game

Masahiko Ueda
Graduate School of Sciences and Technology for Innovation, Yamaguchi University, Yamaguchi 753-8511, Japan
Email address: [email protected] (Masahiko Ueda)

Abstract

We investigate symmetric equilibria of mutual reinforcement learning when both players alternately learn the optimal memory-two strategies against the opponent in the repeated prisoner's dilemma game. We provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. We then provide two examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. We also prove that mutual reinforcement learning equilibria formed by memory-two strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-$n$ strategies with $n > 2$.

Keywords: Repeated prisoner's dilemma game; Reinforcement learning; Memory-two strategies

1. Introduction

Learning in game theory has attracted much attention [1, 2, 3, 4, 5]. Because the rationality of human beings is bounded [6], learning has been used in the modeling of human beings and agents in game theory and theoretical economics. One of the most popular methods of learning is reinforcement learning [7], in which an agent gradually learns the optimal policy against a stationary environment. Mutual learning in a game is a more difficult problem, since the existence of multiple agents makes the environment nonstationary [8, 9, 10, 11, 12]. Several methods have been proposed for reinforcement learning with multiple agents [13].

Here, we investigate mutual reinforcement learning in the repeated prisoner's dilemma game [14]. More explicitly, we investigate the properties of equilibria formed by learning agents when the two agents alternately learn their optimal strategies against the opponent. In the previous study [15], it was found that, among all deterministic memory-one strategies, only the Grim trigger strategy, the Win-Stay Lose-Shift strategy, and the All-D strategy can form symmetric equilibria of mutual reinforcement learning. Recently, memory-$n$ strategies with $n > 1$ have attracted much attention because longer memory enables more complicated behavior [16, 17, 18, 19, 20, 21]. However, it has not been known whether the equilibria formed by memory-one strategies remain equilibria in memory-$n$ settings.

In this paper, we extend the analysis of Ref. [15] to memory-two strategies. First, we provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. Then we provide two non-trivial examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. Furthermore, we also prove that mutual reinforcement learning equilibria formed by memory-$n'$ strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-$n$ strategies with $n > n'$. This paper is organized as follows.
In Section 2, we introduce the repeated prisoner's dilemma game with memory-$n$ strategies and players using reinforcement learning. In Section 3, we show that the structure of the optimal strategies is constrained by the Bellman optimality equation. In Section 4, we introduce the concepts of mutual reinforcement learning equilibrium and symmetric equilibrium. We then provide the necessary condition for memory-two deterministic strategies to form symmetric equilibria. In Section 5, we provide two examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria. In Section 6, we show that mutual reinforcement learning equilibria formed by memory-$n'$ strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-$n$ strategies with $n > n'$. Section 7 is devoted to the conclusion.

2. Model

We introduce the repeated prisoner's dilemma game [8]. There are two players (1 and 2) in the game. Each player chooses cooperation (C) or defection (D) in every round. The action of player $a$ is written as $\sigma_a \in \{C, D\}$. We collectively write $\sigma := (\sigma_1, \sigma_2)$ and call $\sigma$ a state. We also write the space of all possible states as $\Omega := \{C, D\}^2$. The payoff of player $a \in \{1, 2\}$ when the state is $\sigma$ is described as $r_a(\sigma)$. The payoffs in the prisoner's dilemma game are given by

\[
\left( r_1(C,C), r_1(C,D), r_1(D,C), r_1(D,D) \right) = (R, S, T, P) \tag{1}
\]

\[
\left( r_2(C,C), r_2(C,D), r_2(D,C), r_2(D,D) \right) = (R, T, S, P) \tag{2}
\]
with $T > R > P > S$ and $2R > T + S$. The memory-$n$ strategy ($n \geq 1$) of player $a$ is described as the conditional probability $T_a\left(\sigma_a \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)$ of taking action $\sigma_a$ when the states in the previous $n$ rounds are $\left[\sigma^{(-m)}\right]_{m=1}^{n}$, where we have introduced the notation $\left[\sigma^{(-m)}\right]_{m=1}^{n} := \left(\sigma^{(-1)}, \cdots, \sigma^{(-n)}\right)$ [21]. We write the length of memory of player $a$ as $n_a$ and define $n := \max\{n_1, n_2\}$. In this paper, we assume that $n$ is finite. Below we introduce the notation $-a := \{1, 2\} \backslash a$.

We consider the situation that both players learn their optimal strategies against the strategy of the opponent by reinforcement learning [7]. In reinforcement learning, each player learns a mapping (called a policy) from the states $\left[\sigma^{(-m)}\right]_{m=1}^{n}$ in the previous $n$ rounds to his/her action $\sigma_a$ so as to maximize his/her expected future reward. We write the action of player $a$ at round $t$ as $\sigma_a(t)$. In addition, we write $r_a(t) := r_a(\sigma(t))$. We define the action-value function of player $a$ as

\[
Q_a\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) := \mathrm{E}\left[ \left. \sum_{k=0}^{\infty} \gamma^k r_a(t+k) \,\right|\, \sigma_a(t) = \sigma_a, \left[\sigma(s)\right]_{s=t-1}^{t-n} = \left[\sigma^{(-m)}\right]_{m=1}^{n} \right], \tag{3}
\]
where $\gamma$ is a discounting factor satisfying $0 \leq \gamma < 1$. The action-value function $Q_a\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)$ represents the expected future payoffs of player $a$ obtained by taking action $\sigma_a$ when the states in the previous $n$ rounds are $\left[\sigma^{(-m)}\right]_{m=1}^{n}$. It should be noted that the right-hand side does not depend on $t$. Due to the property of memory-$n$ strategies, the action-value function $Q_a$ obeys the Bellman equation against a fixed strategy $T_{-a}$ of the opponent:

\[
Q_a\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
+ \gamma \sum_{\sigma_a'} \sum_{\sigma_{-a}} T_a\left(\sigma_a' \middle| \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right) T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) Q_a\left(\sigma_a', \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right). \tag{4}
\]

See Appendix A for the derivation of Eq. (4). It is known that the optimal policy $T_a^*$ and the optimal action-value function $Q_a^*$ obey the following Bellman optimality equation:

\[
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right), \tag{5}
\]
with the support

\[
\mathrm{supp}\, T_a^*\left(\cdot \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) = \mathop{\mathrm{argmax}}_{\sigma} Q_a^*\left(\sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right). \tag{6}
\]
See Appendix B for the derivation of Eqs. (5) and (6). In other words, in the optimal policy against $T_{-a}$, player $a$ takes the action $\sigma_a$ which maximizes the value of $Q_a^*\left(\cdot, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)$ when the states in the previous $n$ rounds are $\left[\sigma^{(-m)}\right]_{m=1}^{n}$.

We investigate the situation that the players infinitely repeat the infinitely-repeated game and alternately learn their optimal strategies in each game, as in Ref. [15]. We write the optimal strategy and the corresponding optimal action-value function of player $a$ in the $d$-th game as $T_a^{*(d)}$ and $Q_a^{*(d)}$, respectively. Given an initial strategy $T_2^{*(0)}$ of player 2, in the $(2l-1)$-th game ($l \in \mathbb{N}$), player 1 learns $T_1^{*(2l-1)}$ against $T_2^{*(2l-2)}$ by calculating $Q_1^{*(2l-1)}$. In the $2l$-th game, player 2 learns $T_2^{*(2l)}$ against $T_1^{*(2l-1)}$ by calculating $Q_2^{*(2l)}$. We are interested in the fixed points of this dynamics, that is, $T_a^{*(\infty)}$ and $Q_a^{*(\infty)}$.

In this paper, we mainly investigate situations in which the support (6) contains only one action, that is, the strategies are deterministic. The number of deterministic memory-$n$ strategies in the repeated prisoner's dilemma game is $2^{2^{2n}}$, which increases rapidly as $n$ increases.
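As a concrete illustration of the learning step described above (this example is ours, not part of the original analysis), the following Python sketch solves Eqs. (5) and (6) for $n = 2$ by value iteration against a fixed opponent strategy. The payoff values $(T, R, P, S) = (5, 3, 1, 0)$ and all identifiers such as best_response are illustrative assumptions.

\begin{verbatim}
import itertools

# Assumed prisoner's dilemma payoffs with T > R > P > S and 2R > T + S.
T_, R, P, S = 5, 3, 1, 0
r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T_, ('D', 'D'): P}   # payoffs of player 1
ACTIONS = ('C', 'D')
STATES = list(itertools.product(ACTIONS, ACTIONS))        # sigma = (sigma_1, sigma_2)
HISTORIES = list(itertools.product(STATES, STATES))       # (sigma^(-1), sigma^(-2))

def best_response(opponent, gamma=0.95, tol=1e-10):
    """Solve Eq. (5) for player 1 against a fixed memory-two opponent by value iteration.

    `opponent(h)` is T_2(C | h), the probability that player 2 cooperates at history h.
    Returns the optimal action-value function Q* and the greedy policy of Eq. (6).
    """
    Q = {(a, h): 0.0 for a in ACTIONS for h in HISTORIES}
    diff = 1.0
    while diff > tol:
        diff = 0.0
        for h in HISTORIES:
            p_c = opponent(h)
            for a1 in ACTIONS:
                new_value = 0.0
                for a2, p in (('C', p_c), ('D', 1.0 - p_c)):
                    next_h = ((a1, a2), h[0])              # the new state becomes sigma^(-1)
                    new_value += p * (r1[(a1, a2)]
                                      + gamma * max(Q[(b, next_h)] for b in ACTIONS))
                diff = max(diff, abs(new_value - Q[(a1, h)]))
                Q[(a1, h)] = new_value
    policy = {h: max(ACTIONS, key=lambda a: Q[(a, h)]) for h in HISTORIES}
    return Q, policy

# Example: player 2 uses the rule "cooperate iff the second-to-last state was (C,C)";
# the returned greedy policy is player 1's best reply to it.
Q, policy = best_response(lambda h: 1.0 if h[1] == ('C', 'C') else 0.0)
\end{verbatim}

Within the alternating learning scheme, one player's strategy is held fixed while the other performs this computation, and the roles are then exchanged in the next game.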

3. Structure of optimal strategies

Below we consider only the case $n = 2$. The Bellman optimality equation (5) for $n = 2$ is

\[
Q_a^*\left(\sigma_a, \sigma^{(-1)}, \sigma^{(-2)}\right)
= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \sigma^{(-1)}\right), \tag{7}
\]
with
\[
\mathrm{supp}\, T_a^*\left(\cdot \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = \mathop{\mathrm{argmax}}_{\sigma} Q_a^*\left(\sigma, \sigma^{(-1)}, \sigma^{(-2)}\right). \tag{8}
\]
The number of memory-two deterministic strategies is $2^{16}$, which is quite large, and therefore we cannot investigate all memory-two deterministic strategies as in the case of memory-one deterministic strategies [15]. Instead, we first investigate general properties of the optimal strategies. We introduce the matrix representation of a strategy:

\[
T_a(\sigma) := \begin{pmatrix}
T_a\left(\sigma|(C,C),(C,C)\right) & T_a\left(\sigma|(C,C),(C,D)\right) & T_a\left(\sigma|(C,C),(D,C)\right) & T_a\left(\sigma|(C,C),(D,D)\right) \\
T_a\left(\sigma|(C,D),(C,C)\right) & T_a\left(\sigma|(C,D),(C,D)\right) & T_a\left(\sigma|(C,D),(D,C)\right) & T_a\left(\sigma|(C,D),(D,D)\right) \\
T_a\left(\sigma|(D,C),(C,C)\right) & T_a\left(\sigma|(D,C),(C,D)\right) & T_a\left(\sigma|(D,C),(D,C)\right) & T_a\left(\sigma|(D,C),(D,D)\right) \\
T_a\left(\sigma|(D,D),(C,C)\right) & T_a\left(\sigma|(D,D),(C,D)\right) & T_a\left(\sigma|(D,D),(D,C)\right) & T_a\left(\sigma|(D,D),(D,D)\right)
\end{pmatrix}, \tag{9}
\]
where the row specifies $\sigma^{(-1)}$ and the column specifies $\sigma^{(-2)}$. For deterministic strategies, each component of the matrix is 0 or 1. We now prove the following proposition.

Proposition 1. For two different states $\sigma^{(-2)}$ and $\sigma^{(-2)\prime}$, if

\[
T_{-a}\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = T_{-a}\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)\prime}\right) \quad (\forall \sigma) \tag{10}
\]
holds for some $\sigma^{(-1)}$, then

\[
T_a^*\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = T_a^*\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)\prime}\right) \quad (\forall \sigma) \tag{11}
\]
also holds.

Proof. For such $\sigma^{(-1)}$, because of Eq. (7), we find

\[
\begin{aligned}
Q_a^*\left(\sigma_a, \sigma^{(-1)}, \sigma^{(-2)}\right)
&= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \sigma^{(-1)}\right) \\
&= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)\prime}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \sigma^{(-1)}, \sigma^{(-2)\prime}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \sigma^{(-1)}\right) \\
&= Q_a^*\left(\sigma_a, \sigma^{(-1)}, \sigma^{(-2)\prime}\right)
\end{aligned} \tag{12}
\]
for all $\sigma_a$. Since $T_a^*$ is determined by Eq. (8), we obtain Eq. (11).

This proposition implies that the structure of the matrix $T_a^*(\sigma)$ is the same as that of $T_{-a}(\sigma)$. For deterministic strategies, in order to see this in more detail, we introduce the following sets for $a \in \{1, 2\}$ and $\sigma^{(-1)} \in \Omega$:

\[
N_x^{(a)}\left(\sigma^{(-1)}\right) := \left\{ \sigma^{(-2)} \in \Omega \,\middle|\, T_a\left(C \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = x \right\}, \tag{13}
\]
where $x \in \{0, 1\}$. We remark that $N_0^{(a)}\left(\sigma^{(-1)}\right) \cup N_1^{(a)}\left(\sigma^{(-1)}\right) = \Omega$ for all $a$ and $\sigma^{(-1)}$. Then, Proposition 1 leads to the following corollary.

Corollary 1. For a deterministic strategy $T_{-a}$ of player $-a$, if the optimal strategy $T_a^*$ of player $a$ against $T_{-a}$ is also deterministic, then one of the following four relations holds for each $\sigma^{(-1)} \in \Omega$:
(a) $N_x^{(a)}\left(\sigma^{(-1)}\right) = N_x^{(-a)}\left(\sigma^{(-1)}\right)$ for all $x$
(b) $N_x^{(a)}\left(\sigma^{(-1)}\right) = N_{1-x}^{(-a)}\left(\sigma^{(-1)}\right)$ for all $x$
(c) $N_0^{(a)}\left(\sigma^{(-1)}\right) = N_0^{(-a)}\left(\sigma^{(-1)}\right) \cup N_1^{(-a)}\left(\sigma^{(-1)}\right) = \Omega$ and $N_1^{(a)}\left(\sigma^{(-1)}\right) = \emptyset$
(d) $N_1^{(a)}\left(\sigma^{(-1)}\right) = N_0^{(-a)}\left(\sigma^{(-1)}\right) \cup N_1^{(-a)}\left(\sigma^{(-1)}\right) = \Omega$ and $N_0^{(a)}\left(\sigma^{(-1)}\right) = \emptyset$.

4. Symmetric equilibrium

In this section, we investigate symmetric equilibria of mutual reinforcement learning. First, we introduce the notation $\bar{C} := D$, $\bar{D} := C$, and $\pi(\sigma_1, \sigma_2) := (\sigma_2, \sigma_1)$. We define what we mean by the same strategy.

Definition 1. A strategy $T_a$ of player $a$ is the same strategy as that of player $-a$ iff

\[
T_a\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = T_{-a}\left(\sigma \middle| \pi\left(\sigma^{(-1)}\right), \pi\left(\sigma^{(-2)}\right)\right) \tag{14}
\]
for all $\sigma$, $\sigma^{(-1)}$, and $\sigma^{(-2)}$.

Next, we introduce equilibria achieved by mutual reinforcement learning.

Definition 2. A pair of strategies $T_1$ and $T_2$ is a mutual reinforcement learning equilibrium iff $T_a$ is the optimal strategy against $T_{-a}$ for $a = 1, 2$.

For deterministic mutual reinforcement learning equilibria, the following proposition is a direct consequence of Corollary 1.

Proposition 2. For mutual reinforcement learning equilibria formed by deterministic strategies, one of the following two relations holds for each $\sigma^{(-1)} \in \Omega$:
(a) $N_x^{(1)}\left(\sigma^{(-1)}\right) = N_x^{(2)}\left(\sigma^{(-1)}\right)$ for all $x$
(b) $N_x^{(1)}\left(\sigma^{(-1)}\right) = N_{1-x}^{(2)}\left(\sigma^{(-1)}\right)$ for all $x$.

Proof. According to Corollary 1, one of the four cases (a)-(d) holds for the optimal strategy $T_1$ against $T_2$. However, because $T_2$ is also the optimal strategy against $T_1$, the cases (c) and (d) are excluded or integrated into the case (a) or (b).

Furthermore, we introduce symmetric equilibria of mutual reinforcement learning.

Definition 3. A pair of strategies $T_1$ and $T_2$ is a symmetric mutual reinforcement learning equilibrium iff $T_a$ is the optimal strategy against $T_{-a}$ and $T_a$ is the same strategy as $T_{-a}$ for $a = 1, 2$.

It should be noted that the deterministic optimal strategies can be written as
\[
T_a^*\left(\sigma \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = \mathbb{I}\left( Q_a^*\left(\sigma, \sigma^{(-1)}, \sigma^{(-2)}\right) > Q_a^*\left(\bar{\sigma}, \sigma^{(-1)}, \sigma^{(-2)}\right) \right), \tag{15}
\]
where $\mathbb{I}(\cdots)$ is the indicator function that returns 1 when $\cdots$ holds and 0 otherwise. We also introduce the following sets for $a \in \{1, 2\}$ and $\sigma^{(-1)} \in \Omega$:

\[
\tilde{N}_x^{(a)}\left(\sigma^{(-1)}\right) := \left\{ \sigma^{(-2)} \in \Omega \,\middle|\, T_a\left(C \middle| \sigma^{(-1)}, \pi\left(\sigma^{(-2)}\right)\right) = x \right\}, \tag{16}
\]
where $x \in \{0, 1\}$. We now prove the first main result of this paper.

Theorem 1. For symmetric mutual reinforcement learning equilibria formed by deterministic strategies, the following relations must hold:
(a) For $\sigma^{(-1)} \in \{(C,C), (D,D)\}$,
\[
T_a\left(C \middle| \sigma^{(-1)}, (C,D)\right) = T_a\left(C \middle| \sigma^{(-1)}, (D,C)\right) \tag{17}
\]
for all $a$.
(b) For $\sigma^{(-1)} \in \{(C,D), (D,C)\}$,
\[
N_x^{(a)}\left(\pi\left(\sigma^{(-1)}\right)\right) = \tilde{N}_x^{(a)}\left(\sigma^{(-1)}\right) \quad (\forall x) \tag{18}
\]
or
\[
N_x^{(a)}\left(\pi\left(\sigma^{(-1)}\right)\right) = \tilde{N}_{1-x}^{(a)}\left(\sigma^{(-1)}\right) \quad (\forall x) \tag{19}
\]
holds.

Proof. For $\sigma^{(-1)} \in \{(C,C), (D,D)\}$, $\pi\left(\sigma^{(-1)}\right) = \sigma^{(-1)}$ holds. Because $T_1$ and $T_2$ are the same strategies as each other,
\[
T_1\left(C \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) = T_2\left(C \middle| \sigma^{(-1)}, \sigma^{(-2)}\right) \quad \left(\sigma^{(-2)} \in \{(C,C), (D,D)\}\right) \tag{20}
\]
holds. This and Proposition 2 imply that $N_x^{(1)}\left(\sigma^{(-1)}\right) = N_x^{(2)}\left(\sigma^{(-1)}\right)$ $(\forall x \in \{0,1\})$ must hold. On the other hand, due to Eq. (14),
\[
T_1\left(C \middle| \sigma^{(-1)}, (C,D)\right) = T_2\left(C \middle| \sigma^{(-1)}, (D,C)\right) \tag{21}
\]
\[
T_1\left(C \middle| \sigma^{(-1)}, (D,C)\right) = T_2\left(C \middle| \sigma^{(-1)}, (C,D)\right) \tag{22}
\]
also hold. This means that, if $(C,D) \in N_x^{(1)}\left(\sigma^{(-1)}\right)$, then $(D,C) \in N_x^{(2)}\left(\pi\left(\sigma^{(-1)}\right)\right) = N_x^{(2)}\left(\sigma^{(-1)}\right) = N_x^{(1)}\left(\sigma^{(-1)}\right)$, leading to Eq. (17).

For $\sigma^{(-1)} \in \{(C,D), (D,C)\}$, because $T_1$ and $T_2$ are the same strategies as each other,
\[
T_2\left(C \middle| \pi\left(\sigma^{(-1)}\right), \sigma^{(-2)}\right) = T_1\left(C \middle| \sigma^{(-1)}, \pi\left(\sigma^{(-2)}\right)\right) \tag{23}
\]
holds for $\forall \sigma^{(-2)} \in \Omega$. This means that
\[
N_x^{(2)}\left(\pi\left(\sigma^{(-1)}\right)\right) = \tilde{N}_x^{(1)}\left(\sigma^{(-1)}\right) \quad (\forall x \in \{0,1\}) \tag{24}
\]
holds. On the other hand, Proposition 2 implies that
\[
N_x^{(1)}\left(\pi\left(\sigma^{(-1)}\right)\right) = N_x^{(2)}\left(\pi\left(\sigma^{(-1)}\right)\right) \quad (\forall x \in \{0,1\}) \tag{25}
\]
or
\[
N_x^{(1)}\left(\pi\left(\sigma^{(-1)}\right)\right) = N_{1-x}^{(2)}\left(\pi\left(\sigma^{(-1)}\right)\right) \quad (\forall x \in \{0,1\}) \tag{26}
\]
must hold. By combining Eq. (24) with Eq. (25) or (26), we obtain Eq. (18) or (19).

Theorem 1 provides a necessary condition for a deterministic strategy to form a symmetric mutual reinforcement learning equilibrium. In particular, Eqs. (18) and (19) imply that the second row and the third row of $T_a$ cannot be independent of each other. For example, the Tit-for-Tat-anti-Tit-for-Tat (TFT-ATFT) strategy [17]

\[
T_1(C) = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \tag{27}
\]
does not satisfy the condition of Theorem 1, and therefore it cannot form a symmetric mutual reinforcement learning equilibrium. However, there are still many memory-two strategies which satisfy the necessary condition, and further refinement will be needed.
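The necessary condition of Theorem 1 can be checked mechanically for any deterministic strategy written in the matrix representation (9). The following Python sketch (our illustration; the function names are assumptions) encodes conditions (a) and (b) and confirms that TFT-ATFT of Eq. (27) violates them.

\begin{verbatim}
S4 = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]     # ordering of states used in Eq. (9)

def pi(s):
    """Swap the two players' actions in a state, pi((s1, s2)) = (s2, s1)."""
    return (s[1], s[0])

def satisfies_theorem1(M):
    """Check the necessary condition of Theorem 1 for a deterministic strategy.

    M[h1][h2] in {0, 1} is T_a(C | sigma^(-1) = h1, sigma^(-2) = h2).
    """
    # Condition (a): for sigma^(-1) in {(C,C), (D,D)}, the (C,D) and (D,C) columns agree.
    for h1 in (('C', 'C'), ('D', 'D')):
        if M[h1][('C', 'D')] != M[h1][('D', 'C')]:
            return False
    # Condition (b), Eqs. (18)-(19): the row pi(sigma^(-1)) must equal the row sigma^(-1)
    # read at pi-permuted columns, either entrywise or entrywise complemented.
    h1 = ('C', 'D')
    same = all(M[pi(h1)][h2] == M[h1][pi(h2)] for h2 in S4)
    flipped = all(M[pi(h1)][h2] == 1 - M[h1][pi(h2)] for h2 in S4)
    return same or flipped

# TFT-ATFT, Eq. (27): rows are sigma^(-1) and columns are sigma^(-2), both ordered as S4.
tft_atft = [[1, 1, 1, 1], [0, 0, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
M = {h1: dict(zip(S4, row)) for h1, row in zip(S4, tft_atft)}
print(satisfies_theorem1(M))   # False: TFT-ATFT cannot form a symmetric equilibrium
\end{verbatim}

Applying the same check to the two strategies introduced in Section 5 returns True, as it must for any candidate symmetric equilibrium.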

5. Examples of deterministic strategies forming symmetric equilibrium

In this section, we provide two examples of memory-two deterministic strategies forming symmetric mutual reinforcement learning equilibria. For convenience, we define the following sixteen quantities:

\begin{align}
q_1 &:= R + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,C), (C,C)\right) \tag{28}\\
q_2 &:= T + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,C), (C,C)\right) \tag{29}\\
q_3 &:= S + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,D), (C,C)\right) \tag{30}\\
q_4 &:= P + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,D), (C,C)\right) \tag{31}\\
q_5 &:= R + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,C), (C,D)\right) \tag{32}\\
q_6 &:= T + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,C), (C,D)\right) \tag{33}\\
q_7 &:= S + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,D), (C,D)\right) \tag{34}\\
q_8 &:= P + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,D), (C,D)\right) \tag{35}\\
q_9 &:= R + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,C), (D,C)\right) \tag{36}\\
q_{10} &:= T + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,C), (D,C)\right) \tag{37}\\
q_{11} &:= S + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,D), (D,C)\right) \tag{38}\\
q_{12} &:= P + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,D), (D,C)\right) \tag{39}\\
q_{13} &:= R + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,C), (D,D)\right) \tag{40}\\
q_{14} &:= T + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,C), (D,D)\right) \tag{41}\\
q_{15} &:= S + \gamma \max_{\sigma} Q_1^*\left(\sigma, (C,D), (D,D)\right) \tag{42}\\
q_{16} &:= P + \gamma \max_{\sigma} Q_1^*\left(\sigma, (D,D), (D,D)\right) \tag{43}
\end{align}
The Bellman optimality equation for symmetric equilibrium is

\[
Q_1^*\left(\sigma_1, \sigma^{(-1)}, \sigma^{(-2)}\right)
= \sum_{\sigma_2} \left[ r_1(\sigma) + \gamma \max_{\hat{\sigma}} Q_1^*\left(\hat{\sigma}, \sigma, \sigma^{(-1)}\right) \right]
\mathbb{I}\left( Q_1^*\left(\sigma_2, \pi\left(\sigma^{(-1)}\right), \pi\left(\sigma^{(-2)}\right)\right) > Q_1^*\left(\bar{\sigma}_2, \pi\left(\sigma^{(-1)}\right), \pi\left(\sigma^{(-2)}\right)\right) \right). \tag{44}
\]

We want to find solutions of this equation. The first candidate of the solution of Eq. (44) is

\[
T_1(C) = T_2(C) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{45}
\]

1 11 1 0 00 0 T (C) = (46) 1  0 00 0   0 00 0      but uses only information at the second last state, the strategy (45) can be called as retarded Grim strategy. Theorem 2. A pair of the strategy (45) forms a symmetric mutual reinforce- T −R ment learning equilibrium if γ > T −P . q Proof. The Bellman optimality equation against the strategy (45) is

\begin{align}
Q_1^*\left(C, (C,C), (C,C)\right) &= q_1 \tag{47}\\
Q_1^*\left(D, (C,C), (C,C)\right) &= q_2 \tag{48}\\
Q_1^*\left(C, (C,C), \sigma^{(-2)}\right) &= q_3 \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{49}\\
Q_1^*\left(D, (C,C), \sigma^{(-2)}\right) &= q_4 \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{50}\\
Q_1^*\left(C, (C,D), (C,C)\right) &= q_5 \tag{51}\\
Q_1^*\left(D, (C,D), (C,C)\right) &= q_6 \tag{52}\\
Q_1^*\left(C, (C,D), \sigma^{(-2)}\right) &= q_7 \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{53}\\
Q_1^*\left(D, (C,D), \sigma^{(-2)}\right) &= q_8 \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{54}\\
Q_1^*\left(C, (D,C), (C,C)\right) &= q_9 \tag{55}\\
Q_1^*\left(D, (D,C), (C,C)\right) &= q_{10} \tag{56}\\
Q_1^*\left(C, (D,C), \sigma^{(-2)}\right) &= q_{11} \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{57}\\
Q_1^*\left(D, (D,C), \sigma^{(-2)}\right) &= q_{12} \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{58}\\
Q_1^*\left(C, (D,D), (C,C)\right) &= q_{13} \tag{59}\\
Q_1^*\left(D, (D,D), (C,C)\right) &= q_{14} \tag{60}\\
Q_1^*\left(C, (D,D), \sigma^{(-2)}\right) &= q_{15} \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{61}\\
Q_1^*\left(D, (D,D), \sigma^{(-2)}\right) &= q_{16} \quad \left(\sigma^{(-2)} \neq (C,C)\right) \tag{62}
\end{align}
with the self-consistency condition

\[
q_1 > q_2, \quad q_3 < q_4, \quad q_5 > q_6, \quad q_7 < q_8, \quad q_9 > q_{10}, \quad q_{11} < q_{12}, \quad q_{13} > q_{14}, \quad q_{15} < q_{16}. \tag{63}
\]
The solution is
\begin{align}
q_1 &= \frac{1}{1-\gamma} R \tag{64}\\
q_2 &= T + \frac{\gamma}{1-\gamma^2} R + \frac{\gamma^2}{1-\gamma^2} P \tag{65}\\
q_3 &= S + \frac{\gamma}{1-\gamma^2} R + \frac{\gamma^2}{1-\gamma^2} P \tag{66}\\
q_4 &= \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} R \tag{67}\\
q_5 &= q_9 = q_{13} = \frac{1}{1-\gamma^2} R + \frac{\gamma}{1-\gamma^2} P \tag{68}\\
q_6 &= q_{10} = q_{14} = T + \frac{\gamma}{1-\gamma} P \tag{69}\\
q_7 &= q_{11} = q_{15} = S + \frac{\gamma}{1-\gamma} P \tag{70}\\
q_8 &= q_{12} = q_{16} = \frac{1}{1-\gamma} P. \tag{71}
\end{align}
For this solution, the inequalities (63) are satisfied if

\[
\gamma > \sqrt{\frac{T-R}{T-P}}. \tag{72}
\]

We remark that the condition (72) is stricter than the condition under which Grim forms a symmetric equilibrium [15]: $\gamma > \frac{T-R}{T-P}$.

The second candidate of the solution of Eq. (44) is

\[
T_1(C) = T_2(C) = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}. \tag{73}
\]

1 11 1 0 00 0 T (C) = (74) 1  0 00 0   1 11 1      but uses only information at the second last state, the strategy (73) can be called as retarded WSLS strategy. Theorem 3. A pair of the strategy (73) forms a symmetric mutual reinforce- T −R ment learning equilibrium if γ > R−P . q 11 Proof. The Bellman optimality equation against the strategy (73) is

\begin{align}
Q_1^*\left(C, (C,C), \sigma^{(-2)}\right) &= q_1 \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{75}\\
Q_1^*\left(D, (C,C), \sigma^{(-2)}\right) &= q_2 \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{76}\\
Q_1^*\left(C, (C,C), \sigma^{(-2)}\right) &= q_3 \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{77}\\
Q_1^*\left(D, (C,C), \sigma^{(-2)}\right) &= q_4 \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{78}\\
Q_1^*\left(C, (C,D), \sigma^{(-2)}\right) &= q_5 \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{79}\\
Q_1^*\left(D, (C,D), \sigma^{(-2)}\right) &= q_6 \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{80}\\
Q_1^*\left(C, (C,D), \sigma^{(-2)}\right) &= q_7 \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{81}\\
Q_1^*\left(D, (C,D), \sigma^{(-2)}\right) &= q_8 \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{82}\\
Q_1^*\left(C, (D,C), \sigma^{(-2)}\right) &= q_9 \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{83}\\
Q_1^*\left(D, (D,C), \sigma^{(-2)}\right) &= q_{10} \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{84}\\
Q_1^*\left(C, (D,C), \sigma^{(-2)}\right) &= q_{11} \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{85}\\
Q_1^*\left(D, (D,C), \sigma^{(-2)}\right) &= q_{12} \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{86}\\
Q_1^*\left(C, (D,D), \sigma^{(-2)}\right) &= q_{13} \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{87}\\
Q_1^*\left(D, (D,D), \sigma^{(-2)}\right) &= q_{14} \quad \left(\sigma^{(-2)} \in \{(C,C),(D,D)\}\right) \tag{88}\\
Q_1^*\left(C, (D,D), \sigma^{(-2)}\right) &= q_{15} \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{89}\\
Q_1^*\left(D, (D,D), \sigma^{(-2)}\right) &= q_{16} \quad \left(\sigma^{(-2)} \in \{(C,D),(D,C)\}\right) \tag{90}
\end{align}
with the self-consistency condition

\[
q_1 > q_2, \quad q_3 < q_4, \quad q_5 > q_6, \quad q_7 < q_8, \quad q_9 > q_{10}, \quad q_{11} < q_{12}, \quad q_{13} > q_{14}, \quad q_{15} < q_{16}. \tag{91}
\]
The solution is
\begin{align}
q_1 &= q_{13} = \frac{1}{1-\gamma} R \tag{92}\\
q_2 &= q_{14} = T + \gamma R + \gamma^2 P + \frac{\gamma^3}{1-\gamma} R \tag{93}\\
q_3 &= q_{15} = S + \gamma R + \gamma^2 P + \frac{\gamma^3}{1-\gamma} R \tag{94}\\
q_4 &= q_{16} = P + \frac{\gamma}{1-\gamma} R \tag{95}\\
q_5 &= q_9 = R + \gamma P + \frac{\gamma^2}{1-\gamma} R \tag{96}\\
q_6 &= q_{10} = T + \gamma P + \gamma^2 P + \frac{\gamma^3}{1-\gamma} R \tag{97}\\
q_7 &= q_{11} = S + \gamma P + \gamma^2 P + \frac{\gamma^3}{1-\gamma} R \tag{98}\\
q_8 &= q_{12} = P + \gamma P + \frac{\gamma^2}{1-\gamma} R. \tag{99}
\end{align}
For this solution, the inequalities (91) are satisfied if

\[
\gamma > \sqrt{\frac{T-R}{R-P}}. \tag{100}
\]

We remark that the condition (100) is stricter than the condition under which WSLS forms a symmetric equilibrium [15]: $\gamma > \frac{T-R}{R-P}$.
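The closed-form solutions (64)-(71) and (92)-(99) and the thresholds (72) and (100) can be verified numerically. The following Python sketch (our illustration) evaluates the sixteen q's for both candidates under the assumed payoffs $(T, R, P, S) = (5, 4, 1, 0)$, chosen so that both thresholds lie below 1, and checks that the self-consistency inequalities (63) and (91) hold exactly when $\gamma$ exceeds the corresponding threshold; every line printed should read True.

\begin{verbatim}
import math

# Assumed payoffs with T > R > P > S and 2R > T + S; chosen so that both
# thresholds sqrt((T-R)/(T-P)) and sqrt((T-R)/(R-P)) are below 1.
T_, R, P, S = 5, 4, 1, 0

def grim_q(g):
    """q_1,...,q_16 of Theorem 2 (retarded Grim), Eqs. (64)-(71); index 0 unused."""
    q = [0.0] * 17
    q[1] = R / (1 - g)
    q[2] = T_ + g * R / (1 - g**2) + g**2 * P / (1 - g**2)
    q[3] = S + g * R / (1 - g**2) + g**2 * P / (1 - g**2)
    q[4] = P / (1 - g**2) + g * R / (1 - g**2)
    q[5] = q[9] = q[13] = R / (1 - g**2) + g * P / (1 - g**2)
    q[6] = q[10] = q[14] = T_ + g * P / (1 - g)
    q[7] = q[11] = q[15] = S + g * P / (1 - g)
    q[8] = q[12] = q[16] = P / (1 - g)
    return q

def wsls_q(g):
    """q_1,...,q_16 of Theorem 3 (retarded WSLS), Eqs. (92)-(99); index 0 unused."""
    q = [0.0] * 17
    q[1] = q[13] = R / (1 - g)
    q[2] = q[14] = T_ + g * R + g**2 * P + g**3 * R / (1 - g)
    q[3] = q[15] = S + g * R + g**2 * P + g**3 * R / (1 - g)
    q[4] = q[16] = P + g * R / (1 - g)
    q[5] = q[9] = R + g * P + g**2 * R / (1 - g)
    q[6] = q[10] = T_ + g * P + g**2 * P + g**3 * R / (1 - g)
    q[7] = q[11] = S + g * P + g**2 * P + g**3 * R / (1 - g)
    q[8] = q[12] = P + g * P + g**2 * R / (1 - g)
    return q

def self_consistent(q):
    """The inequalities (63) / (91)."""
    return (all(q[i] > q[i + 1] for i in (1, 5, 9, 13))
            and all(q[i] < q[i + 1] for i in (3, 7, 11, 15)))

grim_threshold = math.sqrt((T_ - R) / (T_ - P))    # Eq. (72)
wsls_threshold = math.sqrt((T_ - R) / (R - P))     # Eq. (100)
for gamma in (0.40, 0.55, 0.65, 0.80, 0.95):
    print(gamma,
          self_consistent(grim_q(gamma)) == (gamma > grim_threshold),
          self_consistent(wsls_q(gamma)) == (gamma > wsls_threshold))
\end{verbatim}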

6. Optimality in longer memory

In the previous sections, we investigated symmetric equilibria of mutual reinforcement learning when both players use memory-two strategies, and obtained two examples of deterministic strategies forming symmetric equilibria. A natural question is "Do the strategies forming symmetric equilibria in memory-two reinforcement learning also form symmetric equilibria of mutual reinforcement learning with longer-memory strategies?". In this section, we show that the answer to this question is "yes". We first prove the following theorem.

Theorem 4. Let $T_{-a}$ be a memory-$n'$ strategy of player $-a$. Let $T_a^*$ be the optimal strategy of player $a$ against $T_{-a}$ when player $a$ uses reinforcement learning of memory-$n'$ strategies. When player $a$ uses reinforcement learning of memory-$n$ strategies with $n > n'$ to obtain the optimal strategy $\check{T}_a^*$ against $T_{-a}$, then $\check{T}_a^* = T_a^*$.

Proof. When player $-a$ uses a memory-$n'$ strategy and player $a$ uses memory-$n$ reinforcement learning with $n > n'$, the Bellman optimality equation (5) becomes
\[
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n'}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n'}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right). \tag{101}
\]
Then, we find that the right-hand side does not depend on $\sigma^{(-n)}$, and therefore
\[
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) = Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right). \tag{102}
\]
Then, the Bellman optimality equation becomes
\[
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right)
= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n'}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n'}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-2}\right). \tag{103}
\]
By repeating the same argument until the length of memory decreases to $n'$, we find that
\[
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) = Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n'}\right), \tag{104}
\]
which implies that $\check{T}_a^* = T_a^*$.

This theorem results in the following corollary.

Corollary 2. A mutual reinforcement learning equilibrium obtained by memory-$n'$ reinforcement learning is also a mutual reinforcement learning equilibrium obtained by memory-$n$ reinforcement learning with $n > n'$.

Therefore, the strategies (45) and (73) in the previous section also form mutual reinforcement learning equilibria even if players use memory-$n$ reinforcement learning with $n > 2$. Similarly, the (memory-one) Grim strategy and the (memory-one) WSLS strategy still form mutual reinforcement learning equilibria even if players use memory-two reinforcement learning, since it has been known that Grim and WSLS each form a memory-one mutual reinforcement learning equilibrium [15].
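Theorem 4 can also be observed numerically: when the opponent uses only a shorter memory, the optimal action values computed with a longer memory do not depend on the additional history entries. The following Python sketch (our illustration, with assumed payoffs) iterates Eq. (101) for a memory-one opponent and a memory-two learner and checks the collapse (102).

\begin{verbatim}
import itertools

# Assumed payoffs; player 1 learns with memory two against a memory-one opponent.
T_, R, P, S = 5, 3, 1, 0
r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T_, ('D', 'D'): P}
A = ('C', 'D')
STATES = list(itertools.product(A, A))
HIST2 = list(itertools.product(STATES, STATES))            # (sigma^(-1), sigma^(-2))

def q_memory2(opponent_mem1, gamma=0.9, sweeps=500):
    """Synchronous value iteration for Eq. (101); opponent_mem1(s) = T_2(C | sigma^(-1) = s)."""
    Q = {(a, h): 0.0 for a in A for h in HIST2}
    for _ in range(sweeps):
        Q = {(a1, h): sum(p * (r1[(a1, a2)]
                               + gamma * max(Q[(b, ((a1, a2), h[0]))] for b in A))
                          for a2, p in (('C', opponent_mem1(h[0])),
                                        ('D', 1.0 - opponent_mem1(h[0]))))
             for a1 in A for h in HIST2}
    return Q

# Memory-one WSLS opponent: cooperate iff the last state was (C,C) or (D,D).
Q = q_memory2(lambda s: 1.0 if s in (('C', 'C'), ('D', 'D')) else 0.0)

# Eq. (102): Q*(sigma_a, (sigma^(-1), sigma^(-2))) does not depend on sigma^(-2).
for a in A:
    for h1 in STATES:
        values = [Q[(a, (h1, h2))] for h2 in STATES]
        assert max(values) - min(values) < 1e-9
print("The optimal Q computed with memory two depends only on sigma^(-1), as in Theorem 4.")
\end{verbatim}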

7. Conclusion

In this paper, we investigated symmetric equilibria of mutual reinforcement learning when both players use memory-two deterministic strategies in the repeated prisoner's dilemma game. First, we found that the structure of the optimal strategies is constrained by the Bellman optimality equation (Proposition 1). Then, we found the necessary condition for deterministic symmetric equilibria of mutual reinforcement learning (Theorem 1). Furthermore, we provided two examples of memory-two deterministic strategies which form symmetric mutual reinforcement learning equilibria; they can be regarded as variants of the Grim strategy and the WSLS strategy (Theorem 2 and Theorem 3). Finally, we proved that mutual reinforcement learning equilibria achieved by memory-two strategies are also mutual reinforcement learning equilibria when both players use reinforcement learning of memory-$n$ strategies with $n > 2$ (Theorem 4).

We want to investigate whether other symmetric mutual reinforcement learning equilibria of deterministic memory-two strategies exist or not in the future. Furthermore, extension of our analysis to mixed strategies is also a subject of future work.

Acknowledgement

This study was supported by JSPS KAKENHI Grant Number JP20K19884.

Appendix A. Derivation of Eq. (4)

First we introduce the notation

\[
T\left(\sigma \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) := \prod_{a=1}^{2} T_a\left(\sigma_a \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right). \tag{A.1}
\]
We remark that the joint probability distribution of the states $\sigma^{(\mu)}, \cdots, \sigma^{(0)}$ given $\left[\sigma^{(-m)}\right]_{m=1}^{n}$ is described as
\[
P\left(\sigma^{(\mu)}, \cdots, \sigma^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) = \prod_{s=0}^{\mu} T\left(\sigma^{(s)} \middle| \left[\sigma^{(s-m)}\right]_{m=1}^{n}\right). \tag{A.2}
\]

The action-value function (3) is rewritten as

\[
\begin{aligned}
Q_a\left(\sigma_a^{(0)}, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
&= \sum_{\left[\sigma^{(s)}\right]_{s=1}^{\infty}} \sum_{\sigma_{-a}^{(0)}} \sum_{k=0}^{\infty} \gamma^k r_a\left(\sigma^{(k)}\right) \prod_{s=1}^{\infty} T\left(\sigma^{(s)} \middle| \left[\sigma^{(s-m)}\right]_{m=1}^{n}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \\
&= \sum_{\left[\sigma^{(s)}\right]_{s=1}^{\infty}} \sum_{\sigma_{-a}^{(0)}} \left[ r_a\left(\sigma^{(0)}\right) + \gamma \sum_{k=0}^{\infty} \gamma^k r_a\left(\sigma^{(k+1)}\right) \right] \prod_{s=1}^{\infty} T\left(\sigma^{(s)} \middle| \left[\sigma^{(s-m)}\right]_{m=1}^{n}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \\
&= \sum_{\sigma_{-a}^{(0)}} r_a\left(\sigma^{(0)}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
+ \gamma \sum_{\left[\sigma^{(s)}\right]_{s=1}^{\infty}} \sum_{\sigma_{-a}^{(0)}} \sum_{k=0}^{\infty} \gamma^k r_a\left(\sigma^{(k+1)}\right) \prod_{s=2}^{\infty} T\left(\sigma^{(s)} \middle| \left[\sigma^{(s-m)}\right]_{m=1}^{n}\right) \\
&\qquad \times T_{-a}\left(\sigma_{-a}^{(1)} \middle| \left[\sigma^{(1-m)}\right]_{m=1}^{n}\right) T_a\left(\sigma_a^{(1)} \middle| \left[\sigma^{(1-m)}\right]_{m=1}^{n}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \\
&= \sum_{\sigma_{-a}^{(0)}} r_a\left(\sigma^{(0)}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
+ \gamma \sum_{\sigma_a^{(1)}} \sum_{\sigma_{-a}^{(0)}} Q_a\left(\sigma_a^{(1)}, \left[\sigma^{(1-m)}\right]_{m=1}^{n}\right) T_a\left(\sigma_a^{(1)} \middle| \left[\sigma^{(1-m)}\right]_{m=1}^{n}\right) T_{-a}\left(\sigma_{-a}^{(0)} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right),
\end{aligned} \tag{A.3}
\]
which is Eq. (4).
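As a sanity check of the recursion (4) derived above (this check is our addition), one can compare the definition (3) with its one-step decomposition directly. The following Python sketch evaluates $Q_1$ by a truncated rollout of a deterministic strategy pair — the retarded Grim pair of Eq. (45), under assumed payoffs — and verifies that Eq. (4) holds at every memory-two history up to truncation error.

\begin{verbatim}
import itertools

# Assumed payoffs; both players use the deterministic "retarded Grim" rule of Eq. (45).
T_, R, P, S = 5, 3, 1, 0
r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T_, ('D', 'D'): P}
A = ('C', 'D')
STATES = list(itertools.product(A, A))
HIST = list(itertools.product(STATES, STATES))
gamma = 0.9

def strat(h):
    """Retarded Grim for player 1: cooperate iff sigma^(-2) was mutual cooperation."""
    return 'C' if h[1] == ('C', 'C') else 'D'

def pi(s):
    """Swap the players' roles in a state; player 2 applies strat to the swapped history."""
    return (s[1], s[0])

def q_def(a1, h, horizon=400):
    """Q_1 from definition (3): truncated discounted payoff sum, first action forced to a1."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        a2 = strat((pi(h[0]), pi(h[1])))
        total += disc * r1[(a1, a2)]
        disc *= gamma
        h = ((a1, a2), h[0])
        a1 = strat(h)
    return total

# Eq. (4) for deterministic strategies: Q(a1, h) = r_1(sigma) + gamma * Q(next action, next history).
for h in HIST:
    for a1 in A:
        a2 = strat((pi(h[0]), pi(h[1])))
        next_h = ((a1, a2), h[0])
        assert abs(q_def(a1, h) - (r1[(a1, a2)] + gamma * q_def(strat(next_h), next_h))) < 1e-6
print("Eq. (4) holds along the play of the retarded Grim pair (up to truncation).")
\end{verbatim}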

Appendix B. Derivation of Eqs. (5) and (6)

We define $Q_a^*$ as the optimal value of $Q_a$, which is obtained by choosing the optimal policy $T_a^*$. Then, $Q_a^*$ obeys
\[
\begin{aligned}
Q_a^*\left(\sigma_a, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
&= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \\
&\quad + \gamma \sum_{\sigma_a'} \sum_{\sigma_{-a}} T_a^*\left(\sigma_a' \middle| \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right) T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) Q_a^*\left(\sigma_a', \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right) \\
&\leq \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \\
&\quad + \gamma \sum_{\sigma_a'} \sum_{\sigma_{-a}} T_a^*\left(\sigma_a' \middle| \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right) T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right) \\
&= \sum_{\sigma_{-a}} r_a(\sigma)\, T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right)
+ \gamma \sum_{\sigma_{-a}} T_{-a}\left(\sigma_{-a} \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) \max_{\hat{\sigma}} Q_a^*\left(\hat{\sigma}, \sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n-1}\right).
\end{aligned} \tag{B.1}
\]
The equality in the third line holds when
\[
\mathrm{supp}\, T_a^*\left(\cdot \middle| \left[\sigma^{(-m)}\right]_{m=1}^{n}\right) = \mathop{\mathrm{argmax}}_{\sigma} Q_a^*\left(\sigma, \left[\sigma^{(-m)}\right]_{m=1}^{n}\right), \tag{B.2}
\]
which is Eq. (6).

References

[1] E. Kalai, E. Lehrer, Rational learning leads to Nash equilibrium, Econometrica: Journal of the Econometric Society (1993) 1019–1045.
[2] D. Fudenberg, D. K. Levine, Steady state learning and Nash equilibrium, Econometrica: Journal of the Econometric Society (1993) 547–573.
[3] S. Hart, A. Mas-Colell, A simple adaptive procedure leading to correlated equilibrium, Econometrica 68 (5) (2000) 1127–1150.
[4] Y. Sato, E. Akiyama, J. D. Farmer, Chaos in learning a simple two-person game, Proceedings of the National Academy of Sciences 99 (7) (2002) 4748–4751.
[5] T. Galla, J. D. Farmer, Complex dynamics in learning complicated games, Proceedings of the National Academy of Sciences 110 (4) (2013) 1232–1236.

[6] R. Nagel, Unraveling in guessing games: An experimental study, The American Economic Review 85 (5) (1995) 1313–1326.
[7] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT Press, 2018.
[8] A. Rapoport, Optimal policies for the prisoner's dilemma, Psychological Review 74 (2) (1967) 136.
[9] T. W. Sandholm, R. H. Crites, Multiagent reinforcement learning in the iterated prisoner's dilemma, Biosystems 37 (1-2) (1996) 147–166.
[10] J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (Nov) (2003) 1039–1069.
[11] M. Harper, V. Knight, M. Jones, G. Koutsovoulos, N. E. Glynatsi, O. Campbell, Reinforcement learning produces dominant strategies for the iterated prisoner's dilemma, PLoS One 12 (12) (2017) e0188046.
[12] W. Barfuss, J. F. Donges, J. Kurths, Deterministic limit of temporal difference reinforcement learning for stochastic games, Physical Review E 99 (4) (2019) 043305.
[13] L. Busoniu, R. Babuska, B. De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2) (2008) 156–172.
[14] A. Rapoport, A. M. Chammah, C. J. Orwant, Prisoner's dilemma: A study in conflict and cooperation, Vol. 165, University of Michigan Press, 1965.
[15] Y. Usui, M. Ueda, Symmetric equilibrium of multi-agent reinforcement learning in repeated prisoner's dilemma, Applied Mathematics and Computation 409 (2021) 126370.
[16] J. Li, G. Kendall, The effect of memory size on the evolutionary stability of strategies in iterated prisoner's dilemma, IEEE Transactions on Evolutionary Computation 18 (6) (2013) 819–826.
[17] S. D. Yi, S. K. Baek, J.-K. Choi, Combination with anti-tit-for-tat remedies problems of tit-for-tat, Journal of Theoretical Biology 412 (2017) 1–7.
[18] C. Hilbe, L. A. Martinez-Vaquero, K. Chatterjee, M. A. Nowak, Memory-n strategies of direct reciprocity, Proceedings of the National Academy of Sciences 114 (18) (2017) 4715–4720.

[19] Y. Murase, S. K. Baek, Seven rules to avoid the tragedy of the commons in direct reciprocity, Journal of Theoretical Biology 449 (2018) 94–102.
[20] Y. Murase, S. K. Baek, Five rules for friendly rivalry in direct reciprocity, Scientific Reports 10 (2020) 16904.

[21] M. Ueda, Memory-two zero-determinant strategies in repeated games, Royal Society Open Science 8 (5) (2021) 202186.
[22] J. W. Friedman, A non-cooperative equilibrium for supergames, The Review of Economic Studies 38 (1) (1971) 1–12.
[23] M. Nowak, K. Sigmund, A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game, Nature 364 (6432) (1993) 56–58.
