
Modeling Replicator Dynamics in Stochastic Games Using Markov Chain Method

Chuang Deng, Department of Automation, Shanghai Jiao Tong University, [email protected]
Zhihai Rong, School of Computer Science and Engineering, University of Electronic Science and Technology of China, [email protected]
Lin Wang, Department of Automation, Shanghai Jiao Tong University, [email protected]
Xiaofan Wang, School of Mechatronic Engineering and Automation, Shanghai University, [email protected]

ABSTRACT
In stochastic games, individuals need to make decisions in multiple states, and transitions between states significantly influence the dynamics of strategies. In this work, by describing the dynamic process in a stochastic game as a Markov chain and utilizing the transition matrix, we introduce a new method, named state-transition replicator dynamics, to obtain the replicator dynamics of a stochastic game. Based on our proposed model, we can gain qualitative and detailed insights into the influence of transition probabilities on the dynamics of strategies. We illustrate that a set of unbalanced transition probabilities can help players overcome social dilemmas and lead to mutual cooperation in a cooperation back state, even if the stochastic game poses the same social dilemma in each state. Moreover, we also present that a set of specifically designed transition probabilities can fix the expected payoffs of one player and make him lose the motivation to update his strategies in the stochastic game.

KEYWORDS
Multi-agent Learning; Evolutionary Game Theory; Replicator Dynamics; Stochastic Games

ACM Reference Format:
Chuang Deng, Zhihai Rong, Lin Wang, and Xiaofan Wang. 2021. Modeling Replicator Dynamics in Stochastic Games Using Markov Chain Method. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 9 pages.

1 INTRODUCTION
How to drive and reinforce cooperative behaviors in social and economic systems is a long-standing question [4, 17]. It has been extensively explored by analyzing stylized non-cooperative game theory models with feedback mechanisms using evolutionary game theory [14, 16, 18–20]. Besides evolutionary game theory, another method used to study the dynamics of strategies is multi-agent reinforcement learning [1, 3, 12]. These works mainly derive and design multi-agent reinforcement learning algorithms which can lead the strategies of players to the equilibria existing in the game.

To gain qualitative and detailed insights into the learning process in repeated games, many algorithms and models have been proposed to build a bridge between evolutionary game theory and multi-agent reinforcement learning, such as FAQ-learning [10], regret minimization [11], continuous replicator dynamics [5], IGA [21], IGA-WOLF [2] and so on. Beyond the assumption that individuals are always in the same state, how to describe strategy dynamics with the reciprocity of multiple states and separate per-state strategies has also attracted the attention of researchers. Vrancx et al. first investigated this problem and combined replicator dynamics with piecewise dynamics to model the learning behavior of agents in stochastic games [25]. Hennes et al. proposed a method called state-coupled replicator dynamics which couples the multiple states directly [8]. The algorithms derived from the combination of evolutionary game theory and multi-agent reinforcement learning have been applied in multiple research fields, including femtocell power allocation [27], interdomain routing price setting [24], the design of social agents [6] and the design of new multi-agent reinforcement learning algorithms [7].

In this paper, by describing the dynamic process in a stochastic game as a Markov chain [9, 22], we propose a new method, named state-transition replicator dynamics, to derive a set of ordinary differential equations that model the learning behavior of players in stochastic games. Moreover, based on the transition matrix of the Markov chain and the derived replicator dynamics system, we demonstrate that, although players face the same social dilemma in all states, a set of unbalanced transition probabilities between states can lead to cooperation in the cooperation back state, the state players have a high probability of entering after mutual cooperation. Besides, we also point out that in stochastic games there exist several sets of specific transition probabilities which can control the expected payoffs of players. For a stateless matrix game, Press et al. [15] prove that if players' decisions depend on the previous actions of the players participating in the game, one player can unilaterally set the payoff of the other player. In stochastic games, transition probabilities can take over this role. Utilizing the analysis method for stateless matrix games [13, 15, 23] and the state-transition replicator dynamics model, we present that a set of specifically derived transition probabilities can fix the expected payoffs of one player, so that this player loses the motivation to update his strategies in the stochastic game.


" �!! (� ,�) 2 BACKGROUND state 1 Agent Q state 2 Agent Q coop. defect defect 2.1 Stochastic games " coop. �!" (� ,�) P Stochastic games describe how much rewards players obtain based coop. � , � � ,� P # coop. �, � � ,� �!" (� ,�)

on their strategies in multiple states. The concept of state in sto- gent gent A defect

� ,� � ,� A chastic games refers to the environment where players interact defect �,� �,� # with each other. The current state players being in and the joint �!! (� ,�) action players taking not only determine the immediate payoffs players can receive in this round, but also the state they will stay Figure 1: The framework of a two-state two-player two- in the next round. To define a stochastic game, we need to spec- action stochastic game. There are two players, 푃 and 푄. In ify the set of players, the set of states, the set of actions that each every round, they can be in one state, 푠1 or 푠2. In each state, player can take in each state, a transition function that describes each player has a corresponding strategy and payoff matrix. how states change over time and the payoff function which de- The state in the next round depends on the current state, on scribes how players’ actions in each state affect the players’ pay- the players’ joint action and on chance, 푧푠′ (푠, 풂). offs. We follow the definition in [8] to give a definition of stochastic games. The game퐺 = ⟨푛, 푺, 훀, 풛, 휏, 휋1, ··· , 휋푛⟩ is a stochastic game with 푘 푠 푠 푛 players and 푘 states. In each state 푠 ∈ 푺 = (푠1, ··· , 푠 ), every 퐴 (퐵 ) determines the payoffs for player 푃 (푄) in state 푠. Player 푃 휏 (푠, 풂) 푎푠 푄 휏 (푠, 풂) 푏푠 player 푖 has an action set 훀푖 (푠) and strategy 휋푖 (푠). In each round, receives 푃 = 푎푃 ,푎푄 while player gets 푄 = 푎푃 ,푎푄 every player 푖 stays in one sate 푠 and chooses an action 푎푖 from for a given joint action 풂 = (푎푃 , 푎푄 ). Player 푃 has strategies 풑1 = action set 훀푖 (푠) according to strategy 휋푖 (푠). The payoff function (푝1, − 푝1)T 풑2 (푝2, − 푝2)T 푠1 푠2 Î푛 푛 1 and = 1 in state and , respec- 휏 (푠, 풂) 훀 (푠) ↦→ 푅 풂 (푎 , ··· , 푎 ) 1 2 : 푖=1 푖 maps the joint action = 1 푛 tively. Player 푄 also has two strategies, 풒 and 풒 . Since there are to an immediateÎ payoff value for each player. The transition func- only two states, the transition probabilities have the relationship, 푧(푠, 풂) 푛 훀 (푠) ↦→ Δ푘−1 1 1 2 2 tion : 푖=1 푖 determines the probabilistic 푧푠1 (푠 , 풂) = 1 − 푧푠2 (푠 , 풂) and 푧푠2 (푠 , 풂) = 1 − 푧푠1 (푠 , 풂). The 푘− state transition, where Δ 1 is the (푘 − 1)-simplex and 푧푠′ (푠, 풂) is framework of a two-state two-player two-action stochastic game ′ the transition probability from state 푠 to 푠 under joint action 풂. is shown in Figure 1. In this work, we also follow the restriction proposed in [8] that all states 푠 ∈ 푺 are in an ergodic set. This restriction ensures that 2.3 Learning automata the game has no absorbing states. In this section, we give a description on how player can update his strategy with multi-agent reinforcement learning method [26]. A 2.2 Two-state two-player two-action stochastic player can be seen as a learning automata, who updates his strat- games egy by monitoring the reinforcement signal resulted from the ac- Here, we introduce the simplest version of stochastic games: two- tion he has chosen in this round. An automata wants to learn the state two-player two-action stochastic game. optimal strategy and maximizes the expected payoffs. In games containing two players, player 푃 (푄) has a strategy 풑 In this paper, we focus on finite action-set learning automata (풒). Each player chooses between only two actions, cooperation (푐) (FALA) [8]. At the beginning of a round, every player draws a and defection (푑). Thus, we can use a probability to fully define one random action 푎(푡) from his action set according to his strategy player’s strategy. For player 푃, the parameter 푝 means the proba- 흅 (푡). 
Based on the action 푎(푡), the environment responds with a re- bility of cooperation and 풑 = (푝, 1 − 푝)T. Similarly, for player 푄, ward 휏 (푡). The automata uses this reward to update his strategy to 풒 = (푞, 1 − 푞)T. The payoff function can be represented byabi- 흅 (푡 +1). The update rule for FALA using the linear reward-inaction matrix (퐴, 퐵), that gives the payoffs for the row player푃 ( ) in 퐴, (퐿푅−퐼 ) scheme is given below, and the column player (푄) in 퐵, which is shown as follow, ( ∗   훼휏 (푡) (1 − 휋푎∗ (푡)) if 푎(푡) = 푎 휋푎∗ (푡 + 1) = 휋푎∗ (푡) + (3) 푎푐푐,푏푐푐 푎푐푑,푏푐푑 − 훼휏 (푡)휋 ∗ (푡) , (퐴, 퐵) = . (1) 푎 otherwise 푎푑푐,푏푑푐 푎푑푑,푏푑푑 ∗ where 휏 ∈ [0, 1]. 휋푎∗ means the probability of taking action 푎 . The The of players’ joint action determines the payoffs toboth parameter 훼 ∈ [0, 1] determines the learning rate. players. It follows the order of actions of player 푃 and 푄, 풂 = In multi-state games, a player associates a dedicated learning (푎푃 , 푎푄 ), where 푎푃 (푎푄 ) is the action taken by player 푃 (푄). automata to each state of the game. One LA tries to optimize the When a stochastic game has two states, 푠1 and 푠2, the payoff strategy in one state using the standard update rule given in Equa- matrices are given by tion (3). Only a single LA is active and selects an action in each   푎1 ,푏1 푎1 ,푏1 round of the game. However, the immediate reward from the en- (퐴1, 퐵1) 푐푐 푐푐 푐푑 푐푑 , = 푎1 ,푏1 푎1 ,푏1 푟 vironment is not directly fed back to this LA. Instead, when the 푑푐 푑푐 푑푑 푑푑   (2) LA becomes active again, i.e., next time the same state is played, 푎2 ,푏2 푎2 ,푏2 (퐴2, 퐵2) 푐푐 푐푐 푐푑 푐푑 . it is informed about the cumulative reward gathered since the last = 푎2 ,푏2 푎2 ,푏2 푑푐 푑푐 푑푑 푑푑 activation and the time that has passed by [8].
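To make the $L_{R-I}$ update in Equation (3) concrete, the following minimal sketch implements it in Python. The class name `FALA` and its interface are our own illustrative choices, not from the original paper; the update itself follows Equation (3), and the reward passed to `update` is assumed to be scaled into $[0, 1]$ as required there.

```python
import numpy as np

class FALA:
    """Finite action-set learning automaton with the linear
    reward-inaction (L_{R-I}) update of Equation (3)."""

    def __init__(self, n_actions, alpha=0.05, rng=None):
        self.pi = np.full(n_actions, 1.0 / n_actions)  # strategy pi(t)
        self.alpha = alpha                              # learning rate
        self.rng = rng or np.random.default_rng()

    def act(self):
        # Draw an action according to the current strategy.
        return self.rng.choice(len(self.pi), p=self.pi)

    def update(self, action, reward):
        # reward is assumed to be scaled into [0, 1].
        taken = np.zeros_like(self.pi)
        taken[action] = 1.0
        # Chosen action: pi += alpha * tau * (1 - pi);
        # other actions: pi -= alpha * tau * pi.
        self.pi += self.alpha * reward * (taken - self.pi)
        self.pi /= self.pi.sum()  # guard against rounding drift
```

In a multi-state game one such automaton would be kept per state, and `update` would be called with the time-averaged reward defined below in Equation (4).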


The reward $\tau_i(t)$ for agent $i$'s automaton $\mathrm{LA}_i(s)$ associated with state $s$ is defined as
$$\tau_i(t) = \frac{\Delta r_i}{\Delta t} = \frac{\sum_{l=t_0(s)}^{t-1} r_i(l)}{t - t_0(s)}, \qquad (4)$$
where $r_i(l)$ is the immediate reward for agent $i$ in round $l$ and $t_0(s)$ is the last occurrence function, which determines when state $s$ was visited last. The reward feedback in round $t$ equals the cumulative reward $\Delta r_i$ divided by the time-frame $\Delta t$. The cumulative reward $\Delta r_i$ is the sum over all immediate rewards gathered in all states, beginning with round $t_0(s)$ and including the last round $t-1$. The time-frame $\Delta t$ measures the number of rounds that have passed since automaton $\mathrm{LA}_i(s)$ was last active. This means the state strategy is updated using the average reward over the interim immediate rewards.

2.4 Replicator dynamics
From the perspective of evolutionary game theory, the dynamics of strategies can be formulated as a system of differential equations. Each replicator represents one strategy. Strategies that gain an above-average payoff become more likely to be played [25]. The dynamics of the probability $\pi_{i,a^*}$ that player $i$ takes action $a^*$ can be written as
$$\frac{d\pi_{i,a^*}}{dt} = \left(f_{i,a^*} - f_{i,\pi}\right) \pi_{i,a^*}, \qquad (5)$$
where $f_{i,a^*}$ is the payoff when player $i$ takes action $a^*$ and $f_{i,\pi}$ is the average payoff when player $i$ applies strategy $\pi$.

In a two-player two-action game, the expected payoffs of cooperation and defection of player $P$ can be written as
$$f(c, \boldsymbol{q}) = a_{cc} q + a_{cd}(1-q) = (A\boldsymbol{q})_c, \qquad f(d, \boldsymbol{q}) = a_{dc} q + a_{dd}(1-q) = (A\boldsymbol{q})_d, \qquad (6)$$
and similarly we can write the expected payoff of strategy $\boldsymbol{p}$ as
$$f(\boldsymbol{p}, \boldsymbol{q}) = p f(c, \boldsymbol{q}) + (1-p) f(d, \boldsymbol{q}) = \boldsymbol{p}^{\mathrm{T}} A \boldsymbol{q}. \qquad (7)$$
Thus, the two-player two-action replicator dynamics can be defined as the following system of ordinary differential equations,
$$\frac{dp}{dt} = \left[(A\boldsymbol{q})_c - \boldsymbol{p}^{\mathrm{T}} A \boldsymbol{q}\right] p, \qquad \frac{dq}{dt} = \left[(B^{\mathrm{T}}\boldsymbol{p})_c - \boldsymbol{q}^{\mathrm{T}} B^{\mathrm{T}} \boldsymbol{p}\right] q. \qquad (8)$$
As there are only two actions for each player, the replicator dynamics can also be written as
$$\frac{dp}{dt} = \left[(A\boldsymbol{q})_c - (A\boldsymbol{q})_d\right] (1-p)p, \qquad \frac{dq}{dt} = \left[(B^{\mathrm{T}}\boldsymbol{p})_c - (B^{\mathrm{T}}\boldsymbol{p})_d\right] (1-q)q. \qquad (9)$$

3 REPLICATOR DYNAMICS IN STOCHASTIC GAMES
If the game has only one state, the joint action results in a deterministic outcome and brings a deterministic reward to the players. However, in a stochastic game, as the joint action influences the next state players are in, the average reward of a joint action in one state is affected by the future rewards players can receive. Thus, how to evaluate the payoff of an action and a strategy in one state of a stochastic game is a critical point in deriving the replicator dynamics for stochastic games.

3.1 State-coupled replicator dynamics
In [8], the authors propose an approach named state-coupled replicator dynamics to derive the replicator dynamics in stochastic games. In their work, they first need to obtain an average reward game for each state. For a stochastic game $G = \langle n, \mathbf{S}, \mathbf{\Omega}, z, \tau, \pi_1, \cdots, \pi_n \rangle$ and state $s \in \mathbf{S}$, the average reward game is defined as
$$\bar{G}(s, \pi_1 \ldots \pi_n) = \left\langle n, \mathbf{\Omega}_1(s) \ldots \mathbf{\Omega}_n(s), \bar{\tau}, \pi_1(s') \ldots \pi_n(s') \right\rangle, \qquad (10)$$
where each player $i$ plays a fixed strategy $\pi_i(s')$ in all states $s' \neq s$. The payoff function $\bar{\tau}_i$ is given by
$$\bar{\tau}_i(s, \boldsymbol{a}) = x_s(s, \boldsymbol{a}) \tau_i(s, \boldsymbol{a}) + \sum_{s' \in \mathbf{S} - \{s\}} x_{s'}(s, \boldsymbol{a}) F_i^{s'}, \qquad (11)$$
where
$$F_i^{s'} = \sum_{\boldsymbol{a}' \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s')} \tau_i(s', \boldsymbol{a}') \left( \prod_{i=1}^{n} \pi_{i, a_i'}(s') \right). \qquad (12)$$
In Equation (11), $x_{s^*}(s, \boldsymbol{a})$ is a stationary distribution over all states $\mathbf{S}$, given that the joint action $\boldsymbol{a}$ is played in state $s$. The distribution meets the condition $\sum_{s^* \in \mathbf{S}} x_{s^*}(s, \boldsymbol{a}) = 1$ and
$$x_{s^*}(s, \boldsymbol{a}) = x_s(s, \boldsymbol{a}) z_{s^*}(s, \boldsymbol{a}) + \sum_{s' \in \mathbf{S} - \{s\}} x_{s'}(s, \boldsymbol{a}) Z_{s^*}^{s'}, \qquad (13)$$
where
$$Z_{s^*}^{s'} = \sum_{\boldsymbol{a}' \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s')} z_{s^*}(s', \boldsymbol{a}') \left( \prod_{i=1}^{n} \pi_{i, a_i'}(s') \right). \qquad (14)$$

Given the average reward games of each state, the state-coupled replicator dynamics are defined by the following system of differential equations:
$$\frac{d\pi_{i,j}(s)}{dt} = \left[ f_i(s, \boldsymbol{e}_j) - f_i(s, \pi_i(s)) \right] \pi_{i,j}(s)\, x_s(\pi), \qquad (15)$$
where $\boldsymbol{e}_j$ is the $j$-th unit vector, representing the case that player $i$ takes action $j$. $f_i(s, \omega)$ is the expected payoff for player $i$ playing strategy $\omega$ in state $s$, defined as
$$f_i(s, \omega) = \sum_{j \in \mathbf{\Omega}_i(s)} \omega_j \left[ \sum_{\boldsymbol{a} \in \prod_{l \neq i} \mathbf{\Omega}_l(s)} \bar{\tau}_i(s, \boldsymbol{a}) \left( \prod_{l \neq i} \pi_{l, a_l}(s) \right) \right], \qquad (16)$$
where the joint action $\boldsymbol{a} = (a_1, \cdots, a_{i-1}, j, a_{i+1}, \cdots, a_n)$. $x$ is the stationary distribution over all states under strategy $\pi$, with
$$\sum_{s \in \mathbf{S}} x_s(\pi) = 1 \quad \text{and} \qquad (17)$$
$$x_s(\pi) = \sum_{s^* \in \mathbf{S}} x_{s^*}(\pi) \left[ \sum_{\boldsymbol{a} \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s^*)} z_s(s^*, \boldsymbol{a}) \left( \prod_{i=1}^{n} \pi_{i, a_i}(s^*) \right) \right]. \qquad (18)$$
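The two-action replicator equations (9) are the stateless building block that both the state-coupled dynamics of Equation (15) and the state-transition dynamics introduced below generalize. As a hedged illustration, the following sketch integrates Equation (9) with a plain Euler scheme; the payoff matrices, step size and initial point are our own illustrative choices, not values from the paper.

```python
import numpy as np

def replicator_step(p, q, A, B, dt=0.01):
    """One Euler step of the two-player two-action replicator
    dynamics of Equation (9). p, q are cooperation probabilities."""
    pvec = np.array([p, 1.0 - p])
    qvec = np.array([q, 1.0 - q])
    Aq = A @ qvec          # expected payoffs of c and d for player P
    Bp = B.T @ pvec        # expected payoffs of c and d for player Q
    dp = (Aq[0] - Aq[1]) * p * (1.0 - p)
    dq = (Bp[0] - Bp[1]) * q * (1.0 - q)
    return p + dt * dp, q + dt * dq

# Illustrative example: a single-state Prisoner's Dilemma.
A = np.array([[3.0, 1.0], [4.0, 2.0]])   # row player P
B = np.array([[3.0, 4.0], [1.0, 2.0]])   # column player Q
p, q = 0.8, 0.6
for _ in range(5000):
    p, q = replicator_step(p, q, A, B)
print(p, q)   # both cooperation probabilities drift toward defection
```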


3.2 State-transition replicator dynamics
In the derivation of the state-coupled replicator dynamics, one critical step is to obtain the expected payoff $f_i(s, \boldsymbol{a})$ for the joint action $\boldsymbol{a}$ taken in state $s$. The expected payoff is given by adding the immediate payoff for joint action $\boldsymbol{a}$ in state $s$ to the expected payoffs in all other states. Expected payoffs are then weighted by the frequency of the corresponding state occurrences. When calculating the expected payoffs in other states and the frequency of state occurrences, the strategies in other states are assumed to be fixed.

In this paper, we propose a new approach to derive the average reward game and the replicator dynamics in stochastic games, by means of describing the dynamic process in a stochastic game as a Markov chain. As we use the properties of the Markov chain and the transition matrix in the derivation, we call this approach state-transition replicator dynamics.

One outcome of the Markov chain for a stochastic game is the combination of a state $s$ and the joint action $\boldsymbol{a}$ in this state, which is written as $O_{(s, \boldsymbol{a})}$. In total, the Markov chain has $N = \sum_{s \in \mathbf{S}} \prod_{i=1}^{n} |\mathbf{\Omega}_i(s)|$ possible outcomes. For a stochastic game $G = \langle n, \mathbf{S}, \mathbf{\Omega}, z, \tau, \pi_1, \cdots, \pi_n \rangle$, transitions between outcomes in the Markov chain can be represented by a transition matrix $M(G)$ with the entries
$$P\left(X_{t+1} = O_{(s, \boldsymbol{a})} \mid X_t = O_{(s', \boldsymbol{a}')}\right) = z_s(s', \boldsymbol{a}') \prod_{i=1}^{n} \pi_{i, a_i}(s). \qquad (19)$$
The entry of the transition matrix is the probability that the outcome moves from $O_{(s', \boldsymbol{a}')}$ to $O_{(s, \boldsymbol{a})}$. This probability is the product of two parts. One is the state transition probability, $z_s(s', \boldsymbol{a}')$. The other is the probability of joint action $\boldsymbol{a}$ taking place in state $s$, which is written as $\prod_{i=1}^{n} \pi_{i, a_i}(s)$, where $a_i \in \boldsymbol{a}$.

We define the vector $\boldsymbol{v}(G)$ as the left eigenvector of the transition matrix $M(G)$ corresponding to the eigenvalue 1. This vector represents the stationary distribution of outcomes in the Markov chain. The entry $v_{(s, \boldsymbol{a})}$ of this vector is the expected frequency of observing the outcome $O_{(s, \boldsymbol{a})}$ over the course of the stochastic game. Furthermore, we know the immediate payoffs for the players given a state $s$ and a joint action $\boldsymbol{a}$. Thus, for stochastic game $G$, the expected payoff of player $i$ can be computed by
$$f_i(G) = \sum_{s \in \mathbf{S},\ \boldsymbol{a} \in \prod_{l=1}^{n} \mathbf{\Omega}_l(s)} v_{(s, \boldsymbol{a})} \cdot \tau_i(s, \boldsymbol{a}). \qquad (20)$$

To obtain the average reward game, we need to calculate the expected payoffs of the combinations of every state and every joint action. The joint action in state $s$ can be seen as each player taking a pure strategy in this state. Given that players take the joint action $\boldsymbol{a}^*$ in state $s^*$, the transition matrix $M^*(G)$ for this combination of state and joint action can be written as
$$P\left(O_{(s, \boldsymbol{a})} \mid O_{(s', \boldsymbol{a}')}\right) = \begin{cases} 0 & \text{if } s = s^*,\ \boldsymbol{a} \neq \boldsymbol{a}^* \\ z_s(s', \boldsymbol{a}') & \text{if } s = s^*,\ \boldsymbol{a} = \boldsymbol{a}^* \\ z_s(s', \boldsymbol{a}') \prod_{i=1}^{n} \pi_{i, a_i}(s) & \text{if } s \neq s^*. \end{cases} \qquad (21)$$
With the transition matrix $M^*(G)$, we can obtain the stationary distribution $\boldsymbol{v}^*(G)$ with entries $v^*_{(s, \boldsymbol{a})}$. The expected payoff function $\bar{\tau}_i(s^*, \boldsymbol{a}^*)$ is then given by
$$\bar{\tau}_i(s^*, \boldsymbol{a}^*) = \sum_{s \in \mathbf{S},\ \boldsymbol{a} \in \prod_{l=1}^{n} \mathbf{\Omega}_l(s)} v^*_{(s, \boldsymbol{a})} \cdot \tau_i(s, \boldsymbol{a}). \qquad (22)$$
Given the average reward games, we can use Equations (15) and (16) to derive the system of replicator dynamics. The stationary distribution over all states under strategy $\pi$ equals
$$x_s(\pi) = \sum_{\boldsymbol{a} \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s)} v_{(s, \boldsymbol{a})}. \qquad (23)$$

3.3 Results of strategy trajectory traces
In this section, we plot multiple strategy trajectory traces originating from learning automata as well as from state-coupled and state-transition replicator dynamics, based on the stochastic game presented in [8]. These results illustrate that the state-transition replicator dynamics can model the learning process precisely.

The payoff matrices of the two-state two-player two-action stochastic game are given by
$$(A^1, B^1) = \begin{pmatrix} 3, 3 & 0, 10 \\ 10, 0 & 2, 2 \end{pmatrix}, \quad (A^2, B^2) = \begin{pmatrix} 4, 4 & 0, 10 \\ 10, 0 & 1, 1 \end{pmatrix}. \qquad (24)$$
The transition probabilities are defined as
$$z_{s'}(s, cc) = z_{s'}(s, dd) = 0.1, \qquad z_{s'}(s, cd) = z_{s'}(s, dc) = 0.9, \qquad (25)$$
where $s, s' \in \mathbf{S}$ and $s \neq s'$. Player $P$ ($Q$) has strategies $\boldsymbol{p}^1$ ($\boldsymbol{q}^1$) and $\boldsymbol{p}^2$ ($\boldsymbol{q}^2$) in states $s^1$ and $s^2$, respectively. The pure stationary equilibrium corresponds to the strategies where one of the players defects in one state while cooperating in the other, and the second player does exactly the opposite.

Figure 2 presents trajectory plots for this stochastic game. The state-transition replicator dynamics model the learning dynamics precisely, as do the state-coupled replicator dynamics proposed in [8]. Both methods describe the coupling between states. The main difference between the two methods is the approach used to obtain the average reward game and the stationary distribution over all states under strategy $\pi$. The state-coupled method copes with these problems by building a system of multi-variable equations and solving it. The state-transition replicator dynamics method proposed in this paper solves these problems by viewing the dynamic process in stochastic games as a Markov chain; the average reward game and the stationary distribution over states can then be obtained by calculating the stationary vector of the transition matrix.

Figure 2: The strategy trajectory traces of learning automata, state-coupled and state-transition replicator dynamics. Initial strategy profiles are picked randomly in state 1 and state 2 at the start.
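The computations behind Equations (19), (20) and (23) reduce to building the outcome-level transition matrix and extracting its stationary left eigenvector. The following sketch does this for the two-state game of Equations (24)–(25). It is a minimal illustration under our own naming (`transition_matrix`, `stationary`, `expected_payoff_P` are not from the paper), and the eigenvector extraction via `numpy.linalg.eig` is only one possible implementation.

```python
import numpy as np

STATES = [0, 1]                              # s^1, s^2
ACTIONS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (a_P, a_Q), 0 = c, 1 = d
OUTCOMES = [(s, a) for s in STATES for a in ACTIONS]

# Payoff matrices of Equation (24); A[s][aP][aQ] is P's payoff.
A = np.array([[[3, 0], [10, 2]], [[4, 0], [10, 1]]], dtype=float)

def z_next(s, a, s_next):
    """Transition probabilities of Equation (25): switch state with
    probability 0.1 after cc/dd and with probability 0.9 after cd/dc."""
    switch = 0.1 if a in [(0, 0), (1, 1)] else 0.9
    return switch if s_next != s else 1.0 - switch

def transition_matrix(p, q):
    """Outcome-level transition matrix of Equation (19);
    p[s], q[s] are the cooperation probabilities in state s."""
    M = np.zeros((len(OUTCOMES), len(OUTCOMES)))
    for i, (s1, a1) in enumerate(OUTCOMES):
        for j, (s2, a2) in enumerate(OUTCOMES):
            prob_a = (p[s2] if a2[0] == 0 else 1 - p[s2]) * \
                     (q[s2] if a2[1] == 0 else 1 - q[s2])
            M[i, j] = z_next(s1, a1, s2) * prob_a
    return M

def stationary(M):
    """Left eigenvector of M for eigenvalue 1, normalised to sum 1."""
    w, V = np.linalg.eig(M.T)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

def expected_payoff_P(p, q):
    """Equation (20) for player P."""
    v = stationary(transition_matrix(p, q))
    return sum(v[i] * A[s][a[0]][a[1]] for i, (s, a) in enumerate(OUTCOMES))

print(expected_payoff_P(p=[0.5, 0.5], q=[0.5, 0.5]))
```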


4 APPLICATIONS
By viewing a stochastic game from the perspective of a Markov chain and utilizing the properties of the transition matrix, we obtain a powerful tool to analyze and design specific stochastic games.

4.1 Unbalanced transition probabilities between states
We first illustrate how transition probabilities between states influence the dynamics of strategies and promote cooperation in social dilemmas. It is assumed that players take part in three alike Prisoner's Dilemma games in a stochastic game. The payoff matrices in each state are set as
$$(A^1, B^1) = (A^2, B^2) = (A^3, B^3) = \begin{pmatrix} 3, 3 & 1, 4 \\ 4, 1 & 2, 2 \end{pmatrix}. \qquad (26)$$
The difference between these three states is defined by the transition probabilities. State 1 is assumed to be a cooperation back state: if players cooperate mutually, they have a high probability of being in state 1 in the next round; otherwise, they have a high probability of staying in state 2 or state 3. Here, the transition probabilities are set as $z_{s^1}(s, cc) = 0.9$, $z_{s^1}(s, \boldsymbol{a}) = 0.1$, $z_{s^2}(s, cc) = z_{s^3}(s, cc) = (1 - z_{s^1}(s, cc))/2$ and $z_{s^2}(s, \boldsymbol{a}) = z_{s^3}(s, \boldsymbol{a}) = (1 - z_{s^1}(s, \boldsymbol{a}))/2$, where $s \in \mathbf{S}$ and $\boldsymbol{a} \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s) \setminus (cc)$.

Figure 3 plots the strategy trajectory traces originating from learning automata and state-transition replicator dynamics in the stochastic game with unbalanced transition probabilities. It can be observed that, for most of the initialization conditions shown in Figure 3, both players $P$ and $Q$ converge to cooperation in state $s^1$ and tend to defect mutually in states $s^2$ and $s^3$. The mutual defection in states $s^2$ and $s^3$ can be seen as a punishment for players if they choose defection in state $s^1$; the temptation of defection in state $s^1$ cannot cover the loss of leaving it. However, there still exist some initialization conditions for which players turn out to be mutually defective in state $s^1$. With the help of the state-transition replicator dynamics model, we can investigate which initialization conditions players need to meet for cooperation to prevail.

Figure 3: The strategy trajectory traces of learning automata and state-transition replicator dynamics in stochastic games with unbalanced transition probabilities. Initial strategy profiles are picked randomly in states $s^1$, $s^2$ and $s^3$ at the start. The shaded areas in panels (a) and (d) illustrate the region in which strategies $p^1$ and $q^1$ converge to 0 once they fall into it.

For player $i$ in a stochastic game, we define the vector $\theta_i = (a^1, \cdots, a^k)$ as the combination of actions taken by player $i$ in each state $s$. For player $i$ and state $s$, by setting the strategy as
$$\pi_{i,a}(s) = \begin{cases} 1 & \text{if } a = a^s \\ 0 & \text{if } a \neq a^s, \end{cases} \qquad (27)$$
and substituting it into Equations (19) and (20), we can obtain the expected payoffs for player $i$ with the combination of actions $\theta_i$ against the other players' strategies.

In the stochastic game defined in this section, players have two actions, $c$ and $d$, in three states. By enumerating all the possible combinations of actions for players $P$ and $Q$, we obtain the expected payoff matrix shown below. The entries of the matrix represent the expected payoffs for players $P$ and $Q$ with combinations of actions $\theta_P$ and $\theta_Q$; the first value in each entry is the payoff of the row player $P$ and the second one that of the column player $Q$:
$$\begin{array}{c|cccccccc}
\theta_P \backslash \theta_Q & (c,c,c) & (c,c,d) & (c,d,c) & (c,d,d) & (d,c,c) & (d,c,d) & (d,d,c) & (d,d,d) \\ \hline
(c,c,c) & 3.0, 3.0 & 2.83, 3.08 & 2.83, 3.08 & 2.0, 3.5 & 2.0, 3.5 & 1.64, 3.68 & 1.64, 3.68 & 1.0, 4.0 \\
(c,c,d) & 3.08, 2.83 & 2.92, 2.92 & 2.75, 2.75 & 2.25, 3.0 & 2.61, 2.71 & 1.96, 3.04 & 2.35, 2.65 & 1.45, 3.1 \\
(c,d,c) & 3.08, 2.83 & 2.75, 2.75 & 2.92, 2.92 & 2.25, 3.0 & 2.61, 2.71 & 2.35, 2.65 & 1.97, 3.04 & 1.45, 3.1 \\
(c,d,d) & 3.5, 2.0 & 3.0, 2.25 & 3.0, 2.25 & 2.5, 2.5 & 3.7, 1.3 & 2.8, 1.75 & 2.8, 1.75 & 1.9, 2.2 \\
(d,c,c) & 3.5, 2.0 & 2.71, 2.61 & 2.71, 2.61 & 1.3, 3.7 & 2.5, 2.5 & 2.0, 2.97 & 2.0, 2.97 & 1.1, 3.8 \\
(d,c,d) & 3.68, 1.64 & 3.04, 1.96 & 2.65, 2.35 & 1.75, 2.8 & 2.96, 2.0 & 2.32, 2.32 & 2.45, 2.45 & 1.55, 2.9 \\
(d,d,c) & 3.68, 1.64 & 2.65, 2.35 & 3.04, 1.96 & 1.75, 2.8 & 2.96, 2.0 & 2.45, 2.45 & 2.32, 2.32 & 1.55, 2.9 \\
(d,d,d) & 4.0, 1.0 & 3.1, 1.45 & 3.1, 1.45 & 2.2, 1.9 & 3.8, 1.1 & 2.9, 1.55 & 2.9, 1.55 & 2.0, 2.0
\end{array}$$
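The expected-payoff matrix above can, in principle, be reproduced from Equations (19), (20) and (27): fix a pure action combination $\theta_P$, $\theta_Q$, build the outcome-level transition matrix of the induced Markov chain, and average the stage payoffs under its stationary distribution. The sketch below is an illustrative implementation for the three-state game of Equation (26) with the unbalanced transition probabilities given above; all function names are our own, and the `stationary` helper repeats the eigenvector computation from the earlier sketch.

```python
import numpy as np
from itertools import product

PD = np.array([[3.0, 1.0], [4.0, 2.0]])   # Equation (26), same in all states
N_STATES = 3

def stationary(M):
    w, V = np.linalg.eig(M.T)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

def z_to_state1(a):
    """Unbalanced transitions: 0.9 to state 1 after mutual cooperation,
    0.1 otherwise; the remainder is split between states 2 and 3."""
    return 0.9 if a == (0, 0) else 0.1

def outcome_chain(theta_P, theta_Q):
    """Transition matrix over outcomes (s, a) when both players use the
    pure action combinations theta_P, theta_Q (0 = c, 1 = d)."""
    outcomes = [(s, a) for s in range(N_STATES)
                for a in product((0, 1), repeat=2)]
    M = np.zeros((len(outcomes), len(outcomes)))
    for i, (s, a) in enumerate(outcomes):
        z1 = z_to_state1(a)
        state_prob = [z1, (1 - z1) / 2, (1 - z1) / 2]
        for j, (s2, a2) in enumerate(outcomes):
            # Pure strategies: the joint action in s2 is played with prob. 1.
            played = (a2 == (theta_P[s2], theta_Q[s2]))
            M[i, j] = state_prob[s2] * (1.0 if played else 0.0)
    return M, outcomes

def expected_payoffs(theta_P, theta_Q):
    M, outcomes = outcome_chain(theta_P, theta_Q)
    v = stationary(M)
    fP = sum(v[i] * PD[a[0], a[1]] for i, (s, a) in enumerate(outcomes))
    fQ = sum(v[i] * PD[a[1], a[0]] for i, (s, a) in enumerate(outcomes))
    return fP, fQ

# Mutual all-cooperation should give (3.0, 3.0), matching the first entry.
print(expected_payoffs((0, 0, 0), (0, 0, 0)))
```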



We find that there exist two pure Nash equilibria, $((c,d,d),(c,d,d))$ and $((d,d,d),(d,d,d))$. That is to say, for both players, the best actions in states $s^2$ and $s^3$ are defection. However, in state $s^1$, an unstable equilibrium point exists from the perspective of evolutionary game theory. By eliminating the dominated actions, we obtain the following sub payoff matrix,
$$\begin{array}{c|cc} & (c,d,d) & (d,d,d) \\ \hline (c,d,d) & 2.5, 2.5 & 1.9, 2.2 \\ (d,d,d) & 2.2, 1.9 & 2.0, 2.0 \end{array} \qquad (28)$$
For player $P$, the expected payoff of cooperation in state $s^1$, $f(c, \boldsymbol{q}^1)$, equals $2.5 q^1 + 1.9(1 - q^1)$ and the expected payoff of defection, $f(d, \boldsymbol{q}^1)$, equals $2.2 q^1 + 2.0(1 - q^1)$. The dynamic direction of strategy $p^1$ is determined by
$$f(c, \boldsymbol{q}^1) - f(d, \boldsymbol{q}^1) = 0.4 q^1 - 0.1. \qquad (29)$$
There exists a value $\hat{q}^1 = 0.25$ at which $f(c, \boldsymbol{q}^1) - f(d, \boldsymbol{q}^1)$ equals 0. When the strategy $q^1$ is larger than $\hat{q}^1$, player $P$ would like to be cooperative in state $s^1$, and vice versa. In Figure 3 (a) and (d), there exists a shaded area where $p^1 < 0.25$ and $q^1 < 0.25$. If strategies $p^1$ and $q^1$ initially fall into this area or evolve into it, both players $P$ and $Q$ would like to switch to defection.

Figure 4 illustrates the relationship between $dp^1/dt$ and $q^1$ for different parameters $z_{s^1}(s, cc)$ when the strategies of players $P$ and $Q$ are purely defective in states $s^2$ and $s^3$. Meanwhile, the parameter $z_{s^1}(s, \boldsymbol{a})$ is fixed at 0.1 for $\boldsymbol{a} \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s) \setminus (cc)$. We find that the unstable equilibrium point $\hat{q}^1$ becomes larger as $z_{s^1}(s, cc)$ decreases. When $z_{s^1}(s, cc)$ becomes 0.5, the unstable equilibrium point vanishes and the strategy $p^1$ turns to defection no matter what strategy $q^1$ is.

Figure 4: Gradient of strategy $p^1$ as a function of the strategy $q^1$ for different parameters $z_{s^1}(s, cc)$. The strategies of players $P$ and $Q$ in states $s^2$ and $s^3$ are fixed as purely defective. The parameter $z_{s^1}(s, \boldsymbol{a})$ is fixed at 0.1 for $\boldsymbol{a} \in \prod_{i=1}^{n} \mathbf{\Omega}_i(s) \setminus (cc)$. The crosses represent the unstable equilibrium points where $dp^1/dt$ equals 0. The value of $p^1$ can influence the value of $dp^1/dt$, but cannot move the unstable equilibrium points; in this figure, we fix $p^1$ as 0.5. The left cross is for $z_{s^1}(s, cc) = 0.9$, where $\hat{q}^1$ equals 0.25. The right cross is for $z_{s^1}(s, cc) = 0.7$, where $\hat{q}^1$ equals 2/3.

4.2 Control the payoffs of players
In this section, by extending the analysis method proposed in [15] from stateless games to stochastic games and utilizing the state-transition replicator dynamics, we demonstrate that a set of specifically derived transition probabilities in stochastic games can control the payoffs of players.

For a two-state ($s^1$, $s^2$) two-player ($P$, $Q$) two-action ($c$, $d$) stochastic game $G$, the full set of outcomes of the Markov chain can be written as a vector, $\boldsymbol{O} = ((s^1, cc), (s^1, cd), (s^1, dc), (s^1, dd), (s^2, cc), (s^2, cd), (s^2, dc), (s^2, dd))$. For simplicity, we define that the outcomes in $\boldsymbol{O}$ are labeled $1, \cdots, 8$ following the order in the vector. The probabilities of the next state being $s^1$ for the respective outcomes are given by a vector $\boldsymbol{z} = (z_1, z_2, z_3, z_4, z_5, z_6, z_7, z_8)$. $\boldsymbol{v}$ is the stationary vector of the transition matrix $M$ defined in Equation (19). The vector $\boldsymbol{v}$ satisfies
$$\boldsymbol{v} M = \boldsymbol{v} \quad \text{or} \quad \boldsymbol{v} M' = \boldsymbol{0}, \qquad (30)$$
where $M' = M - I$ and $I$ is the identity matrix with the same dimension as $M$. Because the transition matrix $M$ is stochastic and has a unit eigenvalue, the matrix $M'$ is singular and thus has zero determinant. Further, by Cramer's rule, we have
$$\mathrm{Adj}(M') M' = \det(M') I = \boldsymbol{0}, \qquad (31)$$
where $\boldsymbol{0}$ is a zero matrix with the same dimensions as $M'$ and $\mathrm{Adj}(M')$ is the adjoint matrix of $M'$.


Let $m^*_{ij}$ be the $(i, j)$ entry of the cofactor matrix of $M'$; then $\mathrm{Adj}(M')$ is its transpose,
$$\mathrm{Adj}(M') = \begin{pmatrix} m^*_{11} & m^*_{21} & \cdots & m^*_{81} \\ m^*_{12} & m^*_{22} & \cdots & m^*_{82} \\ \vdots & \vdots & \ddots & \vdots \\ m^*_{18} & m^*_{28} & \cdots & m^*_{88} \end{pmatrix}. \qquad (32)$$
From Equations (30) and (31), every row of $\mathrm{Adj}(M')$ is proportional to $\boldsymbol{v}$. Thus, we have $\boldsymbol{v} = \rho (m^*_{18}, m^*_{28}, m^*_{38}, m^*_{48}, m^*_{58}, m^*_{68}, m^*_{78}, m^*_{88})$ for some $\rho \neq 0$.

For the two-state game, by Equation (19) the $i$-th row of $M$ (corresponding to outcome $i$) is
$$\big(z_i p^1 q^1,\ z_i p^1(1-q^1),\ z_i (1-p^1) q^1,\ z_i (1-p^1)(1-q^1),\ (1-z_i) p^2 q^2,\ (1-z_i) p^2(1-q^2),\ (1-z_i)(1-p^2) q^2,\ (1-z_i)(1-p^2)(1-q^2)\big),$$
and $M' = M - I$. By adding the first, second and third columns of $M'$ to the fourth column, we obtain a matrix $H$ whose fourth column equals
$$\tilde{\boldsymbol{z}}^{\mathrm{T}} \equiv (z_1 - 1,\ z_2 - 1,\ z_3 - 1,\ z_4 - 1,\ z_5,\ z_6,\ z_7,\ z_8)^{\mathrm{T}}, \qquad (33)$$
while all other columns of $H$ coincide with those of $M'$. It is clear that $\tilde{\boldsymbol{z}}$ depends only on the transition probabilities, not on the players' strategies.

Let $h^*_{ij}$ be the $(i, j)$ entry of the cofactor matrix of $H$. Since these column additions change neither the determinant of the matrix nor the cofactors with respect to the eighth column, we have $h^*_{i8} = m^*_{i8}$ for $i = 1, \cdots, 8$. By replacing the eighth column of $H$ with the transpose of an arbitrary eight-dimensional vector $\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_3, \eta_4, \eta_5, \eta_6, \eta_7, \eta_8)$, the determinant of the resulting matrix, which we denote $D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{\eta})$, can be computed by expanding along the eighth column,
$$D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{\eta}) = \eta_1 h^*_{18} + \eta_2 h^*_{28} + \eta_3 h^*_{38} + \eta_4 h^*_{48} + \eta_5 h^*_{58} + \eta_6 h^*_{68} + \eta_7 h^*_{78} + \eta_8 h^*_{88}, \qquad (34)$$
where $\hat{\boldsymbol{p}} = (p^1, p^2)$ and $\hat{\boldsymbol{q}} = (q^1, q^2)$ denote the combinations of strategies of players $P$ and $Q$ in states $s^1$ and $s^2$.

Recalling that $h^*_{i8} = m^*_{i8}$ for $i = 1, \cdots, 8$ and $\boldsymbol{v} = \rho(m^*_{18}, m^*_{28}, m^*_{38}, m^*_{48}, m^*_{58}, m^*_{68}, m^*_{78}, m^*_{88})$ for $\rho \neq 0$, Equation (34) implies that $\eta_1 h^*_{18} + \cdots + \eta_8 h^*_{88} = \frac{1}{\rho} (\boldsymbol{v} \cdot \boldsymbol{\eta})$. Then we have
$$\boldsymbol{v} \cdot \boldsymbol{\eta} = \rho D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{\eta}), \qquad (35)$$
where $D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{\eta})$ is the determinant value defined in Equation (34). Player $P$'s payoff vector is $\boldsymbol{S}_P = (a^1_{cc}, a^1_{cd}, a^1_{dc}, a^1_{dd}, a^2_{cc}, a^2_{cd}, a^2_{dc}, a^2_{dd})$, whereas player $Q$'s payoff vector is $\boldsymbol{S}_Q = (b^1_{cc}, b^1_{cd}, b^1_{dc}, b^1_{dd}, b^2_{cc}, b^2_{cd}, b^2_{dc}, b^2_{dd})$.


Their respective expected payoffs are
$$E_P = \frac{\boldsymbol{v} \cdot \boldsymbol{S}_P}{\boldsymbol{v} \cdot \boldsymbol{1}} = \frac{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{S}_P)}{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{1})}, \qquad E_Q = \frac{\boldsymbol{v} \cdot \boldsymbol{S}_Q}{\boldsymbol{v} \cdot \boldsymbol{1}} = \frac{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{S}_Q)}{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{1})}, \qquad (36)$$
where $\boldsymbol{1}$ is the vector with all components equal to 1.

The expected payoff of one player depends linearly on his own payoff vector. For player $P$, we can write this linear relationship as
$$\alpha E_P + \gamma = \frac{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \alpha \boldsymbol{S}_P + \gamma \boldsymbol{1})}{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{1})}. \qquad (37)$$
The determinant of any matrix is zero if two of its columns are identical or one column is a multiple of another. Since the fourth column contains only entries determined by the transition probabilities, the transition probabilities can force the determinant value $D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \alpha \boldsymbol{S}_P + \gamma \boldsymbol{1})$ to be 0. This is the case if the transition probabilities satisfy $\tilde{\boldsymbol{z}} \equiv (z_1 - 1, z_2 - 1, z_3 - 1, z_4 - 1, z_5, z_6, z_7, z_8) = \alpha \boldsymbol{S}_P + \gamma \boldsymbol{1}$, which means
$$\alpha E_P + \gamma = \frac{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \alpha \boldsymbol{S}_P + \gamma \boldsymbol{1})}{D(\boldsymbol{z}, \hat{\boldsymbol{p}}, \hat{\boldsymbol{q}}, \boldsymbol{1})} = 0. \qquad (38)$$
That is to say, the expected payoff of player $P$ is fixed at a constant value, $-\frac{\gamma}{\alpha}$. We call such transition probabilities zero-determinant (ZD) transition probabilities. Player $P$ cannot change his expected payoff by altering his strategies. Thus, in the stochastic game, player $P$ has no motivation to update his strategies and will keep his initial ones. Meanwhile, his opponent, player $Q$, can update his own strategies to pursue higher expected payoffs.

Let us consider the following payoff matrices representing Prisoner's Dilemma games in a stochastic game,
$$(A^1, B^1) = \begin{pmatrix} 3, 3 & 1, 4 \\ 4, 1 & 2, 2 \end{pmatrix}, \quad (A^2, B^2) = \begin{pmatrix} 8, 8 & 6, 9 \\ 9, 6 & 7, 7 \end{pmatrix}. \qquad (39)$$
With parameters $\alpha = 0.2$ and $\gamma = -1.0$, we obtain the ZD transition probabilities $\boldsymbol{z} = (0.6, 0.2, 0.8, 0.4, 0.6, 0.8, 0.2, 0.4)$ and the value $E_P = -\frac{\gamma}{\alpha} = 5$. As shown in Figure 5, player $P$ does not change his strategies, but keeps his initialized strategies. Meanwhile, player $Q$ turns fully defective to pursue higher expected payoffs in the social dilemmas. The state-transition replicator dynamics reproduce the theoretical analysis perfectly.

Figure 5: The strategy trajectory traces of state-transition replicator dynamics in stochastic games with ZD transition probabilities. Initial strategy profiles are picked randomly in states $s^1$ and $s^2$. Strategies $p^1$ and $p^2$ do not change and keep their initial values. Meanwhile, strategies $q^1$ and $q^2$ update to 0 in the social dilemmas.
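The zero-determinant construction can be checked numerically: choose $\alpha$ and $\gamma$ such that $\tilde{\boldsymbol{z}} = \alpha \boldsymbol{S}_P + \gamma \boldsymbol{1}$ yields valid probabilities, build the outcome-level transition matrix of Equation (19), and confirm that $E_P$ from Equation (36) stays at $-\gamma/\alpha$ regardless of the players' strategies. The sketch below performs this check for the game of Equation (39); it is an illustrative verification under our own naming, not code from the paper.

```python
import numpy as np

# Payoff vector S_P for the game of Equation (39), outcome order
# (s1,cc),(s1,cd),(s1,dc),(s1,dd),(s2,cc),(s2,cd),(s2,dc),(s2,dd).
S_P = np.array([3, 1, 4, 2, 8, 6, 9, 7], dtype=float)
alpha, gamma = 0.2, -1.0

# Equation (38): z_tilde = alpha * S_P + gamma * 1, with z_i = z_tilde_i + 1
# for the first four outcomes and z_i = z_tilde_i for the last four.
z_tilde = alpha * S_P + gamma
z = z_tilde + np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
assert np.all((0 <= z) & (z <= 1)), "alpha, gamma must give valid probabilities"

def transition_matrix(z, p, q):
    """Outcome-level transition matrix of Equation (19); p = (p1, p2),
    q = (q1, q2) are cooperation probabilities in the two states."""
    M = np.zeros((8, 8))
    for i in range(8):
        for j in range(8):
            s, joint = divmod(j, 4)        # destination state and joint action
            aP, aQ = divmod(joint, 2)      # 0 = c, 1 = d
            state_prob = z[i] if s == 0 else 1.0 - z[i]
            act_prob = (p[s] if aP == 0 else 1 - p[s]) * \
                       (q[s] if aQ == 0 else 1 - q[s])
            M[i, j] = state_prob * act_prob
    return M

def expected_payoff_P(p, q):
    M = transition_matrix(z, p, q)
    w, V = np.linalg.eig(M.T)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    v /= v.sum()
    return float(v @ S_P)

rng = np.random.default_rng(0)
for _ in range(5):
    p, q = rng.uniform(0.05, 0.95, 2), rng.uniform(0.05, 0.95, 2)
    print(round(expected_payoff_P(p, q), 6))   # stays at -gamma/alpha = 5.0
```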
5 CONCLUSION
In this paper, we propose a new approach named state-transition replicator dynamics for analyzing the dynamics of multi-agent learning in multi-state stochastic games. By describing the dynamic process in a stochastic game as a Markov chain and utilizing the properties of the transition matrix, we obtain a set of replicator dynamics that model the learning process in stochastic games.

Based on our approach and model, we have shown that transition probabilities have a significant influence on the dynamics of strategies. If the transition probabilities between states are unbalanced and there exists a cooperation back state, cooperative behaviors can prevail in this state when the strategies of the players meet certain conditions which can be derived with the state-transition replicator dynamics. Moreover, it has also been shown in this paper that a set of specific transition probabilities can control the expected payoffs of players in some stochastic games. ZD transition probabilities can fix the expected payoff of a player at a constant value. No matter what strategies he applies, his expected payoffs do not change; thus, such a player has no motivation to update his strategies. Meanwhile, the opponent can update his own strategies to become fully defective and obtain an extortionate payoff in the social dilemmas.

In this paper, we consider stochastic games where players have separate strategies in different states. Their strategies do not depend on the previous actions taken by players; that is, they are memoryless strategies. However, many previous works have shown that memory-one strategies, such as TFT, WSLS and ZD strategies, have a great impact on the evolution of cooperation. We think it is straightforward to extend our method to investigate the dynamics of memory-one strategies in stochastic games. If a player can make decisions based on previous actions, what the dominant strategies in stochastic games are and what kinds of transition probabilities can influence the dynamics of strategies are worth investigating.


REFERENCES
[1] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. 2015. Evolutionary Dynamics of Multi-Agent Learning: A Survey. Journal of Artificial Intelligence Research 53 (aug 2015), 659–697. https://doi.org/10.1613/jair.4818
[2] Michael Bowling and Manuela Veloso. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence 136, 2 (apr 2002), 215–250. https://doi.org/10.1016/s0004-3702(02)00121-2
[3] Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2008. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 2 (mar 2008), 156–172. https://doi.org/10.1109/tsmcc.2007.913919
[4] Thomas Dietz, Elinor Ostrom, and Paul C. Stern. 2003. The Struggle to Govern the Commons. Science 302, 5652 (dec 2003), 1907–1912. https://doi.org/10.1126/science.1091015
[5] Aram Galstyan. 2011. Continuous strategy replicator dynamics for multi-agent Q-learning. Autonomous Agents and Multi-Agent Systems 26, 1 (sep 2011), 37–53. https://doi.org/10.1007/s10458-011-9181-6
[6] Roman Gorbunov, Emilia Barakova, and Matthias Rauterberg. 2013. Design of social agents. Neurocomputing 114 (aug 2013), 92–97. https://doi.org/10.1016/j.neucom.2012.06.046
[7] Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Rémi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duèñez Guzmán, and Karl Tuyls. 2020. Neural Replicator Dynamics: Multiagent Learning via Hedging Policy Gradients. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 492–501.
[8] Daniel Hennes, Karl Tuyls, and Matthias Rauterberg. 2009. State-Coupled Replicator Dynamics. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2 (Budapest, Hungary) (AAMAS '09). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 789–796.
[9] Christian Hilbe, Štěpán Šimsa, Krishnendu Chatterjee, and Martin A. Nowak. 2018. Evolution of cooperation in stochastic games. Nature 559, 7713 (jul 2018), 246–249. https://doi.org/10.1038/s41586-018-0277-x
[10] Michael Kaisers and Karl Tuyls. 2010. Frequency Adjusted Multi-Agent Q-Learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1 (Toronto, Canada) (AAMAS '10). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 309–316.
[11] Tomas Klos, Gerrit Jan van Ahee, and Karl Tuyls. 2010. Evolutionary Dynamics of Regret Minimization. In Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 82–96. https://doi.org/10.1007/978-3-642-15883-4_6
[12] Michael L. Littman. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163. https://doi.org/10.1016/b978-1-55860-335-6.50027-1
[13] Alex McAvoy and Martin A. Nowak. 2019. Reactive learning strategies for iterated games. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 475, 2223 (mar 2019), 20180819. https://doi.org/10.1098/rspa.2018.0819
[14] Martin A. Nowak. 2006. Five Rules for the Evolution of Cooperation. Science 314, 5805 (dec 2006), 1560–1563. https://doi.org/10.1126/science.1133755
[15] William H. Press and Freeman J. Dyson. 2012. Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences 109, 26 (may 2012), 10409–10413. https://doi.org/10.1073/pnas.1206569109
[16] David G. Rand and Martin A. Nowak. 2013. Human cooperation. Trends in Cognitive Sciences 17, 8 (aug 2013), 413–425. https://doi.org/10.1016/j.tics.2013.06.003
[17] Robert Axelrod. 1984. The Evolution of Cooperation. Basic Books, New York.
[18] Bettina Rockenbach and Manfred Milinski. 2006. The efficient interaction of indirect reciprocity and costly punishment. Nature 444, 7120 (dec 2006), 718–723. https://doi.org/10.1038/nature05229
[19] Francisco C. Santos, Marta D. Santos, and Jorge M. Pacheco. 2008. Social diversity promotes the emergence of cooperation in public goods games. Nature 454, 7201 (jul 2008), 213–216. https://doi.org/10.1038/nature06940
[20] Karl Sigmund, Hannelore De Silva, Arne Traulsen, and Christoph Hauert. 2010. Social learning promotes institutions for governing the commons. Nature 466, 7308 (jul 2010), 861–863. https://doi.org/10.1038/nature09203
[21] Satinder Singh, Michael Kearns, and Yishay Mansour. 2000. Nash Convergence of Gradient Dynamics in General-Sum Games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (Stanford, California) (UAI'00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 541–548.
[22] Qi Su, Alex McAvoy, Long Wang, and Martin A. Nowak. 2019. Evolutionary dynamics with game transitions. Proceedings of the National Academy of Sciences 116, 51 (nov 2019), 25398–25404. https://doi.org/10.1073/pnas.1908936116
[23] Mohammad A. Taha and Ayman Ghoneim. 2020. Zero-determinant strategies in repeated asymmetric games. Appl. Math. Comput. 369 (mar 2020), 124862. https://doi.org/10.1016/j.amc.2019.124862
[24] Peter Vrancx, Pasquale Gurzi, Abdel Rodriguez, Kris Steenhaut, and Ann Nowé. 2015. A Reinforcement Learning Approach for Interdomain Routing with Link Prices. ACM Transactions on Autonomous and Adaptive Systems 10, 1 (mar 2015), 1–26. https://doi.org/10.1145/2719648
[25] Peter Vrancx, Karl Tuyls, and Ronald Westra. 2008. Switching Dynamics of Multi-Agent Learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Estoril, Portugal) (AAMAS '08). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 307–313.
[26] Peter Vrancx, Katja Verbeeck, and Ann Nowe. 2008. Decentralized Learning in Markov Games. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38, 4 (aug 2008), 976–981. https://doi.org/10.1109/tsmcb.2008.920998
[27] Wenbo Wang, Pengda Huang, Peizhao Hu, Jing Na, and Andres Kwasinski. 2016. Learning in Markov Game for Femtocell Power Allocation with Limited Coordination. In 2016 IEEE Global Communications Conference (GLOBECOM). IEEE, 1–6. https://doi.org/10.1109/glocom.2016.7841950
