arXiv:1811.06126v1 [cs.GT] 15 Nov 2018

Cooperation Enforcement and Collusion Resistance in Repeated Public Goods Games

Kai Li
Shanghai Jiao Tong University
[email protected]

Dong Hao∗
University of Electronic Science and Technology of China
[email protected]

∗ Corresponding author.

Abstract

Enforcing cooperation among substantial agents is one of the main objectives for multi-agent systems. However, due to the existence of inherent social dilemmas in many scenarios, the free-rider problem may arise during agents' long-run interactions, and things become even more severe when self-interested agents work in collusion with each other to get extra benefits. It is commonly accepted that in such social dilemmas, there exists no simple strategy for an agent whereby she can simultaneously manipulate the utility of each of her opponents and further promote mutual cooperation among all agents. Here, we show that such strategies do exist. Under the conventional repeated public goods game, we identify them and find that, when confronted with such strategies, a single opponent can maximize his utility only via global cooperation and any colluding alliance cannot get the upper hand. Since full cooperation is individually optimal for any single opponent, a stable cooperation among all players can be achieved. Moreover, we experimentally show that these strategies can still promote cooperation even when the opponents are both self-learning and collusive.

Introduction

The emergence of multi-agent systems has become an irreversible tendency in real-world artificial intelligence. Enforcing substantial agents to achieve good outcomes in multi-agent systems is a long-standing challenge, especially when a large number of agents are involved or when the agents are in a long-run interaction (Santos, Pacheco, and Santos 2018; Wooldridge 2009; Kraus 1997). One of the reasons for such a challenge is that there is a social dilemma underlying many multi-agent systems (Panait and Luke 2005; Turner 1993). Although agents can bilaterally give costs and share benefits with others, whereby each agent's utility is improved and a positive social equilibrium can be achieved, without a proper incentive design the free-rider problem will occur and self-interested agents will most likely settle to competitive equilibria that are much sub-optimal. This well-known phenomenon is called the tragedy of the commons (Olson 2009; Hardin 1968). It becomes even more severe when some agents form a colluding group and tactically adjust the group strategy to get a higher average utility than that of the cooperative equilibrium (Telser 2017; Hilbe et al. 2014). Decades of research from different disciplines has sought to conquer the multi-agent social dilemma. It remains equally critical today.

One of the most representative models for social dilemmas is the repeated Public Goods Game (PGG), which is a dynamic extension of the abstract public goods games (Ledyard 1994). A repeated PGG could consist of a series of stage games (either finitely or infinitely many). In each stage game, the same group of $n$ agents privately decide how many of their tokens to put into a public pot. At the end of each stage, the tokens in the pot are multiplied by a factor $r$, representing the 'public good', and are equally distributed to all the agents. With $r$ smaller than the group size $n$, non-contributing free-riders gain more than contributors. In the experimental economics literature, studies suggest a common interval for $r/n$: greater than 0.3 and less than 0.75 (Isaac, Walker, and Thomas 1984). Although this multi-player game system can yield a maximum total utility when all the agents decide to fully contribute the tokens, the Nash equilibrium of the standard model of a single stage PGG is no contribution to the public good for every agent. In such a repeated social dilemma, according to the Folk Theorem, the long-run interaction allows for the emergence of many equilibria within which the social optimum can be directed. But for the repeated PGG, the problem is that, without proper strategy design, the individually rational agents will lead to decisions inferior to the 'socially rational' ones for all agents. This is the problem of how to select out some good equilibria, which can be loosely considered equivalent to a social norm (Santos 2017; Young 2015; Epstein 2001).

In this work, we propose a new theory to solve the social dilemma under the framework of the repeated PGG. We first set up the model of the joint Markov Decision Process (MDP) of the $n$-player repeated PGG. Via analysis of this game, we introduce a compelling relation between a single player's strategy and the resulting state of the repeated game. We then accordingly propose a theoretical method to offer one (or some) of the players a delicately designed Markov strategy, which we call cooperation enforcing. Confronted with such a strategy, a single opponent can maximize his utility only via global cooperation by all players, and any colluding alliance cannot get extra benefits. Since full cooperation is individually optimal for any single opponent, the social optimum, which means the stable cooperation among all players, can be achieved. Moreover, one can find that the classic strategies including Grim Trigger and Win-Stay Lose-Shift can all be viewed as special cases of our proposed cooperation enforcing strategies. Because any individual or group deviation from the social optimum will cause a utility decrease, the proposed strategy is also collusion-resistant.

Although the proposed strategy is theoretically proved to have the above nice properties, we additionally show some numerical evidence. In the simulation of the repeated PGG, the proposed strategy is run against an opponent group containing rational learning players. The learning player is modeled to behave like a human. Such a simulation can help us understand the performance of the proposed strategy in the real world.

The significance of this newly proposed cooperation-enforcing strategy is multi-fold. First, it identifies the individual player's power to unilaterally influence the payoff of each player in a group, which has not been discovered previously. Based on this unilateral influence, the $n$-player mutual cooperation can be established simply by one player launching such a strategy, while no additional condition for cooperation, such as an initial condition, extra punishment and reward, or social links, is required. This makes the model simple but profound. Second, it also allows us to understand how difficult it is for players to reach the social optimum and to sustain such an optimum. It could be further utilized to establish a good [...] under different environments. Third, it reduces the amount of optimization computation required by the players, since they do not have to search their entire strategy space if they are aware of the fact that at least one opponent is using the cooperation enforcing strategy. Last but not least, since under such strategies all-player cooperation is the social optimum, any complicated multilateral deviation from it will cause damage to the average utility. Thus these strategies naturally possess the property of resisting collusion, which has been a very tough problem in multi-player games. They provide us with a new possibility for solving collusion problems in more advanced game settings.

Repeated Multiplayer Games

Consider an infinitely repeated public goods game (PGG) among $n$ players with perfect monitoring. In each stage, every player $i \in \{1, \cdots, n\}$ decides whether to cooperate ($c$) by contributing her own endowment $e > 0$ to the public pool, or to defect ($d$) by contributing nothing. Without loss of generality, let $e = 1$. We denote player $i$'s action by $a_i \in A = \{c, d\}$. After every player executes a certain action, there are in total $2^n$ possible outcomes for each stage. Let $o \in A^n$ denote the outcome vector. To analyze the game from a group's perspective, we define a general representation of outcomes as follows.

Definition 1. A stage outcome can be represented as a tuple $o = (a_X, a_Y)$, where $a_X$ and $a_Y$ are the action profiles from player groups $X$ and $Y$, respectively, with $X \cap Y = \emptyset$ and $X \cup Y = \{1, \cdots, n\}$. Denote the number of cooperators and defectors in a group by functions $\#_c(\cdot)$ and $\#_d(\cdot)$, respectively.

For example, if $n = 4$ and an outcome is $o = ccdc$, then from a single player $i$'s perspective, $o = (a_i, a_{-i})$, where $a_i$ is the action of $i$ and $a_{-i}$ is the action profile of the other players. If $i = 1$, then $a_{-1} = cdc$ and $\#_c(a_{-1}) = 2$. From a joint perspective of two players $i$ and $j$, $o = (a_i a_j, a_{-ij})$, where $a_{-ij}$ is the action profile excluding $i$ and $j$. If $i = 1$ and $j = 2$, then $a_{-12} = dc$ and $\#_c(a_{-12}) = 1$. Besides, we define the $n$-player mutual cooperation and mutual defection as $o = c^n$ and $o = d^n$.

If the above game is infinitely repeated, then the strategy for each player is a mapping from any history to the current action $a$, which could be very complex. We concentrate on the strategies depending only on what happened in the previous stage. These are called memory-one strategies and can be described as a vector of probabilities conditioning on the outcomes of the previous stage. Intuitively, a memory-one strategy for the $n$-player PGG could be $2^n$-dimensional. However, if the agents are symmetric players, then the only issue that matters for a player is how many opponents cooperate (i.e., $\#_c(a_{-i})$) and defect (i.e., $\#_d(a_{-i})$). Therefore, from a focal player $i$'s perspective, any stage outcome $o \in A^n$ can be represented by a tuple $(a_i, k) \in A \times K$, where $a_i \in A = \{c, d\}$ is $i$'s action and $k = \#_c(a_{-i}) \in K = \{0, 1, \cdots, n-1\}$ is the number of cooperating opponents. Then the dimension of the strategy space is reduced to $|A \times K| = 2n$. Define the memory-one strategies for the repeated PGG as follows.

Definition 2. A memory-one strategy for player $i$ can be defined as a vector $\mathbf{p} = [p_{a_i,k}]$, where each $p_{a_i,k} = P\{c \mid (a_i, k)\}$ is the conditional probability for cooperation given the previous stage outcome $(a_i, k)$. More explicitly,
\[
\mathbf{p} = (p_{c,0}, \cdots, p_{c,k}, \cdots, p_{c,n-1},\; p_{d,0}, \cdots, p_{d,k}, \cdots, p_{d,n-1}). \tag{1}
\]

Such a strategy vector contains $2n$ components, while the full outcome space is $2^n$-dimensional. This is because the strategy uses the same cooperating probability $p_{a_i,k}$ for all outcomes $(a_i, a_{-i})$ that have the same number of cooperating opponents $k = \#_c(a_{-i})$. Most of the well-known classic strategies can be written in the above form. For instance, unconditional cooperation (ALLC) can be given by $(1, 1, \cdots, 1)$; unconditional defection (ALLD) is written as $(0, 0, \cdots, 0)$; the Repeat strategy, which repeats the previous action, can be represented as $\mathbf{p}^R$ with $p_{c,k} = 1$ and $p_{d,k} = 0$ for all $k$.

At the end of each stage game, the stage outcome is perfectly observed by all players, and the endowments in the current public pool are multiplied by a factor $r$, representing the 'public good', and are equally distributed to all the agents. Then the same stage game repeats. We call $r/n$ the game's marginal per-capita rate of return (MPCR) (Isaac, Walker, and Thomas 1984). For every unit a player spends, the MPCR measures how much she gets back. With $r$ smaller than the group size $n$, non-contributing free-riders gain more than contributors.

For repeated games, it is common to define payoffs as the average payoff that players receive over all stages. To calculate these expected payoffs, we first define a focal player's exact stage game payoff values and payoff vectors.

Definition 3. Facing $k \in K$ cooperating opponents, player $i$'s stage payoff is $R_{a_i,k} = r(k+1)/n - 1$ if $a_i = c$ and is $R_{a_i,k} = rk/n$ if $a_i = d$, where $r \in (1, n)$. Then the payoff vector for $i$ is $\boldsymbol{\pi}_i = (R_{a_i,k})_{(a_i,k) \in A \times K}$.

Based on the payoff vectors, one can now derive the average payoffs. Denote by $\mathbf{v}(t) = (v_o(t))_{o \in A^n}$ the probability distribution over the outcome $o(t) \in A^n$ at the $t$-th stage game, where $v_o(t) = P\{o(t) = o\}$. Let $\mathbf{v}$ be a limit point of the sequence $\{\frac{1}{t} \sum_{m=1}^{t} \mathbf{v}(m)\}$; we refer to it as the limit distribution. Note that when the limit distribution is unrelated to the outcome of the initial stage game, it is the unique stationary distribution of the Markov chain formed by all players' memory-one strategies. Player $i$'s expected payoff is an inner product of the limit distribution and her payoff vector, which is written as $\pi_i = \boldsymbol{\pi}_i \cdot \mathbf{v}$.

Both $\boldsymbol{\pi}$ and $\mathbf{v}$ are associated with the Markov chain, which consists of all players' strategies. But recent research reveals that in repeated games, there is an underlying relation between a single player's strategy $\mathbf{p}$ and the limit distribution of states $\mathbf{v}$ (Akin 2012; Hilbe et al. 2014). We write this result in a multi-player form as follows.

Lemma 1. Let $\mathbf{p}$ denote the focal player's memory-one strategy and let $\mathbf{p}^R$ denote the Repeat strategy. Then for any limit distribution $\mathbf{v}$ (irrespective of the outcome of the initial stage), we have
\[
(\mathbf{p} - \mathbf{p}^R) \cdot \mathbf{v} = 0. \tag{2}
\]

This general relation requires no additional information about the game. We call it Akin's Lemma, and it will be an important foundation for our following analysis.

Cooperation Enforcing Strategy

Traditionally, in repeated multiplayer games, one player cannot unilaterally manipulate the evolution of the game, and it is even more difficult for a single player to enforce stable cooperation among all players. However, based on the model in the previous section, we now show that one player can delicately design a strategy whereby any single opponent can maximize his expected payoff only via global cooperation. We call such strategies cooperation enforcing and formally define them as follows.

Definition 4 (Cooperation Enforcing Strategy). A memory-one strategy $\mathbf{p}$ for player $i$ is called cooperation enforcing if and only if the following conditions hold:
(1) Player $i$ cooperates in the first stage.
(2) $p_{c,n-1} = 1$, i.e., player $i$ continues to cooperate if all the opponents cooperated in the previous stage.
(3) Either for all players $l \in \{1, 2, \cdots, n\}$, $\pi_l = R_{c,n-1}$, or for any opponent $j \in \{1, 2, \cdots, n\} \setminus \{i\}$, $\pi_j < R_{c,n-1}$.

According to Definition 4, if a focal player adopts a cooperation enforcing strategy, the payoffs of her opponents only have two possible forms: either less than or equal to the payoff of mutual cooperation. As a result, the focal player not only controls the upper bound of each opponent's payoff, but also promotes mutual cooperation. Based on this definition, we can immediately obtain the following propositions.

Proposition 1. A memory-one strategy $\mathbf{p}$ is cooperation enforcing only if $p_{c,n-2} < 1$.

Proof. Suppose that player $i$ uses a memory-one cooperation enforcing strategy $\mathbf{p}$ with $p_{c,n-2} = 1$ while one of her opponents $j$ adopts ALLD and all of the remaining players apply ALLC. In this case, the state in which $j$ defects and all his opponents cooperate is stationary. As a result, the expected payoff of $j$ will be $R_{d,n-1} > R_{c,n-1}$, which contradicts condition (3) in Definition 4.

Proposition 2. In a repeated public goods game, cooperation enforcing strategies exist only if $\frac{r}{n} > \frac{1}{2}$.

Proof. Assume that in a repeated public goods game with $\frac{r}{n} \le \frac{1}{2}$, $i$ adopts a memory-one strategy while $j$ uses ALLD and the remaining players all use ALLC. In this case, no matter what $i$'s choice is in each round, $\pi_j$ is between $R_{d,n-2}$ and $R_{d,n-1}$. From $\frac{r}{n} \le \frac{1}{2}$, we have $R_{d,n-1} > R_{d,n-2} \ge R_{c,n-1}$. As a result, $j$ could obtain an expected payoff $\pi_j > R_{c,n-1}$. This contradicts point (3) in Definition 4.

The above two propositions indicate basic necessary conditions for the cooperation enforcing strategies, which could be convenient for verifying the suitable game structure and designing the strategies. Now we show that in a repeated PGG with perfect monitoring, if all players use certain cooperation enforcing strategies, they actually constitute an equilibrium. The equilibrium concept we use is Markov Perfect Equilibrium (MPE) (Maskin and Tirole 2001), which is a refinement of subgame perfect equilibrium (SPE).

Proposition 3. If every player $i \in \{1, \cdots, n\}$ adopts a cooperation enforcing strategy $\mathbf{p}_i$, then the strategy profile $(\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_n)$ is a Markov Perfect Equilibrium.

Proof. According to (1) and (2) in Definition 4, every player cooperates in the first stage and will continue cooperating if all the others also cooperated in the previous stage; thus the $n$-player mutual cooperation recursively sustains and every player obtains the same expected payoff $R_{c,n-1}$. It is also known from (3) that if some players adopt cooperation enforcing strategies, the others' payoffs cannot exceed $R_{c,n-1}$. Cooperation is always the best response and any unilateral deviation induces a payoff loss. Thus a profile of cooperation enforcing strategies is a Nash equilibrium. Since the game is infinitely repeated, the strategies actually constitute a subgame perfect equilibrium. The cooperation enforcing strategies we defined are Markov strategies; therefore, the profile of cooperation enforcing strategies is a Markov perfect equilibrium and generates a stable mutual cooperation.

Definition 4 describes our goal for designing the cooperation enforcing strategies, but the conditions are still not operational enough. Now we begin to do some reasoning and derivations, which will finally help us to obtain the detailed constraints for the cooperation enforcing strategies. We first show the following two lemmas as stepping stones.

Lemma 2. The third condition in Definition 4 is equivalent to a logical implication:
\[
\exists j \neq i,\ \pi_j \ge R_{c,n-1} \;\Rightarrow\; v_{c^n} = 1. \tag{3}
\]

Proof. The third condition in Definition 4 can be formally rewritten as a logical disjunction $(\forall j \neq i,\ \pi_j < R_{c,n-1}) \vee (\forall l,\ \pi_l = R_{c,n-1})$. According to basic logical theory, this disjunction is equivalent to an implication $\neg(\forall j \neq i,\ \pi_j < R_{c,n-1}) \Rightarrow (\forall l,\ \pi_l = R_{c,n-1})$, which is further simplified as $\exists j \neq i,\ \pi_j \ge R_{c,n-1} \Rightarrow \forall l,\ \pi_l = R_{c,n-1}$. Now we only need to prove the following equivalence relation
\[
\forall l,\ \pi_l = R_{c,n-1} \;\Leftrightarrow\; v_{c^n} = 1. \tag{4}
\]
Since the social optimum where every player obtains an expected payoff $R_{c,n-1}$ is reached only when the mutual cooperation state is stable, i.e., $v_{c^n} = 1$ and all the other elements in $\mathbf{v}$ are 0, eq. (4) holds and this completes the proof.

Lemma 2 is of great importance since it shows that the maximum payoff for any opponent $j$ is $\pi_j = R_{c,n-1}$ and is realized only when the $n$-player mutual cooperation is stable. Any type of unilateral deviation from cooperation will possibly incur a payoff decrease, thus the temptation of free riding is suppressed. One can see that the cooperation enforcing strategy essentially sets up a payoff upper bound for each opponent. As long as the opponents are rational, in order to reach the payoff upper bound, they will sustain cooperation.

Now we know the basic mathematical constraints for $\mathbf{v}$. To derive the detailed constraints for the cooperation enforcing strategy $\mathbf{p}$, we need to further expand eq. (2) in Akin's lemma and do some derivation and analysis. To this end, we first need to analyze the game from a joint view of the focal player $i$ and a specific opponent $j$. Based on Definition 1, we define the following marginal limit distribution.

Definition 5 (Marginal Limit Distribution). Let $(a_i a_j, k)$ denote a stable outcome in which $i$ and $j$'s actions are $a_i$ and $a_j$, respectively, and there are $k = \#_c(a_{-ij}) \in M = \{0, 1, \cdots, n-2\}$ cooperators among the other $n-2$ players. Denote by $u_{a_i a_j,k}$ the total probability of all such outcomes,
\[
u_{a_i a_j,k} = \sum_{\#_c(a_{-ij}) = k} v_{a_i a_j, a_{-ij}}. \tag{5}
\]
Then $\mathbf{u} = (u_{a_i a_j,k})_{a_i,a_j \in A,\, k \in M}$ is the marginal limit distribution, which is a $4(n-1)$-dimensional vector. Specifically, $u_{cc,n-2} = v_{c^n}$ and $u_{dd,0} = v_{d^n}$.

According to Definition 5, we can now expand eq. (2) and calculate the detailed relation between the game's limit distribution and a single player's strategy.

Lemma 3. Given player $i$'s strategy $\mathbf{p}$ and the marginal limit distribution $\mathbf{u} = (u_{a_i a_j,k})_{a_i,a_j \in A,\, k \in M}$, the formula in Akin's lemma for $\mathbf{p}$ can be rewritten as
\[
\begin{aligned}
(1 - p_{c,n-2})\, u_{cd,n-2} ={}& (-1 + p_{c,n-1})\, u_{cc,n-2} + (-1 + p_{c,n-2})\, u_{cc,n-3} + \cdots \\
&+ (-1 + p_{c,k})(u_{cc,k-1} + u_{cd,k}) + p_{d,k}(u_{dc,k-1} + u_{dd,k}) \\
&+ \cdots + p_{d,0}\, u_{dd,0}.
\end{aligned} \tag{6}
\]
Moreover, the difference between the expected payoff of $j$ and that of mutual cooperation is written as
\[
\begin{aligned}
\pi_j - R_{c,n-1} ={}& (R_{c,n-2} - R_{c,n-1})(u_{cc,n-3} + u_{dc,n-2}) + \cdots \\
&+ (R_{c,k} - R_{c,n-1})(u_{cc,k-1} + u_{dc,k}) \\
&+ (R_{d,k} - R_{c,n-1})(u_{cd,k-1} + u_{dd,k}) + \cdots \\
&+ (R_{d,0} - R_{c,n-1})\, u_{dd,0}.
\end{aligned} \tag{7}
\]

Proof. Eq. (6) uses the notation in Definition 5 and is directly derived from Akin's lemma. Since $\pi_j - R_{c,n-1} = (\boldsymbol{\pi}_j - R_{c,n-1}\mathbf{1}) \cdot \mathbf{v}$, merging similar terms gives eq. (7).

Based on the above stepping stones, we can now propose a solution for the cooperation enforcing strategies.

Theorem 1. In the repeated public goods game with $r > \frac{n}{2}$, if a memory-one strategy $\mathbf{p}$ cooperates in the first stage and satisfies the following constraints:
\[
\begin{cases}
p_{c,n-1} = 1 \\[2pt]
p_{c,n-2} < 1 \\[2pt]
p_{d,n-1} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{c,n-2})}{R_{d,n-1} - R_{c,n-1}} \\[8pt]
p_{d,n-2} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{d,n-2})}{R_{d,n-1} - R_{c,n-1}} \\[4pt]
\quad\cdots \\[2pt]
p_{d,k} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{d,k})}{R_{d,n-1} - R_{c,n-1}} \\[4pt]
\quad\cdots \\[2pt]
p_{d,0} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{d,0})}{R_{d,n-1} - R_{c,n-1}}
\end{cases} \tag{8}
\]
then $\mathbf{p}$ is a cooperation enforcing strategy.

Proof. Conditions (1) and (2) in Definition 4 are trivially satisfied. We only need to prove that $\mathbf{p}$ meets condition (3).

Suppose that $i$ adopts $\mathbf{p}$ and $j$ is an opponent with objective $\pi_j \ge R_{c,n-1}$. Let $\mathbf{u}$ be the marginal limit distribution. By Proposition 1, we have $1 - p_{c,n-2} > 0$. Thus multiplying both sides by $1 - p_{c,n-2}$ does not change the inequality. After that, introducing eq. (6) into $(1 - p_{c,n-2})(\pi_j - R_{c,n-1}) \ge 0$, we have
\[
(1 - p_{c,n-2})(\pi_j - R_{c,n-1}) = b_{cc,n-3}\, u_{cc,n-3} + \cdots + b_{dd,0}\, u_{dd,0} \ge 0, \tag{9}
\]
where each $b_{a_i a_j,k}$ is the coefficient of $u_{a_i a_j,k}$. According to Lemma 2, we only need to prove that the following implication is true:
\[
b_{cc,n-3}\, u_{cc,n-3} + \cdots + b_{dd,0}\, u_{dd,0} \ge 0 \;\Rightarrow\; u_{cc,n-3} = \cdots = u_{dd,0} = 0. \tag{10}
\]
Since each element of the marginal limit distribution satisfies $u_{a_i a_j,k} \ge 0$, one sufficient condition for implication (10) to be true is that all the coefficients $b_{a_i a_j,k} < 0$. Note that, by Lemma 3, each $b_{a_i a_j,k}$ is a function of the corresponding $\mathbf{p}$.

There are two corresponding constraints for $p_{c,k}$: $b_{cc,k-1} < 0$ and $b_{cd,k} < 0$. These two inequalities require:
\[
\begin{cases}
p_{c,k} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{c,k})}{R_{d,n-1} - R_{c,n-1}} + 1 \\[8pt]
p_{c,k} < \dfrac{(1 - p_{c,n-2})(R_{c,n-1} - R_{d,k+1})}{R_{d,n-1} - R_{c,n-1}} + 1
\end{cases} \tag{11}
\]
Since Theorem 1 requires $r > \frac{n}{2}$, we have $R_{c,n-1} > R_{d,k+1} > R_{c,k}$ for all $k \in \{1, 2, \cdots, n-3\}$. As a result, the right sides of the inequalities in eq. (11) are all greater than 1 and the constraints all vanish. As for $p_{c,0}$, there is only one constraint $b_{cd,0} < 0$ and we can immediately verify that the constraint is satisfied. The same conclusion can be derived for $p_{c,n-2}$.

Likewise, for $p_{d,k}$, we have the following constraint
\[
p_{d,k} < \frac{(1 - p_{c,n-2})(R_{c,n-1} - R_{d,k})}{R_{d,n-1} - R_{c,n-1}}. \tag{12}
\]
As for $p_{d,n-1}$, it should satisfy the constraint $b_{dc,n-2} < 0$, which results in
\[
p_{d,n-1} < \frac{(1 - p_{c,n-2})(R_{c,n-1} - R_{c,n-2})}{R_{d,n-1} - R_{c,n-1}}. \tag{13}
\]
Furthermore, because $R_{c,n-1} > R_{d,n-2} > \cdots > R_{d,0}$ and $R_{c,n-1} > R_{c,n-2}$, the right sides of the above inequalities are all greater than 0, which makes these constraints enforceable. Collecting all of the inequalities above, we com- [...]

\[
\tilde{\pi} = \frac{k R_{c,k+n-m-1} + (m - k) R_{d,k+n-m}}{m}. \tag{14}
\]
The difference between the collusive payoff and the equilibrium (cooperative) payoff is:
\[
\tilde{\pi} - R_{c,n-1} = \frac{(m - k)(n - mr)}{mn}. \tag{15}
\]
This difference indicates that collusion is beneficial only when $r < \frac{n}{m}$, but this contradicts the necessary condition for $\mathbf{p}$. Therefore, in a repeated PGG, if a cooperation enforcing strategy $\mathbf{p}$ exists, then it is collusion resistant.

[Figure 1 plot: x-axis: Group size n (2 to 8); y-axis: Factor r (1 to 7).]

Figure 1: Schematic representation of parameter regions for the repeated public goods game.

Figure 1 shows in which parameter regions the cooperation enforcing strategies can exist, given that the public goods game forms a social dilemma (i.e., $1 < r < n$) [...]

[...] of WSLS and GT. Take WSLS as an example: what counts as "win" and "lose" is correlated with the payoffs of all players, which are further affected by how many other players cooperate and defect. Thus, there could be more general definitions of the terms "win" and "lose". As shown in the above two corollaries, with the help of Theorem 1, we can at least identify multiple strategies which possess the good features that traditional WSLS also has. These WSLS-like strategies could be generalized versions of the traditional WSLS for multi-player games.

Although Grim Trigger strategies can form a Markov perfect equilibrium, this kind of strategy has no tolerance towards other players' mistakes, so it may lead to poor results in the real world, especially when there is noise and uncertainty. WSLS behaves more kindly than GT and is more robust in noisy environments (Nowak and Sigmund 1993).

To give an intuitive understanding of the effect of the cooperation enforcing strategies, we now use WSLS to make a case study. In Figure 2, we show an example of how the focal player $x$ adopting a cooperation enforcing strategy constrains the expected payoffs of her opponents $y$ and $z$. Player $x$ uses the WSLS strategy, whereas the strategies of $y$ and $z$ are randomly sampled from the space of memory-one strategies $10^5$ times. The cube in the graph is the space of the three players' payoff tuple $(\pi_x, \pi_y, \pi_z)$, with axes $x$, $y$ and $z$ representing the payoffs of players $x$, $y$ and $z$ respectively. Each blue dot corresponds to a specific expected payoff tuple. The pink plane $\pi_y = R_{c,n-1}$ on the left and the yellow one $\pi_z = R_{c,n-1}$ on the top illustrate the upper bounds of $y$'s and $z$'s payoffs, respectively. We can see that the area of blue dots never crosses these two planes, which indicates that none of the opponents can obtain a payoff greater than that of mutual cooperation. The social optimum $(R_{c,n-1}, R_{c,n-1}, R_{c,n-1})$ is a red point, which is the only point of intersection of the blue region and the two bound planes.

Cooperation Enforcement on Learning and Collusive Players

In the previous sections, we theoretically identified the cooperation enforcing and collusion resistant strategies and found that a collection of such strategies constitutes a Markov perfect equilibrium. In real-world multi-agent systems, if a player has no idea of the delicately designed strategies, being a learning player and gradually improving her own policy based on interaction history is a more realistic choice. In this section, we will show that, against the cooperation enforcing strategies, even a non-strategic learning opponent can gradually learn to cooperate.

From a learning player's point of view, a repeated game can be regarded as sequential decision-making under uncertainty. In each stage game, she makes a decision according to the history of interactions with her opponents. In this paper, we only consider the history consisting of the previous stage game outcome. The uncertainty comes from the inherent stochastic properties or evolution mechanisms of other players' policies. For the learning player, her optimal plan in this MDP of the game can be solved by using dynamic programming (DP) based on the Bellman optimality equation (Mahadevan 1996)
\[
Q^*(o, a) = \max_{a' \in A} \mathbb{E}\left[ R_{o'} - R^* + Q^*(o', a') \right]. \tag{16}
\]
Here $Q^*(o, a)$ is the action value function which estimates the score for selecting action $a$ after observing the previous stage game outcome $o$. After executing $a$, the learning player receives a result of the current stage $o'$ and obtains a payoff $R_{o'}$. $R^*$ is the optimal average payoff. There is much research on solving the Bellman optimality equation, and reinforcement learning is one approach which has recently attracted much attention (Sutton and Barto 2018). We use the average reward reinforcement learning approach (Gosavi 2004) and implement it in Algorithm 1.
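Both the WSLS case study and the simulations in this section rely on WSLS falling under Theorem 1. As a quick sanity check, the following Python sketch evaluates the stage payoffs of Definition 3 and tests the constraints in (8) for a given memory-one vector. The function names, the list layout `p_c[k]` and `p_d[k]`, and the WSLS vector used here (cooperate only after a stage in which all players chose the same action) are illustrative assumptions, not notation taken from the paper.

```python
def stage_payoff(action, k, n, r):
    """Stage payoff R_{a,k} of Definition 3 (endowment e = 1).

    action: "c" or "d"; k: number of cooperating opponents.
    """
    return r * (k + 1) / n - 1 if action == "c" else r * k / n


def satisfies_theorem1(p_c, p_d, n, r):
    """Check the sufficient conditions (8) of Theorem 1.

    p_c[k] and p_d[k] (k = 0..n-1) are the probabilities of cooperating
    after the focal player cooperated / defected while k opponents
    cooperated. This list layout is an assumption of this sketch.
    """
    if not (n / 2 < r < n):                 # r > n/2, and r < n from Definition 3
        return False
    if p_c[n - 1] != 1 or not p_c[n - 2] < 1:
        return False
    r_mutual = stage_payoff("c", n - 1, n, r)           # R_{c,n-1}
    denom = stage_payoff("d", n - 1, n, r) - r_mutual   # R_{d,n-1} - R_{c,n-1}
    slack = 1 - p_c[n - 2]
    # Bound on p_{d,n-1}
    bound_top = slack * (r_mutual - stage_payoff("c", n - 2, n, r)) / denom
    if not p_d[n - 1] < bound_top:
        return False
    # Bounds on p_{d,k} for k = 0..n-2
    return all(
        p_d[k] < slack * (r_mutual - stage_payoff("d", k, n, r)) / denom
        for k in range(n - 1)
    )


# Assumed WSLS for n = 3: cooperate only after ccc or ddd.
wsls_c, wsls_d = [0, 0, 1], [1, 0, 0]
print(satisfies_theorem1(wsls_c, wsls_d, n=3, r=2.0))
# ALLC violates p_{c,n-2} < 1 (Proposition 1).
print(satisfies_theorem1([1, 1, 1], [1, 1, 1], n=3, r=2.0))
```

For $n = 3$ and $r = 2$ the right-hand sides of (8) for $p_{d,0}$, $p_{d,1}$ and $p_{d,2}$ evaluate to roughly 3, 1 and 2, so the assumed WSLS vector passes, whereas ALLC fails on $p_{c,n-2} < 1$.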

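The average-reward update listed in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a single learner faces $n - 1$ fixed WSLS players, WSLS is taken to cooperate only after a stage in which all players chose the same action, and the values of `alpha`, `beta`, `eps` and the seed are arbitrary illustrative choices rather than the settings used for Figure 3.

```python
import random

ACTIONS = ("c", "d")


def stage_payoff(action, k, n, r):
    """Stage payoff of Definition 3 with k cooperating opponents (e = 1)."""
    return r * (k + 1) / n - 1 if action == "c" else r * k / n


def wsls(prev_outcome):
    """Assumed WSLS rule: cooperate iff all players acted alike last stage."""
    return "c" if len(set(prev_outcome)) == 1 else "d"


def run_learner(stages=5000, n=3, r=2.0, alpha=0.1, beta=0.01, eps=0.1, seed=0):
    """One learner (last position) against n - 1 WSLS players.

    Returns the learner's average payoff, the estimate R-bar and the Q table.
    """
    rng = random.Random(seed)
    q = {}                       # Q(o, a), missing entries count as 0
    r_bar = 0.0                  # estimate of the average payoff
    prev = ("c",) * n            # o(0) = c^n
    total = 0.0
    for t in range(1, stages + 1):
        # epsilon-greedy action based on Q(o(t-1), .)
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: q.get((prev, b), 0.0))
        # All WSLS players see the same outcome, so they act identically.
        outcome = (wsls(prev),) * (n - 1) + (a,)
        payoff = stage_payoff(a, outcome[:-1].count("c"), n, r)
        # Temporal-difference error and Q update from Algorithm 1
        best_next = max(q.get((outcome, b), 0.0) for b in ACTIONS)
        delta = payoff - r_bar + best_next - q.get((prev, a), 0.0)
        q[(prev, a)] = q.get((prev, a), 0.0) + alpha * delta
        # Update R-bar only when the chosen action is currently greedy
        if q[(prev, a)] == max(q.get((prev, b), 0.0) for b in ACTIONS):
            r_bar = (1 - beta) * r_bar + beta * ((t - 1) * r_bar + payoff) / t
        total += payoff
        prev = outcome
    return total / stages, r_bar, q


avg, r_bar, _ = run_learner()
print(f"average payoff of the learner: {avg:.3f}, R-bar estimate: {r_bar:.3f}")
```

In this setting each stage in which the learner earns $R_{d,n-1}$ is followed by a stage in which the WSLS players defect, so the learner's running average cannot exceed $R_{c,n-1} = 1$ by more than a vanishing margin, whatever it learns. How quickly it settles on cooperation depends on the assumed parameters.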
Algorithm 1: A Learning Player's Strategy
  Initialize a matrix: $Q(o, a) \leftarrow 0$ for all $o \in A^n$, $a \in A$;
  Initialize an estimate of the average payoff $\bar{R} \leftarrow 0$;
  Set outcome of the initial stage game $o(0) \leftarrow c^n$;
  Set the learning rate parameters $\alpha$, $\beta$;
  for $t = 1, 2, \cdots$ do
    Take action $a$ with $\epsilon$-greedy policy based on $Q(o(t-1), a)$;
    Receive stage game outcome $o(t)$ and payoff $R$;
    $\delta \leftarrow R - \bar{R} + \max_{a'} Q(o(t), a') - Q(o(t-1), a)$;
    $Q(o(t-1), a) \leftarrow Q(o(t-1), a) + \alpha\delta$;
    if $Q(o(t-1), a) = \max_{a'} Q(o(t-1), a')$ then
      $\bar{R} \leftarrow (1 - \beta)\bar{R} + \beta[(t-1)\bar{R} + R]/t$;
    end
  end

Figure 2: Illustration of a cooperation enforcing strategy in a 3-player repeated public goods game with $r = 2$. The strategy of the focal player $x$ is WSLS while the strategies of $y$ and $z$ are sampled randomly from the space of memory-one strategies. Each blue dot corresponds to a feasible expected payoff tuple $(\pi_x, \pi_y, \pi_z)$ and the coordinate of the red point in the corner is $(R_{c,n-1}, R_{c,n-1}, R_{c,n-1})$. The pink (left) plane and the yellow (top) one are $\pi_y = R_{c,n-1}$ and $\pi_z = R_{c,n-1}$ respectively.

To investigate whether cooperation enforcing strategies can bring about mutual cooperation when their adopters confront learning players, several scenarios need to be analyzed and simulated. For the first scenario, to analyze the equilibrium, we let all players adopt cooperation enforcing strategies, except for one who uses a reinforcement learning approach. For the second scenario, there are more independent learning players, i.e., some players use cooperation enforcing strategies while others independently adopt reinforcement learning strategies. For the third scenario, we assume the cooperation enforcing strategy is declared in advance and analyze its performance when the followers are learning players, a setting also known as a Stackelberg game (DeMiguel and Xu 2009). The simulation results are shown in Figure 3.

[Figure 3 plot: panels A, B and C; x-axis: Stage number (0 to 20000); y-axis: Average payoff till that stage (-0.2 to 1.2); curves for players x, y and z.]

Figure 3: Illustration of average payoffs during the 3-player repeated public goods game (with $r = 2$) in three different scenarios. The payoff of mutual cooperation is $R_{c,n-1} = 1$. In each panel, the strategy of player $x$ is fixed to a specific cooperation enforcing strategy WSLS whereas the policies of $y$ and $z$ can be WSLS or reinforcement learning. (A) $y$ uses WSLS while $z$ applies reinforcement learning. (B) $y$ and $z$ independently evolve their own policies via reinforcement learning. (C) In this scenario, the strategy of $x$ is declared beforehand whereas $y$ and $z$ are learning followers who make an alliance.

In Proposition 3, we have shown that when all players adopt cooperation enforcing strategies, they are in an equilibrium and a player who deviates cannot obtain a payoff greater than that of mutual cooperation. There remains a question here: if the deviator is not malicious but just a learning player who has no idea of the cooperation enforcing strategies, can she eventually learn to cooperate? The answer should be yes. If the strategies of all the other players are fixed to cooperation enforcing strategies, the environment for the only learning player is not that complicated. Besides, mutual cooperation is the only optimal solution in this scenario. Therefore, as long as the reinforcement learning player sufficiently explores this environment, she will generate a "near-optimal" solution to eq. (16) and evolve to cooperation. Figure 3(A) illustrates that a learning player learns to cooperate quickly and shares the optimum with all the players using cooperation enforcing strategies.

To analyze more complicated scenarios where multiple learning players exist, we set up a simulation in which only a portion of the players use cooperation enforcing strategies while the others learn. In this case, however, the learning players cannot finally find the optimal solution. This is because they optimize their own policies independently, which leads to a failure to obtain the dynamic global knowledge of the whole game system. If other players adapt their strategies, the environment for one player will be non-stationary and the optimization target for the learning will shift. Actually, multi-agent learning is still a challenging issue in the artificial intelligence community. Figure 3(B) shows an example of the collapse of mutual cooperation, in which multiple players independently learning their own policies fail to reach mutual cooperation.

Although it seems difficult for cooperation enforcing strategies to bring about cooperation when their adopters confront multiple independent learning opponents, we can conquer this difficulty by establishing a Stackelberg game. In this Stackelberg setting, some players need to declare their cooperation enforcing strategies in advance and then the learning players can explore and evolve their strategies. Eventually, all the learning players will find the best responses to the declared policies, which is to fully cooperate. We refer to those players who commit to strategies as leaders and the others as followers. Because followers are aware of the strategies of the leaders, they can even coordinate their policies with each other and work in collusion to compete with the leaders. We can extend the reinforcement learning method for a single player to this alliance. The fixed strategies of the leaders produce a stationary environment for the alliance. Therefore, mutual cooperation can be enforced with a learning alliance. Figure 3(C) provides an instance showing that if cooperation enforcing strategies are Stackelberg leader strategies, they do promote cooperation.

Conclusions and Future Work

We present a class of delicately designed Markov strategies for repeated games. Using such a strategy, a single player can simultaneously constrain the utilities of all her opponents and enforce mutual cooperation among all players. The optimal value of any opponent's utility can only be realized through all players' cooperating. Moreover, we prove that as long as there exists at least one player adopting such a strategy, any type of colluding alliance cannot get a higher payoff than that of $n$-player mutual cooperation. Our results also show that these strategies can still promote cooperation, even when the opponents are non-strategic learning players. Although we are using the conventional repeated public goods game model, the approaches in our paper could be easily extended to various multi-player games. Immediate future work is about the effect of a non-linear marginal per-capita rate of return on the group cooperation, the generalization of cooperation enforcement in stochastic games with large action space (McAvoy and Hauert 2016), as well as the extension into multi-agent systems where agents make use of different orders of theory of mind (Albrecht and Stone 2018). Moreover, further research on multi-agent reinforcement learning (Perolat et al. 2017) could be potentially inspired by this work.

References

[Akin 2012] Akin, E. 2012. Stable cooperative solutions for the iterated prisoner's dilemma. arXiv preprint 1211.0969.
[Albrecht and Stone 2018] Albrecht, S. V., and Stone, P. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence 258:66–95.
[Dawes 1980] Dawes, R. M. 1980. Social dilemmas. Annual Review of Psychology 31(1):169–193.
[DeMiguel and Xu 2009] DeMiguel, V., and Xu, H. 2009. A stochastic multiple-leader stackelberg model: analysis, computation, and application. Operations Research 57(5):1220–1235.
[Dietz, Ostrom, and Stern 2003] Dietz, T.; Ostrom, E.; and Stern, P. C. 2003. The struggle to govern the commons. Science 302(5652):1907–1912.
[Maskin and Tirole 2001] Maskin, E., and Tirole, J. 2001. Markov perfect equilibrium: I. observable actions. Journal of Economic Theory 100(2):191–219.
[McAvoy and Hauert 2016] McAvoy, A., and Hauert, C. 2016. Autocratic strategies for iterated games with arbitrary action spaces. Proceedings of the National Academy of Sciences 113(13):3573–3578.
[Nowak and Sigmund 1993] Nowak, M., and Sigmund, K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game. Nature 364(6432):56.
[Nowak 2006] Nowak, M. A. 2006. Five rules for the evolution of cooperation. Science 314(5805):1560–1563.
[Olson 2009] Olson, M. 2009. The Logic of Collective Action: Public Goods and the Theory of Groups, volume 124. Harvard University Press.
[Panait and Luke 2005] Panait, L., and Luke, S. 2005. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems 11(3):387–434.
[Perolat et al. 2017] Perolat, J.; Leibo, J. Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; and Graepel, T. 2017. A multi-agent reinforcement learning model of common-pool resource appropriation. In NIPS 2017, 3643–3652.
[Santos, Pacheco, and Santos 2018] Santos, F. P.; Pacheco, J. M.; and Santos, F. C. 2018. Social norms of coopera-
tion with costly reputation building. In AAAI 2018. [Epstein 2001] Epstein, J. M. 2001. Learning to be thought- [Santos 2017] Santos, F. P. 2017. Social norms of coopera- less: Social norms and individual computation. Computa- tion in multiagent systems. In AAMAS 2017, 1859–1860. tional Economics 18(1):9–24. [SuttonandBarto2018] Sutton, R. S., and Barto, A. G. [Gosavi 2004] Gosavi, A. 2004. Reinforcement learning for 2018. Reinforcement Learning: An Introduction. MIT press. long-run average cost. European Journal of Operational Re- [Telser 2017] Telser, L. G. 2017. Competition, Collusion, search 155(3):654–674. and . Routledge. [Hardin1968] Hardin, G. 1968. The tragedy of the com- [Turner 1993] Turner, R. M. 1993. The Tragedy of the Com- mons. Science 162(3859):1243–1248. mons and Distributed AI Systems. Department of Computer [Hilbeetal. 2014] Hilbe, C.; Wu, B.; Traulsen, A.; and Science, University of New Hampshire. Nowak, M. A. 2014. Cooperation and control in multiplayer [Wooldridge 2009] Wooldridge, M. 2009. An Introduction social dilemmas. Proceedings of the National Academy of to Multiagent Systems. John Wiley & Sons. Sciences 111(46):16425–16430. [Young 2015] Young, H. P. 2015. The evolution of social [Isaac, Walker, and Thomas 1984] Isaac, R. M.; Walker, norms. Annual Review of Economics 7(1):359–387. J. M.; and Thomas, S. H. 1984. Divergent evidence on free riding: An experimental examination of possible expla- nations. Public Choice 43(2):113–149. [Kraus 1997] Kraus, S. 1997. Negotiation and cooperation in multi-agent environments. Artificial Intelligence 94(1- 2):79–97. [Ledyard 1994] Ledyard, J. O. 1994. Public goods: A sur- vey of experimental research. Handbook of 111–251. [Leibo et al. 2017] Leibo, J. Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; and Graepel, T. 2017. Multi-agent reinforce- ment learning in sequential social dilemmas. In AAMAS 2017, 464–473. [Mahadevan 1996] Mahadevan, S. 1996. 
Average reward re- inforcement learning: Foundations, algorithms, and empiri- cal results. Machine Learning 22(1-3):159–195.
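As a rough illustration of the single-learner scenario of Figure 3(A), the following Python sketch checks that, against two WSLS players in the 3-player game with r = 2, no deterministic memory-one policy of the third player earns a long-run average payoff above the mutual-cooperation payoff of 1. It is only a sketch under stated assumptions, not the simulation used in the paper: the contribution cost is taken to be 1, the n-player WSLS rule is assumed to be "cooperate if and only if all players chose the same action in the previous round" (one common multiplayer formulation), and an exhaustive search over memory-one policies stands in for reinforcement learning. The names payoff, wsls and long_run_average are hypothetical helpers introduced here.

```python
from fractions import Fraction
from itertools import product

# Assumed parameters: 3 players, multiplication factor r = 2, unit contribution cost.
N, R, COST = 3, 2, 1


def payoff(own, profile):
    """Linear public goods payoff: equal share of the multiplied pot minus own cost."""
    return Fraction(R * COST * sum(profile), N) - COST * own


def wsls(prev):
    """Assumed n-player WSLS: cooperate (1) in round one and whenever all
    players chose the same action in the previous round; otherwise defect (0)."""
    return 1 if prev is None or len(set(prev)) == 1 else 0


def long_run_average(z_first, z_rule):
    """Exact average payoff of player z on the limit cycle when x and y play WSLS.

    z_first is z's first-round action; z_rule maps the previous joint action
    profile (x, y, z) to z's next action. The dynamics are deterministic, so
    the play eventually repeats a profile and enters a cycle.
    """
    prev, seen, history = None, {}, []
    while prev not in seen:
        seen[prev] = len(history)          # index of the round played from this state
        z = z_first if prev is None else z_rule[prev]
        profile = (wsls(prev), wsls(prev), z)
        history.append(payoff(z, profile))
        prev = profile
    cycle = history[seen[prev]:]           # payoffs collected along the repeating cycle
    return sum(cycle) / len(cycle)


# Enumerate every deterministic memory-one policy of z (first action + response table).
states = list(product((0, 1), repeat=N))
results = {}
for z_first in (0, 1):
    for actions in product((0, 1), repeat=len(states)):
        rule = dict(zip(states, actions))
        results[(z_first, actions)] = long_run_average(z_first, rule)

best = max(results.values())
always_defect = results[(0, (0,) * len(states))]
print("best long-run average for z:", best)
print("always-defect average for z:", always_defect)
```

Under these assumptions the best value found equals the mutual-cooperation payoff, while unconditional defection alternates between exploiting the two WSLS players and mutual defection and therefore averages less. This is consistent with the claim that a sufficiently exploring learner facing fixed cooperation enforcing strategies has no better option than to cooperate.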