Gradient Play in Multi-Agent Markov Stochastic Games: Stationary Points and Convergence

Runyu Zhang
School of Engineering and Applied Science
Harvard University
[email protected]

Zhaolin Ren
School of Engineering and Applied Science
Harvard University
[email protected]

Na Li
School of Engineering and Applied Science
Harvard University
[email protected]

Abstract

We study the performance of the gradient play algorithm for multi-agent tabular Markov decision processes (MDPs), also known as stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on current state information that is shared between agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first order stationary policies are equivalent in this setting, and give a non-asymptotic global convergence rate analysis to an ε-NE for a subclass of multi-agent MDPs called Markov potential games, which includes the cooperative setting with identical rewards among agents as an important special case. Our result shows that the number of iterations to reach an ε-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered: for Markov potential games, we prove that strict NEs are local maxima of the total potential function and fully mixed NEs are saddle points. We also give a local convergence rate around strict NEs for more general settings.

1 Introduction

The past decade has witnessed significant development in reinforcement learning (RL), which has achieved success in various tasks such as playing Go and video games. It is natural to extend RL techniques to real-life societal systems such as traffic control, autonomous driving, buildings, and energy systems. Since most such large-scale infrastructures are multi-agent in nature, multi-agent reinforcement learning (MARL) has gained increasing attention in recent years [Daneshfar and Bevrani, 2010, Shalev-Shwartz et al., 2016, Vidhate and Kulkarni, 2017, Xu et al., 2020]. Among RL algorithms, policy gradient methods are particularly attractive due to their flexibility and capability to incorporate structured state and action spaces. This property makes them appealing for multi-agent learning, where agents usually need to update their policies through interactions with other agents, either collaboratively or competitively. For instance, many recent works [Zhang et al., 2018, Chen et al., 2018, Wai et al., 2018, Li et al., 2019, Qu et al., 2020] have studied the convergence rate and sample complexity of gradient-based methods for collaborative multi-agent RL problems. In these problems, agents seek to maximize a global reward function collaboratively while the agents' policies and the learning procedure suffer from information constraints, i.e., each agent can only choose its own local actions based on the information it observes. However, due to a lack of understanding of the optimization landscape in these multi-agent learning problems, most such works

can only show convergence to a first-order stationary point. A deeper understanding of the quality of these stationary points is missing even in the simple identical-reward multi-agent RL setting. In contrast, there has been exciting recent theoretical progress on the analysis of the optimization landscape in centralized single-agent RL settings. Recent works have shown that the landscape for single-agent policy optimization enjoys the gradient domination property in both linear control [Fazel et al., 2018] and Markov decision processes (MDPs) [Agarwal et al., 2020], which guarantees that gradient descent/ascent finds the global optimum despite the nonconvex landscape. Motivated by this theoretical progress in single-agent RL, we study the landscape of multi-agent RL problems to see whether similar results hold. In this paper, we center our study on the multi-agent tabular MDP problem. Apart from the identical-reward case mentioned earlier, we also make an attempt to generalize our analysis to game settings where rewards may vary among agents.

The multi-agent tabular MDP problem is also known as the stochastic game (SG) in the field of game theory. The study of SGs dates back to as early as the 1950s with Shapley [1953], where the notion of SGs as well as the existence of Nash equilibria (NEs) were first established. A series of works has since been developed on designing NE-finding algorithms, especially in the RL setting (e.g. [Littman, 1994, Bowling and Veloso, 2000, Shoham et al., 2003, Buşoniu et al., 2010, Lanctot et al., 2017, Zhang et al., 2019a] and citations within). While well-known classical algorithms for solving SGs are mostly value-based, such as Nash-Q learning [Hu and Wellman, 2003], Hyper-Q learning [Tesauro, 2003], and WoLF-PHC [Bowling and Veloso, 2001], gradient-based algorithms have also started to gain popularity in recent years due to the advantages mentioned earlier (e.g. [Abdallah and Lesser, 2008, Foerster et al., 2017, Zhang and Lesser, 2010]). In this work, our aim is to gain a deeper understanding of the structure and quality of first-order stationary points for these gradient-based methods. Specifically, taking a game-theoretic perspective, we strive to shed light on the following questions: 1) How do first-order stationary points relate to the NEs of the underlying game? 2) Do gradient-based algorithms guarantee convergence to a NE? 3) What is the stability of individual NEs?

For simpler finite-action static (stateless) game settings, these questions have already been widely discussed [Shapley, 1964, Crawford, 1985, Jordan, 1993, Krishna and Sjöström, 1998, Shamma and Arslan, 2005, Kohlberg and Mertens, 1986, Van Damme, 1991]. For static continuous games, a recent paper [Mazumdar et al., 2020] in fact proved a negative result which suggests that gradient flow has stationary points (even local maxima) that are not necessarily NEs. Conversely, Zhang et al. [2019b] designed projected nested-gradient methods that provably converge to NEs in zero-sum linear quadratic games with continuous state-action spaces, linear dynamics, and quadratic rewards. However, much less is known in the setting of SGs with finite state-action spaces and general Markov transition probabilities.

Contributions. In this paper, we consider the gradient play algorithm for the infinite-horizon discounted-reward SG, where an agent's local policy is directly parameterized by the probability of choosing an action from the agent's own action space at a given state.
We first show that first-order stationary policies and Nash equilibria are equivalent for these directly parameterized local policies. We derive this by generalizing the gradient domination property in [Agarwal et al., 2020] to the multi-agent setting. Our result does not contradict Mazumdar et al. [2020]'s work, which constructs examples of stable first-order stationary points that are non-NEs, because their counterexamples consider general continuous games where the objective functions may not satisfy gradient domination. Additionally, we provide a global convergence rate analysis for a special class of SGs called Markov potential games [González-Sánchez and Hernández-Lerma, 2013, Macua et al., 2018], which includes identical-reward multi-agent RL [Tan, 1993, Claus and Boutilier, 1998, Panait and Luke, 2005] as an important special case. We show that gradient play (equivalent to projected gradient ascent for Markov potential games) reaches an ε-NE within $O\left(\frac{|S| \sum_i |A_i|}{\epsilon^2}\right)$ steps, where $|S|, |A_i|$ denote the size of the state space and the action space of agent i respectively. The convergence rate shows that the number of iterations to reach an ε-NE scales linearly with the number of agents, instead of exponentially with rate $O\left(\frac{|S| \prod_i |A_i|}{\epsilon^2}\right)$ as in the work of Agarwal et al. [2020]. Although the convergence to NEs of different learning algorithms in static potential games has been very well studied in the literature [Monderer and Shapley, 1996a,b, Monderer and Sela, 1997, Marden et al., 2009], to the best of our knowledge, our result provides the first non-asymptotic analysis of convergence to a NE in stochastic games for gradient play. We also study the local geometry around some specific types of equilibrium points. For Markov potential games, we show that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points.

For general multi-agent MDPs, we show that strict NEs are locally stable under gradient play and provide a local convergence rate analysis.

2 Problem setting and preliminaries

An n-agent (tabular) Markov decision process (MDP), or stochastic game (SG) [Shapley, 1953],

$$M = \big(S,\; \mathcal{A} = A_1 \times A_2 \times \cdots \times A_n,\; P,\; r = (r_1, r_2, \dots, r_n),\; \gamma,\; \rho\big) \qquad (1)$$

is specified by: a finite state space S; a finite action space $\mathcal{A} = A_1 \times A_2 \times \cdots \times A_n$, where $A_i$ is the action space of agent i; a transition model P, where $P(s'|s,a) = P(s'|s, a_1, a_2, \dots, a_n)$ is the probability of transitioning into state $s'$ upon taking action $a := (a_1, \dots, a_n)$ (each agent i taking action $a_i$ respectively) in state s; the i-th agent's reward function $r_i : S \times \mathcal{A} \to [0,1]$, $i = 1, 2, \dots, n$, where $r_i(s,a)$ is the immediate reward of agent i associated with taking action a in state s; a discount factor $\gamma \in [0,1)$; and an initial state distribution ρ over S.

A stochastic policy $\pi : S \to \Delta(\mathcal{A})$ (where $\Delta(\mathcal{A})$ is the probability simplex over $\mathcal{A}$) specifies a decision-making rule in which agents choose their actions jointly based on the current state in a stochastic fashion, i.e., $\Pr(a_t|s_t) = \pi(a_t|s_t)$. A distributed stochastic policy is a special subclass of stochastic policies with $\pi = \pi_1 \times \pi_2 \times \cdots \times \pi_n$, where $\pi_i : S \to \Delta(A_i)$. Under a distributed stochastic policy, each agent takes its action based on the current state s independently of the other agents' choices of actions, i.e.,
$$\Pr(a_t|s_t) = \pi(a_t|s_t) = \prod_{i=1}^n \pi_i(a_{i,t}|s_t), \qquad a_t = (a_{1,t}, \dots, a_{n,t}).$$
For notational simplicity, we define $\pi_I(a_I|s) := \prod_{i \in I} \pi_i(a_i|s)$, where I is an index set that is a subset of $\{1, 2, \dots, n\}$. Further, we use the notation $-i$ to denote the index set $\{1, 2, \dots, n\} \setminus \{i\}$.

Agent i's value function $V_i^\pi : S \to \mathbb{R}$, $i = 1, 2, \dots, n$, is defined as the discounted sum of future rewards starting at state s via executing π, i.e.

" ∞ # π X t Vi (s) := E γ ri(st, at) π, s0 = s , t=0 ∞ where the expectation is with respect to the randomness of trajectory τ = (st, at, ri,t)t=0 with s0 drawn from initial distribution ρ and at ∼ π(·|st), st+1 = P (·|st, at). π π Agent i’s Q-function Qi : S × A → R and the advantage function Ai : S × A → R are defined as: " ∞ # π X t π π π Qi (s, a) := E γ ri(st, at) π, s0 = s, a0 = a ,Ai (s, a) := Qi (s, a) − Vi (s). t=0

Additionally, we define agent i's 'averaged' Q-function $\bar{Q}_i^\pi : S \times A_i \to \mathbb{R}$ and 'averaged' advantage function $\bar{A}_i^\pi : S \times A_i \to \mathbb{R}$ as:

$$\bar{Q}_i^\pi(s, a_i) := \sum_{a_{-i}} \pi_{-i}(a_{-i}|s)\, Q_i^\pi(s, a_i, a_{-i}), \qquad \bar{A}_i^\pi(s, a_i) := \sum_{a_{-i}} \pi_{-i}(a_{-i}|s)\, A_i^\pi(s, a_i, a_{-i}). \qquad (2)$$

Direct distributed policy parameterization. In this work, we consider the direct distributed policy parameterization, i.e., policies are parameterized by $\theta = (\theta_1, \dots, \theta_n)$, where agent i's policy is parameterized by $\theta_i$:
$$\pi_{i,\theta_i}(a_i|s) = \theta_{i,(s,a_i)}, \quad i = 1, 2, \dots, n. \qquad (3)$$

For notational simplicity, we abbreviate $\pi_{i,\theta_i}(a_i|s)$ as $\pi_{\theta_i}(a_i|s)$ and $\theta_{i,(s,a_i)}$ as $\theta_{s,a_i}$. Here $\theta_i \in \Delta(A_i)^{|S|}$, i.e., $\theta_i$ is subject to the constraints $\theta_{s,a_i} \ge 0$ and $\sum_{a_i \in A_i} \theta_{s,a_i} = 1$ for all $s \in S$. The entire policy is given by
$$\pi_\theta(a|s) = \prod_{i=1}^n \pi_{\theta_i}(a_i|s) = \prod_{i=1}^n \theta_{s,a_i}.$$
We also abbreviate $V_i^{\pi_\theta}, Q_i^{\pi_\theta}, A_i^{\pi_\theta}, \bar{Q}_i^{\pi_\theta}, \bar{A}_i^{\pi_\theta}$ as $V_i^{\theta}, Q_i^{\theta}, A_i^{\theta}, \bar{Q}_i^{\theta}, \bar{A}_i^{\theta}$. We use
$$\mathcal{X}_i := \Delta(A_i)^{|S|}, \qquad \mathcal{X} := \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n$$

to denote the feasible regions of $\theta_i$ and θ. Additionally, we denote agent i's total reward as:
$$J_i(\theta) = J_i(\theta_1, \dots, \theta_n) := \mathbb{E}_{s_0 \sim \rho}\, V_i^\theta(s_0).$$
In the game setting, the concept of a Nash equilibrium is often used to characterize the performance of individuals' policies. We provide a definition below.

Definition 1. (Nash equilibrium) Policy $\theta^* = (\theta_1^*, \dots, \theta_n^*)$ is called a Nash equilibrium (NE) if the following inequality holds:
$$J_i(\theta_i^*, \theta_{-i}^*) \ge J_i(\theta_i', \theta_{-i}^*), \quad \forall \theta_i' \in \mathcal{X}_i,\; i = 1, 2, \dots, n.$$
The equilibrium is called a strict NE if the inequality holds strictly, i.e.,
$$J_i(\theta_i^*, \theta_{-i}^*) > J_i(\theta_i', \theta_{-i}^*), \quad \forall \theta_i' \in \mathcal{X}_i,\; \theta_i' \neq \theta_i^*,\; i = 1, 2, \dots, n.$$
The equilibrium is called a pure NE if θ* corresponds to a deterministic policy. An equilibrium that is not pure is called a mixed NE. Further, an equilibrium is called a fully mixed NE if every entry of θ* is strictly positive, i.e.,
$$\theta^*_{s,a_i} > 0, \quad \forall a_i \in A_i,\; \forall s \in S,\; i = 1, 2, \dots, n.$$

Policy gradient for direct distributed parameterization. It is useful to define the discounted state visitation distribution $d_\theta^\mu$ of a policy $\pi_\theta$ given an initial state distribution μ as:
$$d_\theta^\mu(s) := \mathbb{E}_{s_0 \sim \mu}\, (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr{}^\theta(s_t = s \mid s_0), \qquad (4)$$
where $\Pr^\theta(s_t = s|s_0)$ is the probability that $s_t = s$ when executing $\pi_\theta$ starting at state $s_0$. For simplicity, we use $d_\theta(s)$ to denote $d_\theta^\rho(s)$, the discounted state visitation distribution for the initial distribution ρ of the MDP (1). The policy gradient is then given by (e.g. [Sutton et al., 1999]):
$$\nabla_\theta\, \mathbb{E}_{s_0 \sim \rho} V_i^\theta(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_\theta} \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[\nabla_\theta \log \pi_\theta(a|s)\, Q_i^\theta(s,a)\right], \quad i = 1, 2, \dots, n. \qquad (5)$$
Applying (5), we have the following lemma (proof in Appendix A):

Lemma 1. For the direct distributed parameterization (3),

$$\frac{\partial J_i(\theta)}{\partial \theta_{s,a_i}} = \frac{1}{1-\gamma}\, d_\theta(s)\, \bar{Q}_i^\theta(s, a_i). \qquad (6)$$
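As a direct transcription of Lemma 1 for the two-agent array layout used above, the following sketch computes the discounted visitation distribution of Eq. (4) and the averaged Q-function of Eq. (2), then assembles the gradient entries of Eq. (6). It reuses evaluate_values from the previous sketch; the function names and the two-agent restriction are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def discounted_visitation(P, pi_list, rho, gamma):
    """d_theta(s) = (1 - gamma) * rho^T (I - gamma * P_pi)^{-1}, per Eq. (4)."""
    pi1, pi2 = pi_list
    joint = pi1[:, :, None] * pi2[:, None, :]
    P_pi = np.einsum("sab,sabt->st", joint, P)
    S = P.shape[0]
    # Solve (I - gamma * P_pi)^T d = (1 - gamma) * rho for the row vector d.
    return (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

def policy_gradient(P, r, pi_list, rho, gamma, agent):
    """Gradient of J_i w.r.t. theta_i under direct parameterization, Eq. (6)."""
    V = evaluate_values(P, r, pi_list, gamma)[agent]          # from the previous sketch
    Q = r[agent] + gamma * np.einsum("sabt,t->sab", P, V)     # Q_i(s, a1, a2)
    other = 1 - agent
    # Average out the other agent's action to obtain \bar Q_i(s, a_i), Eq. (2).
    if agent == 0:
        Q_bar = np.einsum("sab,sb->sa", Q, pi_list[other])
    else:
        Q_bar = np.einsum("sab,sa->sb", Q, pi_list[other])
    d = discounted_visitation(P, pi_list, rho, gamma)
    return d[:, None] * Q_bar / (1.0 - gamma)                 # shape (S, A_i)
```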

3 Gradient domination for multi-agent MDPs, and the equivalence of first order stationarity and Nash equilibrium

Agarwal et al. [2020] first established gradient domination for the centralized tabular MDP under direct parameterization (Lemma 4.1 in [Agarwal et al., 2020]). We show that a similar property holds for n-agent MDPs. This property plays a critical role in proving the equivalence of first order stationary policies and NEs later in this section.

Lemma 2. (Gradient domination) For the direct distributed parameterization (3), we have that for any $\theta = (\theta_1, \dots, \theta_n) \in \mathcal{X}$:

$$J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}) \le \left\|\frac{d_{\theta'}}{d_\theta}\right\|_\infty \max_{\bar\theta_i \in \mathcal{X}_i} (\bar\theta_i - \theta_i)^\top \nabla_{\theta_i} J_i(\theta), \quad \forall \theta_i' \in \mathcal{X}_i,\; i = 1, 2, \dots, n, \qquad (7)$$

where $\left\|\frac{d_{\theta'}}{d_\theta}\right\|_\infty := \max_s \frac{d_{\theta'}(s)}{d_\theta(s)}$ and $\theta' = (\theta_i', \theta_{-i})$. The proof of Lemma 2 can be found in Appendix B. For the single-agent case (n = 1), (7) is consistent

with the result in Agarwal et al. [2020], i.e., $J(\theta') - J(\theta) \le \left\|\frac{d_{\theta'}}{d_\theta}\right\|_\infty \max_{\bar\theta \in \mathcal{X}} (\bar\theta - \theta)^\top \nabla J(\theta)$. On the right-hand side, $\max_{\bar\theta \in \mathcal{X}} (\bar\theta - \theta)^\top \nabla J(\theta)$ can be understood as a scalar notion of first order stationarity in the constrained optimization setting. Thus, if one can find a θ that is (approximately) a first order stationary point, then gradient domination guarantees that θ is also (near) optimal in terms of function value. Such conditions are a standard device for establishing convergence to global optima in nonconvex optimization, as they effectively rule out the existence of bad critical points. For the multi-agent case, we will show that this property rules out the existence of critical points that are not NEs. We begin by giving a formal definition of first order stationary policies.

Definition 2. (First order stationary policy) Policy $\theta^* = (\theta_1^*, \dots, \theta_n^*)$ is called a first order stationary policy if the following inequality holds:
$$(\theta_i' - \theta_i^*)^\top \nabla_{\theta_i} J_i(\theta^*) \le 0, \quad \forall \theta_i' \in \mathcal{X}_i,\; i = 1, 2, \dots, n.$$
Comparing Definition 1 (for NEs) with Definition 2, we see that NEs are first order stationary policies, but not necessarily vice versa. For each agent i, first order stationarity does not rule out saddle points and does not imply that $\theta_i^*$ is optimal among all possible $\theta_i$ given that the other agents' policies are $\theta_{-i}^*$. Interestingly, however, under an appropriate assumption we can show that NEs are equivalent to first order stationary policies using the gradient domination property. Specifically, we make the following assumption on the MDPs we study.

Assumption 1. The n-agent MDP (1) satisfies $d_\theta(s) > 0$ for all $s \in S$ and all $\theta \in \mathcal{X}$.

Assumption 1 requires that every state is visited with positive probability, which is a standard assumption for proving convergence in the RL literature (e.g. [Agarwal et al., 2020, Mei et al., 2020]).

Theorem 1. Under Assumption 1, first order stationary policies and NEs are equivalent.

Due to the space limit, we defer the proof to Appendix B.

4 Convergence analysis of gradient play for Markov potential games

Since the previous section shows that first order stationary policies and NEs are equivalent, it is natural to explore the performance of first order methods, i.e., gradient-based methods. The simplest such candidate is the gradient play algorithm, which takes the following form:
$$\theta_i^{(t+1)} = \mathrm{Proj}_{\mathcal{X}_i}\big(\theta_i^{(t)} + \eta\, \nabla_{\theta_i} J_i(\theta^{(t)})\big), \quad \eta > 0. \qquad (8)$$
Gradient play can be viewed as a 'better response' strategy, where each agent updates its own parameters by gradient ascent with respect to its own reward. While the gradient play dynamics (8) looks simple and intuitive, it is in fact difficult to show that the dynamics converge to an equilibrium point, especially to a mixed NE, in general stochastic games. Even in the much simpler static game setup, gradient play-based learning algorithms might fail to converge [Shapley, 1964, Crawford, 1985, Jordan, 1993, Krishna and Sjöström, 1998]. The major difficulty lies in the fact that the vector field $\{\nabla_{\theta_i} J_i(\theta)\}_{i=1}^n$ is not a potential gradient vector field. Accordingly, its dynamics can exhibit behavior more complicated than convergence or divergence, e.g., chaos or convergence to a limit cycle. Thus, in this section, we restrict our analysis to a special subclass of games, namely potential games, which are known to enjoy better convergence properties [Monderer and Shapley, 1996a, Monderer and Sela, 1997]. We now give our definition of potential games for n-agent MDPs.

Definition 3. (Markov potential game) A Markov decision process defined as in (1) is called a Markov potential game if there exists a potential function $\phi : S \times A_1 \times \cdots \times A_n \to \mathbb{R}$ such that for any agent i and any pair of policy parameters $(\theta_i, \theta_{-i})$, $(\theta_i', \theta_{-i})$:
$$\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_i(s_t,a_t) \,\Big|\, \pi = (\theta_i', \theta_{-i}), s_0 = s\right] - \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_i(s_t,a_t) \,\Big|\, \pi = (\theta_i, \theta_{-i}), s_0 = s\right]$$
$$= \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t,a_t) \,\Big|\, \pi = (\theta_i', \theta_{-i}), s_0 = s\right] - \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t,a_t) \,\Big|\, \pi = (\theta_i, \theta_{-i}), s_0 = s\right], \quad \forall s.$$

Similar definitions can be found in [González-Sánchez and Hernández-Lerma, 2013, Macua et al., 2018] for continuous game settings, where Markov potential games have many applications, e.g., the great fish war [Levhari and Mirman, 1980] and the stochastic lake game [Dechert and O'Donnell, 2006]. In MDP settings, however, 'Markov potential game' is admittedly a rather strong assumption that is difficult to verify for general MDPs. Nevertheless, one important special class of n-agent MDPs that falls into this category is the identical reward setting, where all agents share the same reward function. Note that in this setting, agents take independent actions not because of competitive behavior, but due to physical constraints or simply to reduce the number of parameters. For a Markov potential game, given a policy θ, we can define the 'total potential function' Φ:
$$\Phi(\theta) := \mathbb{E}_{s_0 \sim \rho}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t,a_t) \,\Big|\, \pi_\theta\right].$$
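Since $\mathcal{X}_i$ is a product of probability simplices (one per state), the projection in the gradient play update (8) decomposes into independent row-wise simplex projections. The sketch below is a minimal illustration of update (8) under that observation, using the sort-based simplex projection of [Wang and Carreira-Perpiñán, 2013] (cited in the references); the helper names, the two-agent setting, and the simultaneous-update form are our own assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (sort-based algorithm of Wang and Carreira-Perpinan, 2013)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho_idx = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    lam = (1.0 - css[rho_idx]) / (rho_idx + 1)
    return np.maximum(v + lam, 0.0)

def gradient_play(P, r, rho, gamma, eta, T, pi_init):
    """Gradient play, Eq. (8): each agent runs projected gradient ascent on its
    own total reward, one simplex (state row) at a time."""
    pi_list = [p.copy() for p in pi_init]
    for _ in range(T):
        grads = [policy_gradient(P, r, pi_list, rho, gamma, i) for i in range(2)]
        new = []
        for i in range(2):
            stepped = pi_list[i] + eta * grads[i]
            new.append(np.vstack([project_simplex(row) for row in stepped]))
        pi_list = new                              # simultaneous update of both agents
    return pi_list
```

For a Markov potential game, this simultaneous update coincides with projected gradient ascent on Φ, as noted in (10) below.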

From the definition of a Markov potential game and the total potential function, we have the following proposition, which guarantees that a Markov potential game has at least one pure NE.

Proposition 1. (Proof in Appendix C) For a Markov potential game, there is at least one global maximum θ* of the total potential function Φ, i.e., a $\theta^* \in \operatorname{argmax}_{\theta \in \mathcal{X}} \Phi(\theta)$, that is a pure NE.

From the definition of the total potential function we obtain the following relationship:
$$J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}) = \Phi(\theta_i', \theta_{-i}) - \Phi(\theta_i, \theta_{-i}). \qquad (9)$$

Thus, $\nabla_{\theta_i} J_i(\theta) = \nabla_{\theta_i} \Phi(\theta)$, which means that the gradient play algorithm (8) is equivalent to running projected gradient ascent on the total potential function Φ, i.e.,
$$\theta^{(t+1)} = \mathrm{Proj}_{\mathcal{X}}\big(\theta^{(t)} + \eta\, \nabla_\theta \Phi(\theta^{(t)})\big), \quad \eta > 0. \qquad (10)$$
To measure convergence to a NE, we define an ε-Nash equilibrium as follows:

Definition 4. (ε-Nash equilibrium) Define the 'NE-gap' of a policy θ as:
$$\text{NE-gap}_i(\theta) := \max_{\theta_i' \in \mathcal{X}_i} J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}); \qquad \text{NE-gap}(\theta) := \max_i \text{NE-gap}_i(\theta).$$
A policy θ is an ε-Nash equilibrium if NE-gap(θ) ≤ ε.

Besides Assumption 1, we also assume that the Markov potential game satisfies the following assumption.

Assumption 2. (Bounded total potential function) For any policy $\theta \in \mathcal{X}$, the total potential function Φ(θ) is bounded: $\Phi_{\min} \le \Phi(\theta) \le \Phi_{\max}$.

We are now ready to state our convergence result.

Theorem 2. (Global convergence to Nash equilibria) Suppose the n-agent MDP is a Markov potential game with potential function φ(s, a), and suppose the total potential function Φ satisfies Assumption 2. Then under the gradient play algorithm (8) we have:
$$\min_{1 \le t \le T} \text{NE-gap}(\theta^{(t)}) \le \epsilon, \quad \text{whenever } T \ge \frac{64 M^2 (\Phi_{\max} - \Phi_{\min})\, |S| \sum_{i=1}^n |A_i|}{(1-\gamma)^3 \epsilon^2},$$

where $M := \max_{\theta, \theta' \in \mathcal{X}} \left\|\frac{d_\theta}{d_{\theta'}}\right\|_\infty$ (by Assumption 1, this quantity is well-defined).

The factor M is also known as the distribution mismatch coefficient, which characterizes how the state visitation distribution varies with the choice of policy. Given an initial state distribution ρ that has positive

measure on every state, M can be bounded by $M \le \frac{1}{1-\gamma}\max_\theta \left\|\frac{d_\theta}{\rho}\right\|_\infty \le \frac{1}{(1-\gamma)\min_s \rho(s)}$. The proof of Theorem 2 is in Appendix D. Our proof structure resembles the proof of convergence for single-agent MDPs in [Agarwal et al., 2020], which leverages classical nonconvex optimization results [Beck, 2017, Ghadimi and Lan, 2016] and gradient domination to obtain a convergence rate of $O\left(\frac{64\gamma^2 |S||A|}{(1-\gamma)^6 \epsilon^2}\left\|\frac{d_\rho^{\theta^*}}{\rho}\right\|_\infty^2\right)$ to the global optimum. In fact, our result matches this bound when there is only one agent (the exponent on $(1-\gamma)$ looks slightly different because some factors are hidden in M and $(\Phi_{\max} - \Phi_{\min})$ in our bound). As a comparison, if we parameterize the n-agent MDP in a centralized way, the size of the action space is $|\mathcal{A}| = \prod_{i=1}^n |A_i|$, which blows up exponentially with the number of agents n. According to the result in [Agarwal et al., 2020], projected gradient ascent then needs $O\left(\frac{|S|\prod_{i=1}^n |A_i|}{\epsilon^2}\right)$ steps to find an ε-optimal policy, whereas we only need $O\left(\frac{|S|\sum_{i=1}^n |A_i|}{\epsilon^2}\right)$ steps to find an ε-NE, which scales linearly with n. However, centralized parameterization provably finds a global optimum, while distributed parameterization can only find a NE. In this sense, distributed parameterization sacrifices some optimality in favor of a smaller parameter space.

Apart from enjoying better convergence properties, Markov potential games are guaranteed, by Proposition 1, to have at least one pure NE. A natural follow-up question is whether gradient play (8) can find pure NEs (either deterministically or with high probability). Since the definition of an ε-NE is silent on the nature of the NE (e.g., pure or mixed), this motivates us to further examine the local geometry around pure and mixed NEs, which contains second-order information that the definition of an ε-NE does not capture.
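The NE-gap of Definition 4 can also be evaluated exactly in small tabular games: fixing $\theta_{-i}$ induces a single-agent MDP for agent i, whose optimal value gives $\max_{\theta_i'} J_i(\theta_i', \theta_{-i})$. The following sketch (again for the two-agent layout assumed in the earlier sketches, with value iteration as an assumed solver) is only meant to make the quantity appearing in Theorem 2 concrete; it is not how the bound itself is derived.

```python
import numpy as np

def ne_gap(P, r, pi_list, rho, gamma, tol=1e-10):
    """NE-gap(theta) of Definition 4, via each agent's exact best response:
    fixing the other agent induces a single-agent MDP, solved by value iteration."""
    gaps = []
    for i in range(2):
        pi_o = pi_list[1 - i]
        if i == 0:
            P_i = np.einsum("sabt,sb->sat", P, pi_o)   # induced P~(s'|s, a_1)
            r_i = np.einsum("sab,sb->sa", r[i], pi_o)  # induced r~(s, a_1)
        else:
            P_i = np.einsum("sabt,sa->sbt", P, pi_o)
            r_i = np.einsum("sab,sa->sb", r[i], pi_o)
        V = np.zeros(P.shape[0])
        while True:                                    # value iteration on the induced MDP
            V_new = np.max(r_i + gamma * np.einsum("sat,t->sa", P_i, V), axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        J_best = rho @ V_new                           # max over theta_i' of J_i(theta_i', theta_-i)
        J_cur = rho @ evaluate_values(P, r, pi_list, gamma)[i]
        gaps.append(J_best - J_cur)
    return max(gaps)
```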

Theorem 3. For Markov potential games, any strict NE θ* is pure. Additionally, a strict NE is equivalent to a strict local maximum of the total potential function Φ, i.e., there exists δ > 0 such that for all $\theta \in \mathcal{X}$ with $\theta \neq \theta^*$ and $\|\theta - \theta^*\| \le \delta$, we have $\Phi(\theta) < \Phi(\theta^*)$.

Since a local maximum is locally asymptotically stable under projected gradient ascent, Theorem 3 suggests that strict NEs are locally stable. In the next section, we generalize this result to the general stochastic game setting. Note that this conclusion does not hold for settings other than stochastic games; for instance, for continuous games, one can use quadratic functions to construct simple counterexamples [Mazumdar et al., 2020].

Theorem 4. For Markov potential games, if θ* is a fully mixed NE and $\Phi_{\min} < \Phi_{\max}$ (i.e., Φ is not a constant function), then θ* is a saddle point of the total potential function Φ, i.e., for all δ > 0 there exists θ with $\|\theta - \theta^*\| \le \delta$ and $\Phi(\theta) > \Phi(\theta^*)$.

Theorem 4 implies that, by applying saddle point escaping techniques (see, e.g., [Ge et al., 2015]), first order methods can avoid convergence to fully mixed NEs. Note, however, that there is a gap between the theorems above and proving convergence to pure NEs, since pure but non-strict NEs as well as non-fully-mixed NEs are not covered by these theorems. Nonetheless, we believe that these preliminary results can serve as a valuable platform towards a better understanding of the problem.

5 Local convergence to strict Nash equilibria for general multi-agent MDPs

In this section, we go beyond Markov potential games and consider general multi-agent MDPs. As seen earlier, global convergence to a NE, especially to mixed NEs, remains problematic even in static game settings. Thus, as a preliminary study, we focus our attention on local stability and the local convergence rate around strict NEs in general multi-agent MDPs.

Lemma 3. For any n-agent MDP defined in (1), any strict NE θ* is pure, meaning that for each i and s there exists one action $a_i^*(s)$ such that $\theta^*_{s,a_i} = 1\{a_i = a_i^*(s)\}$. Additionally,
$$a_i^*(s) = \operatorname{argmax}_{a_i} \bar{A}_i^{\theta^*}(s, a_i), \qquad (11)$$
with
$$\bar{A}_i^{\theta^*}(s, a_i^*(s)) = 0; \qquad \bar{A}_i^{\theta^*}(s, a_i) < 0, \quad \forall a_i \neq a_i^*(s).$$
Lemma 3 states that any strict NE is pure and that for every agent i and state s there is a unique optimal action $a_i^*(s)$ that maximizes the averaged advantage function of agent i (and thus also the averaged Q-function). We can therefore define the following quantities, which will be useful in the next theorem:

$$\Delta_i^{\theta^*}(s) := \min_{a_i \neq a_i^*(s)} \left|\bar{A}_i^{\theta^*}(s, a_i)\right|, \qquad \Delta^{\theta^*} := \min_i \min_s \frac{1}{1-\gamma}\, d_{\theta^*}(s)\, \Delta_i^{\theta^*}(s) > 0. \qquad (12)$$

Theorem 5. (Local finite time convergence around strict NEs) Define the following metric on policy parameters: $D(\theta \| \theta') := \max_{1 \le i \le n} \max_{s \in S} \|\theta_{i,s} - \theta'_{i,s}\|_1$, where $\|\cdot\|_1$ denotes the $\ell_1$-norm. Suppose θ* is a strict Nash equilibrium of an n-agent MDP defined in (1). Then for any $\theta^{(0)}$ such that $D(\theta^{(0)} \| \theta^*) \le \frac{\Delta^{\theta^*}(1-\gamma)^3}{8 n |S| \left(\sum_{i=1}^n |A_i|\right)}$, running gradient play (8) guarantees:
$$D(\theta^{(t+1)} \| \theta^*) \le \max\left\{ D(\theta^{(t)} \| \theta^*) - \frac{\eta \Delta^{\theta^*}}{2},\; 0 \right\},$$
which means that gradient play converges within $\left\lceil \frac{2 D(\theta^{(0)} \| \theta^*)}{\eta \Delta^{\theta^*}} \right\rceil$ steps.

Proofs of Theorem 5 and Lemma 3 can be found in Appendix F. The convergence rate may seem counterintuitive at first glance: convergence requires only finitely many steps, and the stepsize η can be chosen arbitrarily large so that exact convergence happens in one step. However, observe that our analysis is local, in that we assume the initial policy is sufficiently close to θ*. Due to the special geometry around strict NEs, one can indeed choose an arbitrarily large stepsize so that the algorithm converges in one step in this situation. However, for numerical stability, one should still pick a reasonable stepsize to accommodate random initializations. Theorem 5 also shows that the radius of the region of attraction of a strict NE is at least $\frac{\Delta^{\theta^*}(1-\gamma)^3}{8 n |S| \left(\sum_{i=1}^n |A_i|\right)}$, and thus a strict NE θ* with a larger $\Delta^{\theta^*}$ has a larger region of attraction. This intuitively implies that a strict NE with a larger value gap between the optimal action $a_i^*(s)$ and the suboptimal actions has a larger region of attraction.

6 Numerical simulation

Table 1: Reward table for Game 1

            s2 = 1   s2 = 2
  s1 = 1       2        0
  s1 = 2       0        1

Table 2: Reward table for Game 2

            a2 = 1     a2 = 2
  a1 = 1   (-1, -1)   (-3, 0)
  a1 = 2   (0, -3)    (-2, -2)

Game 1: state-based coordination game. Our first numerical example studies the empirical performance of gradient play for an identical-reward Markov potential game. Consider a 2-agent identical-reward coordination game with state space $S = S_1 \times S_2$ and action space $\mathcal{A} = A_1 \times A_2$, where $S_1 = S_2 = A_1 = A_2 = \{1, 2\}$. The state transition probability is given by:
$$P(s_{i,t+1} = 1 \mid a_{i,t} = 1) = 1 - \epsilon, \qquad P(s_{i,t+1} = 1 \mid a_{i,t} = 2) = \epsilon, \quad i = 1, 2.$$
The reward is given by Table 1: agents are rewarded only if they are in the same state, and state 1 gives a higher reward than state 2. Coordination games can be used to model the network effect, where an agent reaps benefits from being in the same network as other agents. For this specific example, there are two networks with different rewards. Agents can observe the occupancy of each network and take actions to join one of the networks based on this observation. There is at least one fully mixed NE, namely joining network 1 with probability $\frac{1-3\epsilon}{3(1-2\epsilon)}$ regardless of the current occupancy of the networks, and there are 13 different pure NEs that can be verified numerically (computation of the NEs as well as detailed settings can be found in the Appendix).

Figure 1 shows a gradient play trajectory whose initial point lies in a close neighborhood of the mixed NE. As the algorithm progresses, the trajectory in Figure 1 diverges from the mixed NE, indicating that the fully mixed NE is indeed a saddle point. This corroborates our finding in Theorem 4. Figure 2 shows the evolution of the total reward $J(\theta^{(t)})$ under gradient play for different random initial points $\theta^{(0)}$. The total reward is monotonically increasing for each initial point, which makes sense since gradient play runs projected gradient ascent on the total reward function J. Since the only NEs in this problem are either fully mixed or pure, the combination of Theorems 2 and 4 indicates that gradient play will avoid the fully mixed NE and converge to one of the 13 pure NEs. In Figure 2, we see different initial points converging to one of the 13 different NEs, each with a different total reward (some strict NEs with relatively small regions of attraction are omitted in the figure). While the total rewards differ, Figure 3 shows that the NE-gap of each trajectory (corresponding to the same initial points as in Figure 2) converges to 0. This suggests that the algorithm is indeed able to converge to a NE. Notice that the NE-gaps do not decrease monotonically.
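As an illustration of the setup only (the paper's exact ε, γ, and initial distribution are specified in its appendix and are not reproduced here), the following sketch constructs the transition tensor and identical reward of Game 1 in the array layout assumed in the earlier sketches; the chosen ε = 0.1, γ = 0.95, and uniform ρ are placeholders.

```python
import numpy as np
from itertools import product

eps, gamma = 0.1, 0.95                            # placeholder values, not the paper's settings

# Enumerate the 4 joint states (s1, s2); index 0 corresponds to "state 1".
states = list(product([0, 1], [0, 1]))
S, A = len(states), 2

def next_local(a):                                # P(s_i' = "state 1" | a_i)
    return 1 - eps if a == 0 else eps

# Transition tensor P[s, a1, a2, s']: each agent's next local state depends only on its own action.
P = np.zeros((S, A, A, S))
for s, (a1, a2) in product(range(S), product(range(A), range(A))):
    for sp, (s1p, s2p) in enumerate(states):
        p1 = next_local(a1) if s1p == 0 else 1 - next_local(a1)
        p2 = next_local(a2) if s2p == 0 else 1 - next_local(a2)
        P[s, a1, a2, sp] = p1 * p2

# Identical reward from Table 1: agents are rewarded only when co-located.
reward_table = np.array([[2.0, 0.0], [0.0, 1.0]])
R = np.zeros((S, A, A))
for s, (s1, s2) in enumerate(states):
    R[s, :, :] = reward_table[s1, s2]
r = [R, R]                                        # identical rewards for both agents
rho = np.ones(S) / S                              # assumed uniform initial distribution
```

These arrays can be passed directly to the gradient_play and ne_gap sketches above.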

Figure 1: (Game 1) Starting from a close neighborhood of a fully mixed NE.
Figure 2: (Game 1) Total reward for multiple runs.
Figure 3: (Game 1) NE-gap for multiple runs.

Table 3: (Game 2) Relationship between the convergence ratio and ε, with γ = 0.95 fixed. The convergence ratio is calculated as (#trials that converge to θ*) / (#total number of trials). Each ratio is calculated using 100 trials, and the mean and standard deviation (std) are computed over 10 such ratios using different trials.

    ε      Δ^{θ*}     ratio (mean ± std)%
   0.1     433.3      (47.8 ± 5.1)%
   0.05    979.3      (66.3 ± 4.3)%
   0.01    2498.6     (77.4 ± 2.8)%

Figure 4: (Game 2) Convergence to the cooperative NE.

Game 2: multi-stage prisoner's dilemma. The second example, the multi-stage prisoner's dilemma model [Arslan and Yüksel, 2016], studies gradient play in a setting other than a Markov potential game.

It is also a 2-agent MDP, with $S = A_1 = A_2 = \{1, 2\}$. Assume that the reward of each agent, $r_i(s, a_1, a_2)$, $i \in \{1,2\}$, is independent of the state s and is given by Table 2. Similar to the state-based coordination game, the state transition probability is determined by the agents' previous actions:

$$P(s_{t+1} = 1 \mid (a_{1,t}, a_{2,t}) = (1, 1)) = 1 - \epsilon, \qquad P(s_{t+1} = 1 \mid (a_{1,t}, a_{2,t}) \neq (1, 1)) = \epsilon.$$

Here action $a_i = 1$ means that agent i chooses to cooperate and $a_i = 2$ means that it betrays. The state s serves as a noisy indicator, with accuracy 1 − ε, of whether or not both agents cooperated in the previous stage. Unlike the identical-reward state-based coordination game, the two agents in this model obtain different rewards when they pick different actions. The single-stage game corresponds to the famous prisoner's dilemma, for which it is well known that there is a unique NE $(a_1, a_2) = (2, 2)$, where both agents decide to betray. The dilemma arises from the fact that there exists a joint non-NE strategy (1, 1) under which both players obtain a higher reward than under the NE. In the multi-stage case, however, the introduction of the state s allows agents to make decisions based on whether they have cooperated before. Intuitively, given that both agents cooperated in the previous stage, it is more beneficial to keep this cooperation than to destroy it by betraying. It turns out that cooperation can be achieved in this manner provided that the two agents are patient (i.e., γ is close to 1) and that the indicator s is accurate enough (i.e., ε is close to 0). Apart from the fully-betray strategy, where both agents betray regardless of s, there is another strict NE θ* with $\theta^*_{s=1,a_i=1} = 1$ and $\theta^*_{s=2,a_i=1} = 0$, where agents cooperate given that they cooperated in the previous stage and betray otherwise.

We simulate gradient play for this model and mainly focus on the convergence to the cooperative equilibrium θ*. The initial policy is set as $\theta^{(0)}_{s=1,a_i=1} = 1 - 0.4\,\delta_i$, $\theta^{(0)}_{s=2,a_i=1} = 0$, where the $\delta_i$'s are uniformly sampled from [0, 1]. This initialization implies that, at the beginning, both agents are willing to cooperate to some extent given that they cooperated in the previous stage. Figure 4 shows a trial converging to the NE starting from a randomly initialized policy. The size of the region of attraction of θ* is reflected by the convergence ratio ((#trials that converge to θ*) / (#total number of trials)) over multiple trials with different initial points. An empirical estimate of the volume of the region is the convergence ratio times the volume of the uniform sampling area; thus, the larger the ratio, the larger the region of attraction. Table 3 shows how the ratio changes with the indicator error ε. Intuitively, the more accurately the state s represents the agents' cooperation status, the less incentive agents have to betray when observing s = 1, and thus the larger the convergence ratio. This intuition matches the simulation results as well as the theoretical guarantee on local convergence around a strict NE in Theorem 5.
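For completeness, a corresponding sketch of Game 2's reward and transition arrays (same assumed layout as the earlier sketches; ε is again a placeholder). The encoding 0 ↔ cooperate, 1 ↔ betray and the state encoding 0 ↔ "both cooperated last stage" are our own conventions.

```python
import numpy as np

eps = 0.1                                           # placeholder indicator error, as in the Game 1 sketch
S2, A2 = 2, 2                                       # states: 0 <-> "cooperated last stage", 1 <-> otherwise
r1 = np.array([[-1.0, -3.0], [0.0, -2.0]])          # r_1(a1, a2) from Table 2 (0 = cooperate, 1 = betray)
r2 = np.array([[-1.0, 0.0], [-3.0, -2.0]])          # r_2(a1, a2)

# State-independent rewards, broadcast to the (S, A1, A2) layout used above.
r_game2 = [np.broadcast_to(r1, (S2, A2, A2)).copy(),
           np.broadcast_to(r2, (S2, A2, A2)).copy()]

# The next state is a noisy indicator of joint cooperation, independent of the current state.
P_game2 = np.zeros((S2, A2, A2, S2))
for a1 in range(A2):
    for a2 in range(A2):
        p_signal = 1 - eps if (a1, a2) == (0, 0) else eps
        P_game2[:, a1, a2, 0] = p_signal            # s' = "both cooperated"
        P_game2[:, a1, a2, 1] = 1 - p_signal
```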

7 Conclusion and Discussion

This paper studies the optimization landscape of multi-agent reinforcement learning from a game-theoretic point of view. Specifically, we study the tabular multi-agent MDP problem and prove that all first order stationary policies are NEs in this setting. A convergence rate analysis is also given for a special subclass of stochastic games called Markov potential games, showing that the number of iterations needed to reach an ε-NE scales linearly with the number of agents. Additionally, the local geometry around strict NEs and fully mixed NEs is studied: we show that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points. We also give a local convergence rate around strict NEs for the more general multi-agent MDP setting. We believe that this is a fruitful research direction with many interesting open questions. For instance, one could explore generalizing our work (which assumes access to exact gradients) to a setting where gradients are estimated from data. Extending our results beyond the direct policy parameterization, to, say, the softmax parameterization (cf. [Agarwal et al., 2020]), is another interesting topic. Other interesting questions include local stability analysis in more general games (beyond Markov potential games), faster algorithm design (via, e.g., natural policy gradient or Gauss-Newton methods), and online algorithm design for stochastic learning.

References

S. Abdallah and V. Lesser. A multiagent reinforcement learning algorithm with non-linear dynamics. Journal of Artificial Intelligence Research, 33:521–549, 2008.

9 A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift, 2020. G. Arslan and S. Yüksel. Decentralized q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2016. A. Beck. First-order methods in optimization. SIAM, 2017. M. Bowling and M. Veloso. An analysis of stochastic game theory for multiagent reinforcement learning. Technical report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science, 2000. M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pages 1021–1026. Citeseer, 2001. L. Bu¸soniu,R. Babuška, and B. De Schutter. Multi-agent reinforcement learning: An overview. Innovations in multi-agent systems and applications-1, pages 183–221, 2010. T. Chen, K. Zhang, G. B. Giannakis, and T. Ba¸sar. Communication-efficient policy gradient methods for distributed reinforcement learning. arXiv preprint arXiv:1812.03239, 2018. C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998. V. P. Crawford. Learning behavior and mixed-strategy nash equilibria. Journal of Economic Behavior & Organization, 6(1):69–78, 1985. F. Daneshfar and H. Bevrani. Load–frequency control: a ga-based multi-agent reinforcement learning. IET generation, transmission & distribution, 4(1):13–26, 2010. W. D. Dechert and S. O’Donnell. The stochastic lake game: A numerical solution. Journal of Economic Dynamics and Control, 30(9-10):1569–1587, 2006. M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for linearized control problems. CoRR, abs/1801.05039, 2018. URL http://arxiv.org/abs/ 1801.05039. J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017. R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on learning theory, pages 797–842. PMLR, 2015. S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016. D. González-Sánchez and O. Hernández-Lerma. Discrete–time stochastic control and dynamic potential games: the Euler–Equation approach. Springer Science & Business Media, 2013. J. Hu and M. P. Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003. J. S. Jordan. Three problems in learning mixed-strategy nash equilibria. Games and Economic Behavior, 5(3):368–386, 1993. S. M. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In C. Sammut and A. G. Hoffmann, editors, Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, pages 267–274. Morgan Kaufmann, 2002. E. Kohlberg and J.-F. Mertens. On the strategic stability of equilibria. Econometrica: Journal of the Econometric Society, pages 1003–1037, 1986. V. Krishna and T. Sjöström. On the convergence of fictitious play. Mathematics of , 23(2):479–511, 1998.

10 M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Grae- pel. A unified game-theoretic approach to multiagent reinforcement learning. arXiv preprint arXiv:1711.00832, 2017. D. Levhari and L. Mirman. The great fish war: An example using a dynamic cournot-nash solution. Bell Journal of Economics, 11(1):322–334, 1980. Y. Li, Y. Tang, R. Zhang, and N. Li. Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach, 2019. M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994. S. V. Macua, J. Zazo, and S. Zazo. Learning parametric closed-loop policies for markov potential games. CoRR, abs/1802.00899, 2018. URL http://arxiv.org/abs/1802.00899. J. R. Marden, G. Arslan, and J. S. Shamma. Joint strategy fictitious play with inertia for potential games. IEEE Transactions on Automatic Control, 54(2):208–220, 2009. E. Mazumdar, L. J. Ratliff, and S. S. Sastry. On gradient-based learning in continuous games. SIAM Journal on Mathematics of Data Science, 2(1):103–131, 2020. J. Mei, C. Xiao, C. Szepesvari, and D. Schuurmans. On the global convergence rates of softmax policy gradient methods. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6820–6829. PMLR, 13–18 Jul 2020. D. Monderer and A. Sela. and no-cycling conditions. 1997. D. Monderer and L. S. Shapley. Potential games. Games and economic behavior, 14(1):124–143, 1996a. D. Monderer and L. S. Shapley. Fictitious play property for games with identical interests. Journal of economic theory, 68(1):258–265, 1996b. L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387–434, 2005. G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Learning for Dynamics and Control, pages 256–266. PMLR, 2020. S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. ArXiv, abs/1610.03295, 2016. J. S. Shamma and G. Arslan. Dynamic fictitious play, dynamic gradient play, and distributed convergence to nash equilibria. IEEE Transactions on Automatic Control, 50(3):312–327, 2005. doi: 10.1109/TAC.2005.843878. L. Shapley. Some topics in two-person games. Advances in game theory, 52:1–29, 1964. L. S. Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953. Y. Shoham, R. Powers, and T. Grenager. Multi-agent reinforcement learning: a critical survey. Technical report, Technical report, Stanford University, 2003. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249. R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for rein- forcement learning with function approximation. In NIPs, volume 99, pages 1057–1063. Citeseer, 1999. M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.

11 G. Tesauro. Extending q-learning to general adaptive multi-agent systems. Advances in neural information processing systems, 16:871–878, 2003. E. Van Damme. Stability and perfection of Nash equilibria, volume 339. Springer, 1991. D. A. Vidhate and P. Kulkarni. Cooperative multi-agent reinforcement learning models (cmrlm) for intelligent traffic control. In 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), pages 325–331. IEEE, 2017. H.-T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double aver- aging primal-dual optimization. NIPS’18, page 9672–9683, Red Hook, NY, USA, 2018. Curran Associates Inc. W. Wang and M. Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. CoRR, abs/1309.1541, 2013. URL http: //arxiv.org/abs/1309.1541. X. Xu, Y. Jia, Y. Xu, Z. Xu, S. Chai, and C. S. Lai. A multi-agent reinforcement learning-based data-driven method for home energy management. IEEE Transactions on Smart Grid, 11(4): 3201–3211, 2020. C. Zhang and V. Lesser. Multi-agent learning with policy prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010. K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872– 5881. PMLR, 2018. K. Zhang, Z. Yang, and T. Ba¸sar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019a. K. Zhang, Z. Yang, and T. Basar. Policy optimization provably converges to nash equilibria in zero- sum linear quadratic games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Cur- ran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/ 5446f217e9504bc593ad9dcf2ec88dda-Paper.pdf.

A Proof of Lemma 1

Proof. By the policy gradient theorem (5),
$$\frac{\partial J_i(\theta)}{\partial \theta_{s,a_i}} = \frac{1}{1-\gamma} \sum_{s'} \sum_{a'} d_\theta(s')\, \pi_\theta(a'|s')\, \frac{\partial \log \pi_\theta(a'|s')}{\partial \theta_{s,a_i}}\, Q_i^\theta(s', a').$$
For the direct parameterization,
$$\frac{\partial \log \pi_\theta(a'|s')}{\partial \theta_{s,a_i}} = \frac{\partial \log \pi_{\theta_i}(a_i'|s')}{\partial \theta_{s,a_i}} = 1\{a_i' = a_i, s' = s\}\, \frac{1}{\theta_{s,a_i}} = 1\{a_i' = a_i, s' = s\}\, \frac{1}{\pi_{\theta_i}(a_i|s)}.$$
Thus we have
$$\begin{aligned}
\frac{\partial J_i(\theta)}{\partial \theta_{s,a_i}} &= \frac{1}{1-\gamma} \sum_{s'} \sum_{a'} d_\theta(s')\, \pi_\theta(a'|s')\, 1\{a_i' = a_i, s' = s\}\, \frac{1}{\pi_{\theta_i}(a_i|s)}\, Q_i^\theta(s, a') \\
&= \frac{1}{1-\gamma} \sum_{a'_{-i}} d_\theta(s)\, \pi_{\theta_i}(a_i|s)\, \pi_{\theta_{-i}}(a'_{-i}|s)\, \frac{1}{\pi_{\theta_i}(a_i|s)}\, Q_i^\theta(s, a_i, a'_{-i}) \\
&= \frac{1}{1-\gamma}\, d_\theta(s) \sum_{a'_{-i}} \pi_{\theta_{-i}}(a'_{-i}|s)\, Q_i^\theta(s, a_i, a'_{-i}) \\
&= \frac{1}{1-\gamma}\, d_\theta(s)\, \bar{Q}_i^\theta(s, a_i). \qquad (13)
\end{aligned}$$

B Proofs of Lemma 2 and Theorem 1

Before giving the proofs of Lemma 2 and Theorem 1, we first introduce the well-known performance difference lemma [Kakade and Langford, 2002] from the RL literature, which will be helpful throughout. A proof is also provided for completeness.

Lemma 4. (Performance difference lemma) For policies $\pi_\theta, \pi_{\theta'}$,
$$J_i(\theta') - J_i(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_{\theta'}} \mathbb{E}_{a \sim \pi_{\theta'}}\, A_i^\theta(s,a), \quad i = 1, 2, \dots, n. \qquad (14)$$

Proof. (of the performance difference lemma) From Bellman's equation we have:
$$\begin{aligned}
J_i(\theta') - J_i(\theta) &= \mathbb{E}_{s_0 \sim \rho} V_i^{\theta'}(s_0) - \mathbb{E}_{s_0 \sim \rho} V_i^{\theta}(s_0) \\
&= \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_{\theta'}(\cdot|s)} Q_i^\theta(s,a) - \mathbb{E}_{s \sim \rho} V_i^\theta(s) + \mathbb{E}_{s \sim \rho} V_i^{\theta'}(s) - \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_{\theta'}(\cdot|s)} Q_i^\theta(s,a) \\
&= \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_{\theta'}(\cdot|s)} A_i^\theta(s,a) + \gamma\, \mathbb{E}_{s_0 \sim \rho,\, s \sim \Pr^{\theta'}(s_1 = \cdot \mid s_0)} \left[ V_i^{\theta'}(s) - V_i^\theta(s) \right] \\
&= \mathbb{E}_{s \sim \rho} \mathbb{E}_{a \sim \pi_{\theta'}(\cdot|s)} A_i^\theta(s,a) + \gamma\, \mathbb{E}_{s_0 \sim \rho,\, s \sim \Pr^{\theta'}(s_1 = \cdot \mid s_0)} \mathbb{E}_{a \sim \pi_{\theta'}(\cdot|s)} A_i^\theta(s,a) \\
&\qquad + \gamma^2\, \mathbb{E}_{s_0 \sim \rho,\, s \sim \Pr^{\theta'}(s_2 = \cdot \mid s_0)} \left[ V_i^{\theta'}(s) - V_i^\theta(s) \right] \\
&= \cdots \\
&= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{s_0 \sim \rho,\, s \sim \Pr^{\theta'}(s_t = \cdot \mid s_0)} \mathbb{E}_{a \sim \pi_{\theta'}} A_i^\theta(s,a) \\
&= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_{\theta'}} \mathbb{E}_{a \sim \pi_{\theta'}} A_i^\theta(s,a).
\end{aligned}$$

Proof. (Lemma 2) By the performance difference lemma (14),
$$\begin{aligned}
J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}) &= \frac{1}{1-\gamma} \sum_{s,a} d_{\theta'}(s)\, \pi_{\theta'}(a|s)\, A_i^\theta(s,a) \\
&= \frac{1}{1-\gamma} \sum_{s,a_i} d_{\theta'}(s)\, \pi_{\theta_i'}(a_i|s) \sum_{a_{-i}} \pi_{\theta_{-i}}(a_{-i}|s)\, A_i^\theta(s, a_i, a_{-i}) \\
&= \frac{1}{1-\gamma} \sum_{s,a_i} d_{\theta'}(s)\, \pi_{\theta_i'}(a_i|s)\, \bar{A}_i^\theta(s, a_i). \qquad (15)
\end{aligned}$$
By the definition of the 'averaged' advantage function,
$$\sum_{a_i} \pi_{\theta_i}(a_i|s)\, \bar{A}_i^\theta(s,a_i) = 0, \quad \forall s \in S,$$
which implies
$$\max_{a_i \in A_i} \bar{A}_i^\theta(s,a_i) \ge 0.$$
Thus we have
$$\begin{aligned}
J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}) &= \frac{1}{1-\gamma} \sum_{s,a_i} d_{\theta'}(s)\, \pi_{\theta_i'}(a_i|s)\, \bar{A}_i^\theta(s, a_i) \\
&\le \frac{1}{1-\gamma} \sum_{s} d_{\theta'}(s)\, \max_{a_i \in A_i} \bar{A}_i^\theta(s, a_i) \\
&= \frac{1}{1-\gamma} \sum_{s} d_\theta(s)\, \frac{d_{\theta'}(s)}{d_\theta(s)}\, \max_{a_i \in A_i} \bar{A}_i^\theta(s, a_i) \\
&\le \left\|\frac{d_{\theta'}}{d_\theta}\right\|_\infty \frac{1}{1-\gamma} \sum_{s} d_\theta(s)\, \max_{a_i \in A_i} \bar{A}_i^\theta(s, a_i). \qquad (16)
\end{aligned}$$
We can rewrite $\frac{1}{1-\gamma} \sum_s d_\theta(s) \max_{a_i \in A_i} \bar{A}_i^\theta(s,a_i)$ as
$$\begin{aligned}
\frac{1}{1-\gamma} \sum_s d_\theta(s) \max_{a_i \in A_i} \bar{A}_i^\theta(s,a_i) &= \max_{\bar\theta_i \in \mathcal{X}_i} \frac{1}{1-\gamma} \sum_{s,a_i} d_\theta(s)\, \pi_{\bar\theta_i}(a_i|s)\, \bar{A}_i^\theta(s,a_i) \\
&= \max_{\bar\theta_i \in \mathcal{X}_i} \sum_{s,a_i} \left(\pi_{\bar\theta_i}(a_i|s) - \pi_{\theta_i}(a_i|s)\right) \frac{1}{1-\gamma}\, d_\theta(s)\, \bar{A}_i^\theta(s,a_i) \\
&= \max_{\bar\theta_i \in \mathcal{X}_i} \sum_{s,a_i} \left(\pi_{\bar\theta_i}(a_i|s) - \pi_{\theta_i}(a_i|s)\right) \frac{1}{1-\gamma}\, d_\theta(s)\, \bar{Q}_i^\theta(s,a_i) \\
&= \max_{\bar\theta_i \in \mathcal{X}_i} (\bar\theta_i - \theta_i)^\top \nabla_{\theta_i} J_i(\theta), \qquad (17)
\end{aligned}$$
where the second equality uses $\sum_{a_i}\pi_{\theta_i}(a_i|s)\bar{A}_i^\theta(s,a_i) = 0$ and the third uses $\sum_{a_i}(\pi_{\bar\theta_i}(a_i|s) - \pi_{\theta_i}(a_i|s))\, V_i^\theta(s) = 0$. Substituting this into (16), we conclude that
$$J_i(\theta_i', \theta_{-i}) - J_i(\theta_i, \theta_{-i}) \le \left\|\frac{d_{\theta'}}{d_\theta}\right\|_\infty \max_{\bar\theta_i \in \mathcal{X}_i} (\bar\theta_i - \theta_i)^\top \nabla_{\theta_i} J_i(\theta),$$
and this completes the proof.

Proof. (Theorem 1) The definition of a Nash equilibrium naturally implies first order stationarity, because for any $\theta_i \in \mathcal{X}_i$,
$$J_i\big((1-\eta)\theta_i^* + \eta\theta_i,\, \theta_{-i}^*\big) - J_i(\theta_i^*, \theta_{-i}^*) = \eta\,(\theta_i - \theta_i^*)^\top \nabla_{\theta_i} J_i(\theta^*) + o(\eta\|\theta_i - \theta_i^*\|) \le 0, \quad \forall \eta > 0.$$
Letting $\eta \to 0$ gives the first order stationarity condition:
$$(\theta_i - \theta_i^*)^\top \nabla_{\theta_i} J_i(\theta^*) \le 0, \quad \forall \theta_i \in \mathcal{X}_i.$$

It remains to be shown that all first order stationary policies are Nash equilibria. From Assumption 1 we know that for any pair of parameters $\theta', \theta^*$, $\left\|\frac{d_{\theta'}}{d_{\theta^*}}\right\|_\infty < +\infty$. Take $\theta' = (\theta_i', \theta_{-i}^*)$ and $\theta^* = (\theta_i^*, \theta_{-i}^*)$. By Lemma 2, for any first order stationary policy θ*,
$$J_i(\theta_i', \theta_{-i}^*) - J_i(\theta_i^*, \theta_{-i}^*) \le \left\|\frac{d_{\theta'}}{d_{\theta^*}}\right\|_\infty \max_{\bar\theta_i \in \mathcal{X}_i} (\bar\theta_i - \theta_i^*)^\top \nabla_{\theta_i} J_i(\theta^*) \le 0,$$
which completes the proof.

C Proof of Proposition 1

Proof. First, from the definition of a NE, the global maximum of the total potential function is a NE. We now show that this global maximum is attained by a deterministic policy. From classical results (e.g. [Sutton and Barto, 2018]) we know that there is an optimal deterministic centralized policy $\pi^*(a = (a_1, \dots, a_n)|s) = 1\{a = a^*(s) = (a_1^*(s), \dots, a_n^*(s))\}$ that maximizes the discounted sum of the potential function, i.e.,
$$\pi^* = \operatorname{argmax}_{\pi: S \to \Delta(\mathcal{A})}\; \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t, a_t) \,\Big|\, \pi, s_0 = s\right].$$
We now show that this centralized policy can also be represented by the direct distributed policy parameterization. Set θ* such that
$$\pi_{\theta_i^*}(a_i|s) = 1\{a_i = a_i^*(s)\};$$
then
$$\pi^*(a|s) = \prod_{i=1}^n \pi_{\theta_i^*}(a_i|s).$$
Since π* globally maximizes the discounted sum of the potential function φ among centralized policies, which include all possible direct distributedly parameterized policies, θ* also globally maximizes the total potential function Φ among all direct distributed parameterizations, which completes the proof.

D Proof of Theorem 2

Lemma 5. Let f(θ) be β-smooth in θ, and define the gradient mapping
$$G^\eta(\theta) := \frac{1}{\eta}\left(\mathrm{Proj}_{\mathcal{X}}(\theta + \eta\nabla f(\theta)) - \theta\right).$$
The projected gradient update is
$$\theta^+ = \theta + \eta\, G^\eta(\theta) = \mathrm{Proj}_{\mathcal{X}}(\theta + \eta \nabla f(\theta)).$$
Then
$$(\theta' - \theta^+)^\top \nabla f(\theta^+) \le (1 + \eta\beta)\, \|G^\eta(\theta)\|\, \|\theta' - \theta^+\|, \quad \forall \theta' \in \mathcal{X}.$$

Proof. By a standard property of Euclidean projections onto a convex set, $(\theta + \eta\nabla f(\theta) - \theta^+)^\top(\theta' - \theta^+) \le 0$, and therefore
$$\begin{aligned}
&\eta \nabla f(\theta)^\top (\theta' - \theta^+) + (\theta - \theta^+)^\top(\theta' - \theta^+) \le 0 \\
\Longrightarrow\;& \eta \nabla f(\theta)^\top (\theta' - \theta^+) - \eta\, G^\eta(\theta)^\top(\theta' - \theta^+) \le 0 \\
\Longrightarrow\;& \nabla f(\theta)^\top (\theta' - \theta^+) \le \|G^\eta(\theta)\|\,\|\theta' - \theta^+\| \\
\Longrightarrow\;& \nabla f(\theta^+)^\top (\theta' - \theta^+) \le \|G^\eta(\theta)\|\,\|\theta' - \theta^+\| + (\nabla f(\theta^+) - \nabla f(\theta))^\top(\theta' - \theta^+) \\
&\phantom{\nabla f(\theta^+)^\top (\theta' - \theta^+)} \le \|G^\eta(\theta)\|\,\|\theta' - \theta^+\| + \beta\,\|\theta^+ - \theta\|\,\|\theta' - \theta^+\| = (1 + \eta\beta)\, \|G^\eta(\theta)\|\,\|\theta' - \theta^+\|.
\end{aligned}$$

Proof. (of Theorem 2) Recall the definition of the gradient mapping
$$G^\eta(\theta) = \frac{1}{\eta}\left(\mathrm{Proj}_{\mathcal{X}}(\theta + \eta\nabla \Phi(\theta)) - \theta\right).$$
From Lemma 7, Φ(θ) is β-smooth with $\beta = \frac{2}{(1-\gamma)^3}\sum_i |A_i|$. Then from a standard result (see Theorem 10.15 in [Beck, 2017] or Theorem E.1 in [Agarwal et al., 2020]) we have, for $\eta = 1/\beta$,
$$\min_{t=0,1,\dots,T-1} \|G^\eta(\theta^{(t)})\| \le \frac{\sqrt{2\beta(\Phi_{\max} - \Phi_{\min})}}{\sqrt{T}}. \qquad (18)$$
From the gradient domination property (7) we have
$$\begin{aligned}
\text{NE-gap}_i(\theta^{(t+1)}) &= \max_{\theta_i' \in \mathcal{X}_i} J_i(\theta_i', \theta_{-i}^{(t+1)}) - J_i(\theta_i^{(t+1)}, \theta_{-i}^{(t+1)}) \\
&\le \max_{\theta_i' \in \mathcal{X}_i} \left\|\frac{d_{(\theta_i', \theta_{-i}^{(t+1)})}}{d_{\theta^{(t+1)}}}\right\|_\infty \max_{\bar\theta_i \in \mathcal{X}_i} \left(\bar\theta_i - \theta_i^{(t+1)}\right)^\top \nabla_{\theta_i} J_i(\theta^{(t+1)}) \\
&\le M \max_{\bar\theta_i \in \mathcal{X}_i} \left(\bar\theta_i - \theta_i^{(t+1)}\right)^\top \nabla_{\theta_i} \Phi(\theta^{(t+1)}) \\
&\le M (1 + \eta\beta) \max_{\bar\theta_i \in \mathcal{X}_i} \left\|\bar\theta_i - \theta_i^{(t+1)}\right\| \left\|G^\eta(\theta^{(t)})\right\| \\
&\le M (1 + \eta\beta)\, 2\sqrt{|S|}\, \left\|G^\eta(\theta^{(t)})\right\|,
\end{aligned}$$
where the last step follows from $\|\bar\theta_i - \theta_i^{(t+1)}\| \le 2\sqrt{|S|}$. Using (18) and $\eta\beta = 1$, we have
$$\min_{t=0,1,\dots,T-1} \text{NE-gap}(\theta^{(t+1)}) \le 4M\sqrt{|S|}\, \frac{\sqrt{2\beta(\Phi_{\max} - \Phi_{\min})}}{\sqrt{T}}.$$
We obtain the required bound of ε if we set
$$4M\sqrt{|S|}\, \frac{\sqrt{2\beta(\Phi_{\max} - \Phi_{\min})}}{\sqrt{T}} \le \epsilon,$$
or equivalently
$$T \ge \frac{32 M^2 \beta (\Phi_{\max} - \Phi_{\min}) |S|}{\epsilon^2} = \frac{64 M^2 (\Phi_{\max} - \Phi_{\min}) |S| \sum_i |A_i|}{\epsilon^2 (1-\gamma)^3},$$
which completes the proof.

E Proofs of Theorems 3 and 4

Proof. (Theorem 3) The proof requires knowledge of Lemma 3 in Section 5 thus we would recom- mend readers to first go through Lemma 3 first. The lemma immediately leads to the conclusion that ∗ ∗ θ∗ θ∗ a strict NE θ should be deterministic. Let ai (s), ∆i (s), ∆i be the same definition as (11)(12) respectively. For any θ ∈ X , Taylor expansion suggests that: Φ(θ) − Φ(θ∗) = (θ − θ∗)>∇Φ(θ∗) + o(kθ − θ∗k) X ∗ > ∗ ∗ = (θi − θi ) ∇θi Ji(θ ) + o(kθ − θ k) i

1 X X X θ∗ ∗ ∗ = dθ∗ (s)A (s, ai)(θs,a − θ ) + o(kθ − θ k) 1 − γ i i s,ai i s ai   1 X X θ∗ X ∗ ∗ ∗ ≤ − dθ (s)∆i (s)  (θs,ai − θs,a ) + o(kθ − θ k) 1 − γ i i s ∗ ai6=ai (s)

16 1 X X θ∗ 1 ∗ ∗ = − d ∗ (s)∆ (s) kθ − θ k + o(kθ − θ k) 1 − γ θ i 2 i,s i,s 1 i s ∗ ∆θ X X ≤ − kθ − θ∗ k + o(kθ − θ∗k) 2 i,s i,s 1 i s ∗ ∆θ ≤ − kθ − θ∗k + o(kθ − θ∗k). 2 Thus for kθ − θ∗k sufficiently small, Φ(θ) − Φ(θ∗) < 0 holds, this suggests that strict NEs are strict local maxima. We now show that this also holds vice versa. Strict local maxima satisfy first order stationarity by definition, and thus by Theorem 1 they are also NEs, we only need to show that they are strict. We prove by contradiction, suppose that there exists a ∗ 0 0 ∗ local maximum θ such that it is non-strict NE, i.e., there exists θi ∈ Xi, θi 6= θi such that: 0 ∗ ∗ ∗ Ji(θi, θ−i) = Ji(θi , θ−i) According to (17) and first order stationarity of θ∗:

1 X θ∗ ∗ > ∗ ∗ dθ (s) max Ai (s, ai) = max (θi − θi ) ∇θi Ji(θ ) ≤ 0. 1 − γ ai∈Ai s θi∈Xi

θ Since maxai∈Ai Ai (s, ai) ≥ 0 for all θ, we may conclude:

θ∗ max Ai (s, ai) = 0, ∀ s ∈ S. ai∈Ai 0 0 We denote θ := (θi, θ−i∗ ), according to (15)

0 ∗ ∗ ∗ 1 X θ∗ 0 = Ji(θ , θ ) − Ji(θ , θ ) = dθ0 (s)πθ0 (ai|s)A (s, ai) ≤ 0. i −i i −i 1 − γ i i s,ai

Since dθ0 (s) > 0, ∀ s, this further implies that

X θ∗ π 0 (a |s)A (s, a ) = 0, ∀ s ∈ S, θi i i i ai

θ∗ η 0 ∗ i.e., π 0 (a |s) is nonzero only if A (s, a ) = 0. Define θ := ηθ + (1 − η)θ , then θi i i i i i i

X θ∗ π η (a |s)A (s, a ) = 0, ∀ s ∈ S. θi i i i ai η η ∗ Thus let θ := (θi , θ−i)

η ∗ ∗ ∗ 1 X θ∗ Ji(θ , θ−i) − Ji(θi , θ−i) = dθη (s)πθη (ai|s)A (s, ai) = 0. i 1 − γ i i s,ai η ∗ ∗ Since kθi − θi k → 0 as η → 0, this contradicts the assumption that θ is a strict local maximum. This suggests that all strict local maxima are strict NEs, which completes the proof.

Proof. (Theorem 4) First, we define the corresponding value function, Q-function and advantage function for potential function φ. " ∞ # θ X t Vφ (s) := E γ φ(st, at) π = θ, s0 = s t=0 " ∞ # θ X t Qφ(s, a) := γ φ(st, at) π = θ, s0 = s, a0 = a t=0 θ θ θ Aφ(s, a) := Qφ(s, a) − Vφ (s).

17 For an index set I ⊆ {1, 2, . . . , n} we define the following averaged advantage potential function of index set I as: θ X θ Aφ,I (s, aI ) := Aφ(s, aI , a−I ). a−I ∗ ∗ We choose an index set I ⊆ {1, 2, . . . , n} such that there exists s , aI such that: θ∗ ∗ ∗ Aφ,I (s , aI ) > 0, (19) and that for any other index set I0 with smaller cardinality: θ∗ 0 Aφ,I0 (s, aI0 ) ≤ 0, ∀ s, aI0 , ∀ |I | < |I|. (20) Because Φ is not a constant, this guarantees the existence of such an index set I. Further, since X θ∗ πθ∗ (aI0 |s)A 0 (s, aI0 ) = 0, ∀ s, I0 φ,I aI0 and that θ∗ is fully-mixed, we have that: θ∗ 0 Aφ,I0 (s, aI0 ) = 0, ∀ s, aI0 , ∀ |I | < |I|. (21) ∗ ∗ 0 We set θ := (θI , θ−I ), where θI is a convex combination of θI , θI ∈ X : ∗ 0 θI = (1 − η)θI + ηθI , η > 0. According to performance difference lemma (14) we have: ∗ ∗ ∗  X θ∗ (1 − γ) Φ(θI , θ−I ) − Φ(θI , θ−I ) = dθ(s)πθI (aI |s)Aφ,I (s, aI ) s,aI X Y  θ∗ = d (s) (1 − η)π ∗ (a |s) + ηπ 0 (a |s) A (s, a ) θ θi i θi i φ,I I s,aI i∈I

X   Y ∗ ∗ 0 ∗ 0  θ = dθ(s) (1−η)πθ (ai0 |s) + ηπθ (ai0 |s) (1−η)πθ (ai|s) + ηπθ (ai|s) A (s, aI ), (∀ i0 ∈ I) i0 i0 i i φ,I s,aI i∈I\{i0} X Y  θ∗ = (1 − η) dθ(s) (1−η)πθ∗ (ai|s) + ηπ 0 (ai|s) A (s, a ) i θi φ,I\{i0} I\{i0} s,aI i∈I\{i0} X Y ∗ 0 ∗ 0  θ + η dθ(s)πθ (ai0 |s) (1 − η)πθ (ai|s) + ηπθ (ai|s) A (s, aI ). i0 i i φ,I s,aI i∈I\{i0} According to (21), we know that: Aθ∗ (s, a ) = 0, φ,I\{i0} I\{i0} thus ∗ ∗ ∗  (1 − γ) Φ(θI , θ−I ) − Φ(θI , θ−I ) = X Y ∗ 0 ∗ 0  θ η dθ(s)πθ (ai0 |s) (1 − η)πθ (ai|s) + ηπθ (ai|s) A (s, aI ). i0 i i φ,I s,aI i∈I\{i0} Applying similar procedures recursively and using the fact that: θ∗ Aφ,I\{i}(s, aI\{i}) = 0, ∀ i ∈ I, we get: |I| ∗ ∗ ∗ η X Y θ∗ Φ(θI , θ ) − Φ(θ , θ ) = dθ(s) πθ0 (ai|s)A (s, aI ). −I I −I 1 − γ i φ,I s,aI i∈I

Set π 0 (a |s) as: θi i  ∗ ∗ 1 ai = ai πθ0 (ai|s ) = i 0 otherwise ∗ π 0 (a |s) = π ∗ (a |s), s 6= s , θi i θi i ∗ ∗ where s , ai are defined in (19). Then: η|I| Φ(θ , θ∗ ) − Φ(θ∗ , θ∗ ) = d (s∗)Aθ∗ (s∗, a∗ ) > 0, I −I I −I 1 − γ θ φ,I I which completes the proof.

F Proof of Theorem 5

Proof. (Lemma 3) For a given strict NE θ∗ randomly set:

∗ θ∗ ai (s) ∈ argmax Ai (s, ai), ai and set θi be: ∗ θs,ai = 1{ai = ai (s)}. ∗ And set θ := (θi, θ−i) From performance difference lemma (14):

∗ ∗ ∗ X θ∗ Ji(θi, θ−i) − Ji(θi , θ−i) = dθ(s)πθi (s, ai)Ai (s, ai) s,ai X θ∗ = dθ(s) max Ai (s, ai) ≥ 0 ai s

∗ ∗ θ∗ Because θ is a strict NE, thus the inequality above forces θi = θ, and that maxai Ai (s, ai) = 0. ∗ ∗ The uniqueness of θ also implies uniqueness of ai (s), and thus,

θ∗ ∗ Ai (s, ai) < 0, ∀ ai 6= ai (s), which completes the proof of the lemma.

The proof of Theorem 5 relies on the following auxiliary lemma, whose proof we defer to Appendix H. Lemma 6. Let X denote the probability simplex of dimension n. Suppose θ ∈ X , g ∈ Rn and that there exists i∗ ∈ {1, 2, . . . , n} and ∆ > 0 such that: ∗ θi∗ ≥ θi, ∀i 6= i ∗ gi∗ ≥ gi + ∆, ∀i 6= i . Let 0 θ = P rojX (θ + g), then: 0 ∆ θ ∗ ≥ min{1, θ ∗ + } i i 2

Proof. (Theorem 5) For a fixed agent i and state s, the gradient play (8) update rule of policy θi,s is given by: (t+1) (t) η θ(t) θ = P roj (θ + d (t) (s)Q (s, ·)), (22) i,s ∆(|Ai|) i,s 1 − γ θ i

θ(t) where ∆(|Ai|) denotes the probability simplex in |Ai|-th dimension and Qi (s, ·)) is a |Ai|-th θ(t) dimensional vector with ai-th element equals to Qi (s, ai)). We will show that this update rule satisfies the conditions in Lemma 6, which will then allow us to prove that

∗ η∆θ D(θ(t+1)||θ∗) ≤ max{0,D(θ(t)||θ∗) − }. 2 ∗ Letting ai (s) be the same definition as (11), we have that:

$$\begin{aligned}
&\frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i^*(s)) - \frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i)\\
&\ge \frac{1}{1-\gamma}\, d_{\theta^*}(s)\, Q_i^{\theta^*}(s, a_i^*(s)) - \frac{1}{1-\gamma}\, d_{\theta^*}(s)\, Q_i^{\theta^*}(s, a_i)\\
&\quad- \bigg|\frac{1}{1-\gamma}\, d_{\theta^*}(s)\, Q_i^{\theta^*}(s, a_i^*(s)) - \frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i^*(s))\bigg|\\
&\quad- \bigg|\frac{1}{1-\gamma}\, d_{\theta^*}(s)\, Q_i^{\theta^*}(s, a_i) - \frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i)\bigg|\\
&\ge \frac{1}{1-\gamma}\, d_{\theta^*}(s)\big(A_i^{\theta^*}(s, a_i^*(s)) - A_i^{\theta^*}(s, a_i)\big) - 2\big\|\nabla_{\theta_i} J_i(\theta^{(t)}) - \nabla_{\theta_i} J_i(\theta^*)\big\| \qquad (23)\\
&\ge \Delta^{\theta^*} - \frac{4}{(1-\gamma)^3}\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)\big\|\theta^{(t)} - \theta^*\big\| \qquad (24)\\
&\ge \Delta^{\theta^*} - \frac{4}{(1-\gamma)^3}\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)\sum_{i=1}^n\sum_s \big\|\theta^{(t)}_{i,s} - \theta^*_{i,s}\big\|_1\\
&\ge \Delta^{\theta^*} - \frac{4}{(1-\gamma)^3}\, n|\mathcal{S}|\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)\, D(\theta^{(t)}\|\theta^*),
\end{aligned}$$
where (23) to (24) uses the smoothness property in Lemma 7. We proceed by induction: suppose that for every $\ell \le t-1$,
$$D(\theta^{(\ell+1)}\|\theta^*) \le \max\Big\{D(\theta^{(\ell)}\|\theta^*) - \frac{\eta\Delta^{\theta^*}}{2},\ 0\Big\},$$
thus
$$D(\theta^{(t)}\|\theta^*) \le D(\theta^{(0)}\|\theta^*) \le \frac{\Delta^{\theta^*}(1-\gamma)^3}{8\, n|\mathcal{S}|\big(\sum_{i=1}^n |\mathcal{A}_i|\big)}.$$
Then we can further conclude that:

$$\frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i^*(s)) - \frac{1}{1-\gamma}\, d_{\theta^{(t)}}(s)\, Q_i^{\theta^{(t)}}(s, a_i) \ge \Delta^{\theta^*} - \frac{4}{(1-\gamma)^3}\, n|\mathcal{S}|\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)\, D(\theta^{(t)}\|\theta^*) \ge \frac{\Delta^{\theta^*}}{2}, \quad \forall\, a_i \neq a_i^*(s).$$
Additionally, for $D(\theta^{(t)}\|\theta^*) \le \frac{\Delta^{\theta^*}(1-\gamma)^3}{8 n|\mathcal{S}|\left(\sum_{i=1}^n |\mathcal{A}_i|\right)}$ we may conclude that
$$\theta^{(t)}_{s,a_i^*(s)} \ge 1/2 \ge \theta^{(t)}_{s,a_i}, \quad \forall\, a_i \neq a_i^*(s),$$
then by applying Lemma 6 to (22) we have:
$$\theta^{(t+1)}_{s,a_i^*(s)} \ge \min\Big\{1,\ \theta^{(t)}_{s,a_i^*(s)} + \frac{\eta\Delta^{\theta^*}}{4}\Big\}$$
$$\Longrightarrow\ \big\|\theta^{(t+1)}_{i,s} - \theta^*_{i,s}\big\|_1 = 2\Big(1 - \theta^{(t+1)}_{s,a_i^*(s)}\Big) \le \max\Big\{0,\ \big\|\theta^{(t)}_{i,s} - \theta^*_{i,s}\big\|_1 - \frac{\eta\Delta^{\theta^*}}{2}\Big\}, \quad \forall\, s \in \mathcal{S},\ i = 1, 2, \dots, n$$
$$\Longrightarrow\ D(\theta^{(t+1)}\|\theta^*) \le \max\Big\{0,\ D(\theta^{(t)}\|\theta^*) - \frac{\eta\Delta^{\theta^*}}{2}\Big\},$$
which completes the proof.
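To make the per-state update (22) concrete, the following Python sketch performs one gradient-play step for a single agent $i$ and state $s$. It is only an illustration under stated assumptions: the caller supplies the exact visitation weight $d_{\theta^{(t)}}(s)$, the vector $Q_i^{\theta^{(t)}}(s,\cdot)$ (both computable from the model in the tabular setting), and a Euclidean projection onto the probability simplex such as the routine spelled out in Appendix H; the function names are illustrative and are not from the paper's code.

```python
import numpy as np

def gradient_play_step(theta_is, d_s, q_is, eta, gamma, proj_simplex):
    """One gradient-play update (22) for agent i at state s.

    theta_is     : current policy theta_{i,s}^{(t)}, a probability vector over A_i
    d_s          : discounted state visitation d_{theta^{(t)}}(s) under the current joint policy
    q_is         : the |A_i|-dimensional vector Q_i^{theta^{(t)}}(s, .)
    eta, gamma   : step size and discount factor
    proj_simplex : Euclidean projection onto the probability simplex (see Appendix H)
    """
    ascent = (eta / (1.0 - gamma)) * d_s * np.asarray(q_is, dtype=float)
    return proj_simplex(np.asarray(theta_is, dtype=float) + ascent)
```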

G Smoothness

Lemma 7. (Smoothness for Direct Distributed Parameterization) Assume that $0 \le r_i(s, a) \le 1$ for all $s, a$ and $i = 1, 2, \dots, n$. Then:
$$\|g(\theta') - g(\theta)\| \le \frac{2}{(1-\gamma)^3}\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)\|\theta' - \theta\|, \qquad (25)$$
where $g(\theta) = \{\nabla_{\theta_i} J_i(\theta)\}_{i=1}^n$. The proof of Lemma 7 depends on the following lemma:

Lemma 8.
$$\big\|\nabla_{\theta_i} J_i(\theta') - \nabla_{\theta_i} J_i(\theta)\big\| \le \frac{2}{(1-\gamma)^3}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n \sqrt{|\mathcal{A}_j|}\,\|\theta'_j - \theta_j\|. \qquad (26)$$

Lemma 7 is a simple corollary of Lemma 8.

Proof. (Lemma 7)

$$\begin{aligned}
\|g(\theta') - g(\theta)\|^2 &= \sum_{i=1}^n \big\|\nabla_{\theta_i} J_i(\theta') - \nabla_{\theta_i} J_i(\theta)\big\|^2 \le \Big(\frac{2}{(1-\gamma)^3}\Big)^2 \sum_i |\mathcal{A}_i|\Big(\sum_{j=1}^n \sqrt{|\mathcal{A}_j|}\,\|\theta'_j - \theta_j\|\Big)^2\\
&\le \Big(\frac{2}{(1-\gamma)^3}\Big)^2 \sum_i |\mathcal{A}_i|\Big(\sum_{j=1}^n |\mathcal{A}_j|\Big)\Big(\sum_{j=1}^n \|\theta'_j - \theta_j\|^2\Big)\\
&= \Big(\frac{2}{(1-\gamma)^3}\Big)^2\Big(\sum_{i=1}^n |\mathcal{A}_i|\Big)^2\|\theta' - \theta\|^2,
\end{aligned}$$
where the second inequality is the Cauchy–Schwarz inequality. This completes the proof.

Lemma 8 is equivalent to the following lemma:

Lemma 9.
$$\frac{\partial\big(J_i(\theta'_i + \alpha u_i, \theta'_{-i}) - J_i(\theta_i + \alpha u_i, \theta_{-i})\big)}{\partial\alpha}\bigg|_{\alpha=0} \le \frac{2}{(1-\gamma)^3}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n \sqrt{|\mathcal{A}_j|}\,\|\theta'_j - \theta_j\|, \quad \forall\,\|u_i\| = 1. \qquad (27)$$

Proof. (Lemma 9) Define:

$$\begin{aligned}
\pi_{i,\alpha}(a_i|s) &:= \pi_{\theta_i+\alpha u_i}(a_i|s) = \theta_{s,a_i} + \alpha u_{a_i,s}, & \pi'_{i,\alpha}(a_i|s) &:= \pi_{\theta'_i+\alpha u_i}(a_i|s) = \theta'_{s,a_i} + \alpha u_{a_i,s},\\
\pi_\alpha(a|s) &:= \pi_{\theta_i+\alpha u_i}(a_i|s)\,\pi_{\theta_{-i}}(a_{-i}|s), & \pi'_\alpha(a|s) &:= \pi_{\theta'_i+\alpha u_i}(a_i|s)\,\pi_{\theta'_{-i}}(a_{-i}|s),\\
Q_i^\alpha(s,a) &:= Q_i^{(\theta_i+\alpha u_i,\,\theta_{-i})}(s,a), & d'_\alpha(s) &:= d_{(\theta'_i+\alpha u_i,\,\theta'_{-i})}(s).
\end{aligned}$$
According to the performance difference lemma,
$$\frac{\partial\big(J_i(\theta'_i+\alpha u_i, \theta'_{-i}) - J_i(\theta_i+\alpha u_i, \theta_{-i})\big)}{\partial\alpha}\bigg|_{\alpha=0} = \frac{1}{1-\gamma}\,\frac{\partial}{\partial\alpha}\Big[\sum_{s,a} d'_\alpha(s)\,\pi'_\alpha(a|s)\, A_i^\alpha(s,a)\Big]_{\alpha=0} = \frac{1}{1-\gamma}\,\frac{\partial}{\partial\alpha}\Big[\sum_{s,a} d'_\alpha(s)\,\big(\pi'_\alpha(a|s) - \pi_\alpha(a|s)\big)\, Q_i^\alpha(s,a)\Big]_{\alpha=0}$$
$$\le \frac{1}{1-\gamma}\big(\text{Part A} + \text{Part B} + \text{Part C}\big),$$
where
$$\text{Part A} := \bigg|\sum_{s,a} d_{\theta'}(s)\Big[\frac{\partial\big(\pi'_\alpha(a|s) - \pi_\alpha(a|s)\big)}{\partial\alpha}\Big]_{\alpha=0} Q_i^{\theta}(s,a)\bigg|, \qquad \text{Part B} := \bigg|\sum_{s,a} d_{\theta'}(s)\big(\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big)\Big[\frac{\partial Q_i^\alpha(s,a)}{\partial\alpha}\Big]_{\alpha=0}\bigg|,$$
$$\text{Part C} := \bigg|\sum_{s,a}\Big[\frac{\partial d'_\alpha(s)}{\partial\alpha}\Big]_{\alpha=0}\big(\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big)\, Q_i^{\theta}(s,a)\bigg|.$$
We bound the three parts in turn. For Part A:
$$\text{Part A} = \bigg|\sum_{s,a} d_{\theta'}(s)\, u_{a_i,s}\big(\pi_{\theta'_{-i}}(a_{-i}|s) - \pi_{\theta_{-i}}(a_{-i}|s)\big)\, Q_i^{\theta}(s,a)\bigg| \qquad (28)$$
$$\le \frac{1}{1-\gamma}\sum_{s} d_{\theta'}(s)\sum_{a_i}|u_{a_i,s}|\sum_{a_{-i}}\big|\pi_{\theta'_{-i}}(a_{-i}|s) - \pi_{\theta_{-i}}(a_{-i}|s)\big| \qquad (29)$$
$$\le \frac{1}{1-\gamma}\Big(\max_s\sum_{a_i}|u_{a_i,s}|\Big)\sum_{s} d_{\theta'}(s)\, 2 d_{TV}\big(\pi_{\theta'_{-i}}(\cdot|s)\,\|\,\pi_{\theta_{-i}}(\cdot|s)\big) \qquad (30)$$
$$\le \frac{1}{1-\gamma}\Big(\max_s\sum_{a_i}|u_{a_i,s}|\Big)\sum_{s} d_{\theta'}(s)\sum_{j\neq i} 2 d_{TV}\big(\pi_{\theta'_j}(\cdot|s)\,\|\,\pi_{\theta_j}(\cdot|s)\big) \qquad (31)$$
$$= \frac{1}{1-\gamma}\Big(\max_s\sum_{a_i}|u_{a_i,s}|\Big)\sum_{s} d_{\theta'}(s)\sum_{j\neq i}\big\|\theta'_{j,s} - \theta_{j,s}\big\|_1 \qquad (32)$$
$$\le \frac{1}{1-\gamma}\sqrt{|\mathcal{A}_i|}\sum_{s} d_{\theta'}(s)\sum_{j\neq i}\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_{j,s} - \theta_{j,s}\big\| \qquad (33)$$
$$\le \frac{1}{1-\gamma}\sqrt{|\mathcal{A}_i|}\sum_{j\neq i}\sqrt{|\mathcal{A}_j|}\sqrt{\sum_s d_{\theta'}(s)^2}\sqrt{\sum_s\big\|\theta'_{j,s} - \theta_{j,s}\big\|^2} \qquad (34)$$
$$= \frac{1}{1-\gamma}\sqrt{|\mathcal{A}_i|}\sum_{j\neq i}\sqrt{|\mathcal{A}_j|}\sqrt{\sum_s d_{\theta'}(s)^2}\,\big\|\theta'_j - \theta_j\big\| \le \frac{1}{1-\gamma}\sqrt{|\mathcal{A}_i|}\sum_{j\neq i}\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\| \le \frac{1}{1-\gamma}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\|,$$
where (28) to (29) is derived from the fact that $|Q_i^{\theta}(s,a)| \le \frac{1}{1-\gamma}$; (30) to (31) relies on the following property of the total variation distance:
$$d_{TV}\big(\pi_{\theta'_{-i}}(\cdot|s)\,\|\,\pi_{\theta_{-i}}(\cdot|s)\big) \le \sum_{j\neq i} d_{TV}\big(\pi_{\theta'_j}(\cdot|s)\,\|\,\pi_{\theta_j}(\cdot|s)\big);$$
and (32) to (33) is derived from
$$\max_s\sum_{a_i}|u_{a_i,s}| \le \sqrt{|\mathcal{A}_i|} \ \ (\text{since }\|u\| \le 1), \qquad \big\|\theta'_{j,s} - \theta_{j,s}\big\|_1 \le \sqrt{|\mathcal{A}_j|}\,\big\|\theta'_{j,s} - \theta_{j,s}\big\|,$$
both of which can be immediately verified by applying the Cauchy–Schwarz inequality.

Before bounding Part B, we first define $\widetilde{P}(\alpha)$ as the state-action transition matrix under $\pi_\alpha$:
$$\widetilde{P}(\alpha) = \Big[\pi_\alpha(a'|s')\, P(s'|s,a)\Big]_{(s,a)\to(s',a')}.$$
Then we have that:
$$\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} = \Big[u_{a'_i,s'}\,\pi_{\theta_{-i}}(a'_{-i}|s')\, P(s'|s,a)\Big]_{(s,a)\to(s',a')}.$$
For an arbitrary vector $x$:
$$\begin{aligned}
\bigg[\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} x\bigg]_{(s,a)} &= \sum_{s',a'} u_{a'_i,s'}\,\pi_{\theta_{-i}}(a'_{-i}|s')\, P(s'|s,a)\, x_{s',a'}\\
&\le \|x\|_\infty \sum_{s',a'} |u_{a'_i,s'}|\,\pi_{\theta_{-i}}(a'_{-i}|s')\, P(s'|s,a)\\
&= \|x\|_\infty \sum_{s'} P(s'|s,a)\sum_{a'_i}|u_{a'_i,s'}|\sum_{a'_{-i}}\pi_{\theta_{-i}}(a'_{-i}|s')\\
&\le \|x\|_\infty \sum_{s'} P(s'|s,a)\,\sqrt{|\mathcal{A}_i|}\sum_{a'_{-i}}\pi_{\theta_{-i}}(a'_{-i}|s')\\
&\le \sqrt{|\mathcal{A}_i|}\,\|x\|_\infty.
\end{aligned}$$
Thus:
$$\bigg\|\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} x\bigg\|_\infty \le \sqrt{|\mathcal{A}_i|}\,\|x\|_\infty.$$
Similarly we can define $\widetilde{P}'(\alpha)$ as the state-action transition matrix under $\pi'_\alpha$, and one can easily check that
$$\bigg\|\frac{\partial\widetilde{P}'(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} x\bigg\|_\infty \le \sqrt{|\mathcal{A}_i|}\,\|x\|_\infty.$$
Define:
$$M(\alpha) := \big(I - \gamma\widetilde{P}(\alpha)\big)^{-1}, \qquad M'(\alpha) := \big(I - \gamma\widetilde{P}'(\alpha)\big)^{-1}.$$
Because
$$M(\alpha) = \big(I - \gamma\widetilde{P}(\alpha)\big)^{-1} = \sum_{n=0}^{\infty}\gamma^n\widetilde{P}(\alpha)^n,$$
every entry of $M(\alpha)$ is nonnegative and $M(\alpha)\mathbf{1} = \frac{1}{1-\gamma}\mathbf{1}$, which implies
$$\|M(\alpha)x\|_\infty \le \frac{1}{1-\gamma}\|x\|_\infty,$$
and similarly
$$\|M'(\alpha)x\|_\infty \le \frac{1}{1-\gamma}\|x\|_\infty.$$
Now we are ready to bound Part B. Because
$$Q_i^\alpha(s,a) = e_{(s,a)}^\top M(\alpha)\, r_i$$
$$\Longrightarrow\ \frac{\partial Q_i^\alpha(s,a)}{\partial\alpha} = e_{(s,a)}^\top\frac{\partial M(\alpha)}{\partial\alpha}\, r_i = \gamma\, e_{(s,a)}^\top M(\alpha)\,\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}\, M(\alpha)\, r_i$$
$$\Longrightarrow\ \bigg|\frac{\partial Q_i^\alpha(s,a)}{\partial\alpha}\bigg| \le \gamma\,\bigg\|M(\alpha)\frac{\partial\widetilde{P}(\alpha)}{\partial\alpha}M(\alpha)\, r_i\bigg\|_\infty \le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}.$$
Thus,
$$\begin{aligned}
\text{Part B} &= \bigg|\sum_{s,a} d_{\theta'}(s)\,\big(\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big)\Big[\frac{\partial Q_i^\alpha(s,a)}{\partial\alpha}\Big]_{\alpha=0}\bigg|\\
&\le \sum_{s,a} d_{\theta'}(s)\,\big|\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big|\,\bigg|\Big[\frac{\partial Q_i^\alpha(s,a)}{\partial\alpha}\Big]_{\alpha=0}\bigg|\\
&\le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{s} d_{\theta'}(s)\, 2 d_{TV}\big(\pi_{\theta'}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\\
&\le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{s} d_{\theta'}(s)\sum_{j} 2 d_{TV}\big(\pi_{\theta'_j}(\cdot|s)\,\|\,\pi_{\theta_j}(\cdot|s)\big)\\
&= \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{s} d_{\theta'}(s)\sum_{j}\big\|\theta'_{j,s} - \theta_{j,s}\big\|_1\\
&\le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{s} d_{\theta'}(s)\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_{j,s} - \theta_{j,s}\big\|\\
&\le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\sqrt{\sum_s d_{\theta'}(s)^2}\sqrt{\sum_s\big\|\theta'_{j,s} - \theta_{j,s}\big\|^2}\\
&\le \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\|.
\end{aligned}$$

Now let us look at Part C. Since
$$d'_\alpha(s) = (1-\gamma)\sum_{s'}\rho(s')\sum_{a'}\pi'_\alpha(a'|s')\, e_{(s',a')}^\top M'(\alpha)\sum_{a''} e_{(s,a'')},$$
we have
$$\begin{aligned}
\frac{\partial d'_\alpha(s)}{\partial\alpha} &= (1-\gamma)\Bigg(\underbrace{\sum_{s'}\rho(s')\sum_{a'}\frac{\partial\pi'_\alpha(a'|s')}{\partial\alpha}\, e_{(s',a')}^\top}_{v_1^\top} M'(\alpha)\sum_{a''} e_{(s,a'')} + \underbrace{\sum_{s'}\rho(s')\sum_{a'}\pi'_\alpha(a'|s')\, e_{(s',a')}^\top}_{v_2^\top}\frac{\partial M'(\alpha)}{\partial\alpha}\sum_{a''} e_{(s,a'')}\Bigg)\\
&= (1-\gamma)\bigg(v_1^\top M'(\alpha) + \gamma\, v_2^\top M'(\alpha)\,\frac{\partial\widetilde{P}'(\alpha)}{\partial\alpha}\, M'(\alpha)\bigg)\sum_{a''} e_{(s,a'')}.
\end{aligned}$$
Note that $v_1, v_2$ are constant vectors that are independent of the choice of $s$. Additionally:
$$\|v_1\|_1 = \bigg\|\sum_{s}\rho(s)\sum_{a}\frac{\partial\pi'_\alpha(a|s)}{\partial\alpha}\, e_{(s,a)}\bigg\|_1 = \sum_{s}\rho(s)\sum_{a}\bigg|\frac{\partial\pi'_\alpha(a|s)}{\partial\alpha}\bigg| = \sum_{s}\rho(s)\sum_{a}|u_{a_i,s}|\,\pi_{\theta'_{-i}}(a_{-i}|s) \le \sum_{s}\rho(s)\sum_{a_i}|u_{a_i,s}| \le \sqrt{|\mathcal{A}_i|},$$
$$\|v_2\|_1 = \bigg\|\sum_{s}\rho(s)\sum_{a}\pi'_\alpha(a|s)\, e_{(s,a)}\bigg\|_1 = \sum_{s}\rho(s)\sum_{a}\pi'_\alpha(a|s) = 1.$$
Thus:
$$\begin{aligned}
\text{Part C} &= \bigg|\sum_{s,a}\Big[\frac{\partial d'_\alpha(s)}{\partial\alpha}\Big]_{\alpha=0}\big(\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big)\, Q_i^{\theta}(s,a)\bigg|\\
&= (1-\gamma)\,\bigg|\bigg(v_1^\top M'(0) + \gamma\, v_2^\top M'(0)\Big[\frac{\partial\widetilde{P}'(\alpha)}{\partial\alpha}\Big]_{\alpha=0} M'(0)\bigg)\underbrace{\sum_{s}\sum_{a'} e_{(s,a')}\sum_{a}\big(\pi_{\theta'}(a|s) - \pi_\theta(a|s)\big)\, Q_i^{\theta}(s,a)}_{v_3}\bigg|\\
&\le (1-\gamma)\bigg(\frac{1}{1-\gamma}\|v_1\|_1\|v_3\|_\infty + \frac{\gamma}{(1-\gamma)^2}\sqrt{|\mathcal{A}_i|}\,\|v_2\|_1\|v_3\|_\infty\bigg) \le \frac{\sqrt{|\mathcal{A}_i|}}{1-\gamma}\,\|v_3\|_\infty.
\end{aligned}$$
Additionally:
$$\begin{aligned}
[v_3]_{(s_0,a_0)} &= \sum_{a}\big(\pi_{\theta'}(a|s_0) - \pi_\theta(a|s_0)\big)\, Q_i^{\theta}(s_0,a) \le \frac{1}{1-\gamma}\sum_{a}\big|\pi_{\theta'}(a|s_0) - \pi_\theta(a|s_0)\big| = \frac{1}{1-\gamma}\, 2 d_{TV}\big(\pi_{\theta'}(\cdot|s_0)\,\|\,\pi_\theta(\cdot|s_0)\big)\\
&\le \frac{1}{1-\gamma}\sum_{j=1}^n 2 d_{TV}\big(\pi_{\theta'_j}(\cdot|s_0)\,\|\,\pi_{\theta_j}(\cdot|s_0)\big) = \frac{1}{1-\gamma}\sum_{j=1}^n\big\|\theta'_{j,s_0} - \theta_{j,s_0}\big\|_1 \le \frac{1}{1-\gamma}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_{j,s_0} - \theta_{j,s_0}\big\| \le \frac{1}{1-\gamma}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\|.
\end{aligned}$$
Combining the above inequalities we get:
$$\text{Part C} \le \frac{\sqrt{|\mathcal{A}_i|}}{1-\gamma}\,\|v_3\|_\infty \le \frac{\sqrt{|\mathcal{A}_i|}}{(1-\gamma)^2}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\|.$$
Summing up Parts A–C we get:
$$\frac{\partial\big(J_i(\theta'_i+\alpha u_i, \theta'_{-i}) - J_i(\theta_i+\alpha u_i, \theta_{-i})\big)}{\partial\alpha}\bigg|_{\alpha=0} \le \frac{1}{1-\gamma}\big(\text{Part A} + \text{Part B} + \text{Part C}\big) \le \frac{2}{(1-\gamma)^3}\sqrt{|\mathcal{A}_i|}\sum_{j=1}^n\sqrt{|\mathcal{A}_j|}\,\big\|\theta'_j - \theta_j\big\|,$$
which completes the proof.

H Auxiliary

We recall Lemma 6.

Lemma 6. Let $\mathcal{X}$ denote the probability simplex of dimension $n$. Suppose $\theta \in \mathcal{X}$, $g \in \mathbb{R}^n$, and that there exist $i^* \in \{1, 2, \dots, n\}$ and $\Delta > 0$ such that:
$$\theta_{i^*} \ge \theta_i, \quad \forall\, i \neq i^*, \qquad g_{i^*} \ge g_i + \Delta, \quad \forall\, i \neq i^*.$$
Let $\theta' = \operatorname{Proj}_{\mathcal{X}}(\theta + g)$. Then:
$$\theta'_{i^*} \ge \min\Big\{1,\ \theta_{i^*} + \frac{\Delta}{2}\Big\}.$$

Proof. Let $y = \theta + g$. Without loss of generality, assume that $i^* = 1$ and that
$$y_1 > y_2 \ge y_3 \ge \cdots \ge y_n.$$

Using the KKT conditions, one can derive an efficient algorithm for computing $\operatorname{Proj}_{\mathcal{X}}(y)$ [Wang and Carreira-Perpiñán, 2013], which consists of the following steps (a NumPy sketch of this procedure is given after the proof):

1. Find $\rho := \max\big\{1 \le j \le n : y_j + \frac{1}{j}\big(1 - \sum_{i=1}^{j} y_i\big) > 0\big\}$;

2. Set $\lambda := \frac{1}{\rho}\big(1 - \sum_{i=1}^{\rho} y_i\big)$;

3. Set $\theta'_i = \max\{y_i + \lambda, 0\}$.

From the algorithm, we have that:
$$\lambda = \frac{1}{\rho}\Big(1 - \sum_{i=1}^{\rho} y_i\Big) = \frac{1}{\rho}\Big(1 - \sum_{i=1}^{\rho}(\theta_i + g_i)\Big) = \frac{1}{\rho}\Big(1 - \sum_{i=1}^{\rho}\theta_i\Big) - \frac{1}{\rho}\sum_{i=1}^{\rho} g_i \ge -\frac{1}{\rho}\sum_{i=1}^{\rho} g_i.$$
If $\rho \ge 2$,
$$\theta'_1 = \max\{y_1 + \lambda, 0\} \ge y_1 + \lambda \ge \theta_1 + g_1 - \frac{1}{\rho}\sum_{i=1}^{\rho} g_i \ge \theta_1 + \Big(1 - \frac{1}{\rho}\Big)g_1 - \frac{1}{\rho}\sum_{i=2}^{\rho}(g_1 - \Delta) = \theta_1 + \frac{\rho-1}{\rho}\Delta \ge \theta_1 + \frac{\Delta}{2}.$$
If $\rho = 1$,
$$\theta'_1 = y_1 + \lambda = y_1 + (1 - y_1) = 1.$$
Thus:
$$\theta'_1 \ge \min\Big\{1,\ \theta_1 + \frac{\Delta}{2}\Big\},$$
which completes the proof.
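For reference, here is a short NumPy implementation of the three steps above, together with a numerical spot-check of the conclusion of Lemma 6. This is an illustrative aside (assuming the sort-based form of the algorithm from Wang and Carreira-Perpiñán [2013]), not code from the paper.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection onto the probability simplex (steps 1-3 above).
    y need not be pre-sorted; a copy is sorted in descending order internally."""
    u = np.sort(y)[::-1]                                 # y_1 >= y_2 >= ... >= y_n
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]     # step 1 (0-indexed, i.e. rho+1 entries)
    lam = (1.0 - css[rho]) / (rho + 1)                   # step 2
    return np.maximum(y + lam, 0.0)                      # step 3

# Spot-check of Lemma 6: if entry 0 of theta is (weakly) the largest and g_0 exceeds every
# other g_i by at least Delta, the projected point gains at least Delta/2 on entry 0 (or hits 1).
rng = np.random.default_rng(0)
theta = np.sort(proj_simplex(rng.random(5)))[::-1]       # a point on the simplex, entry 0 largest
g = rng.normal(size=5)
Delta = 0.3
g[0] = g[1:].max() + Delta
theta_new = proj_simplex(theta + g)
assert theta_new[0] >= min(1.0, theta[0] + Delta / 2) - 1e-12
```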

I Numerical Simulation Details

Verification of the fully mixed NE in Game 1 We now verify that joining network 1 with probability $\frac{1-3\epsilon}{3(1-2\epsilon)}$, i.e.,
$$\pi_{\theta_i}(a_i = 1|s) = \frac{1-3\epsilon}{3(1-2\epsilon)}, \quad \forall\, s \in \mathcal{S},\ i = 1, 2,$$
is indeed a NE. First, observe that
$$\Pr^{\theta}(s_{i,t+1} = 1) = \frac{1-3\epsilon}{3(1-2\epsilon)}\, P(s_{i,t+1}=1\,|\,a_{i,t}=1) + \Big(1 - \frac{1-3\epsilon}{3(1-2\epsilon)}\Big) P(s_{i,t+1}=1\,|\,a_{i,t}=2) = \frac{1-3\epsilon}{3(1-2\epsilon)}\,(1-\epsilon) + \Big(1 - \frac{1-3\epsilon}{3(1-2\epsilon)}\Big)\epsilon = \frac{1}{3}.$$
Thus,
$$V(s) = r(s) + \mathbb{E}_{s_t}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t)\Big] = r(s) + \frac{2\gamma}{3(1-\gamma)},$$
$$\begin{aligned}
Q_i^{\theta}(s, a_i) &= r(s) + \gamma\sum_{s',a_{-i}} P(s'|a_i, a_{-i})\,\pi_{\theta_{-i}}(a_{-i}|s)\, V(s')\\
&= r(s) + \gamma\sum_{s'_i\in\{1,2\}}\Big(P(s'_i|a_i)\Pr^{\theta}(s_{-i}=1)\, r(s'_i, s_{-i}=1) + P(s'_i|a_i)\Pr^{\theta}(s_{-i}=2)\, r(s'_i, s_{-i}=2)\Big) + \frac{2\gamma^2}{3(1-\gamma)}\\
&= r(s) + \gamma P(s'_i=1|a_i)\Big(\frac{1}{3}\, r(s'_i=1, s_{-i}=1) + \frac{2}{3}\, r(s'_i=1, s_{-i}=2)\Big) + \gamma P(s'_i=2|a_i)\Big(\frac{1}{3}\, r(s'_i=2, s_{-i}=1) + \frac{2}{3}\, r(s'_i=2, s_{-i}=2)\Big) + \frac{2\gamma^2}{3(1-\gamma)}\\
&= r(s) + \frac{2}{3}\gamma + \frac{2\gamma^2}{3(1-\gamma)} = r(s) + \frac{2\gamma}{3(1-\gamma)} = V(s),
\end{aligned}$$
which implies that
$$(\theta'_i - \theta_i)^\top\nabla_{\theta_i} J_i(\theta) = 0, \quad \forall\, \theta'_i \in \mathcal{X}_i,\ i = 1, 2,$$
i.e., $\theta$ satisfies first-order stationarity. Since $d_\theta(s) > 0$ holds for any valid $\theta$, by Theorem 1, $\theta$ is a NE.
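As a quick sanity check on the first display above, the following snippet (an illustrative aside, not part of the paper's experiments) symbolically verifies that the mixing probability $\frac{1-3\epsilon}{3(1-2\epsilon)}$, combined with the transition probabilities $P(s_{i,t+1}=1\,|\,a_{i,t}=1) = 1-\epsilon$ and $P(s_{i,t+1}=1\,|\,a_{i,t}=2) = \epsilon$ read off from that display, yields $\Pr^{\theta}(s_{i,t+1}=1) = 1/3$ for every $\epsilon$.

```python
import sympy as sp

eps = sp.symbols('epsilon')
p_join_1 = (1 - 3 * eps) / (3 * (1 - 2 * eps))             # probability of joining network 1
prob_next_is_1 = p_join_1 * (1 - eps) + (1 - p_join_1) * eps
print(sp.simplify(prob_next_is_1))                          # prints 1/3
```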

Computation of strict NEs in Game 1 The computation of strict NEs is done numerically, using the criterion in Lemma 3. We enumerate over all $2^8$ possible deterministic policies and check whether the conditions in Lemma 3 hold. For $\epsilon = 0.1$, $\gamma = 0.95$, and an initial distribution set as
$$\rho(s_1 = i, s_2 = j) = 1/4, \quad i, j \in \{1, 2\},$$
the numerical calculation shows that there exist 13 different strict NEs.
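A minimal sketch of this enumeration is given below, assuming a two-agent game with two actions per agent and four joint states, as in Game 1. The helper `averaged_advantage`, which should return $A_i^{\theta}(s, a_i)$ computed exactly from the game model (transitions, rewards, $\gamma$, $\rho$), is deliberately left abstract and passed in by the caller; all names here are illustrative rather than the code used for the paper's experiments.

```python
import itertools

N_AGENTS, N_STATES, N_ACTIONS = 2, 4, 2   # Game 1: 2 agents, 4 joint states, 2 actions each

def is_strict_ne(theta, averaged_advantage):
    """Criterion of Lemma 3 for a deterministic joint policy theta = (pi_1, pi_2), where
    pi_i maps each state index to the chosen action: every non-selected action must have
    strictly negative averaged advantage A_i^theta(s, a_i)."""
    return all(
        averaged_advantage(theta, i, s, a_i) < 0
        for i in range(N_AGENTS)
        for s in range(N_STATES)
        for a_i in range(N_ACTIONS)
        if a_i != theta[i][s]
    )

# All 2^8 = 256 deterministic joint policies: each agent maps each of the 4 states to an action.
deterministic_policies = [
    (pi_1, pi_2)
    for pi_1 in itertools.product(range(N_ACTIONS), repeat=N_STATES)
    for pi_2 in itertools.product(range(N_ACTIONS), repeat=N_STATES)
]
assert len(deterministic_policies) == 2 ** 8

# Given a model-based `averaged_advantage`, the strict NEs would be recovered by
#   strict_nes = [th for th in deterministic_policies if is_strict_ne(th, averaged_advantage)]
# which, for epsilon = 0.1, gamma = 0.95 and the uniform rho above, yields 13 strict NEs.
```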
