
Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning

Songtao Lu1, Kaiqing Zhang2, Tianyi Chen3, Tamer Başar2, Lior Horesh1
1IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, USA
2University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
3Rensselaer Polytechnic Institute, Troy, New York 12144, USA
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This paper deals with distributed reinforcement learning problems with safety constraints. In particular, we consider a team of agents that cooperate in a shared environment, where each agent has its individual reward function and safety constraints that involve all agents' joint actions. As such, the agents aim to maximize the team-average long-term return, subject to all the safety constraints. More intriguingly, no central controller is assumed to coordinate the agents, and both the rewards and constraints are only known to each agent locally/privately. Instead, the agents are connected by a peer-to-peer communication network to share information with their neighbors. In this work, we first formulate this problem as a distributed constrained Markov decision process (D-CMDP) with networked agents. Then, we propose a decentralized policy gradient (PG) method, Safe Dec-PG, to perform policy optimization based on this D-CMDP model over a network. Convergence guarantees, together with numerical results, showcase the superiority of the proposed algorithm. To the best of our knowledge, this is the first decentralized PG algorithm that accounts for coupled safety constraints with a quantifiable convergence rate in multi-agent reinforcement learning. Finally, we emphasize that our algorithm is also novel in solving a class of decentralized stochastic nonconvex-concave minimax optimization problems, where both the algorithm design and the corresponding theoretical analysis are of independent interest.

Introduction

Reinforcement learning (RL) has achieved tremendous success in many sequential decision-making problems (Mnih et al. 2015; Sutton and Barto 2018), such as operations research, optimal control, bounded rationality, and machine learning, where an agent explores its interactions with an environment so as to maximize a cumulative reward through this learning process. Beyond applying classical RL techniques to control systems, physical constraints or safety considerations are also key components in determining the performance of an RL system. This is especially important in multi-agent RL (MARL), which models the sequential decision-making of multiple agents in a shared environment, where each agent's objective and the system evolution are both affected by the decisions made by all agents (Nguyen et al. 2014).

Background of Multi-Agent RL

The studies of MARL can be traced back to Q-learning in (Claus and Boutilier 1998) and (Wolpert, Wheeler, and Tumer 1999), with applications to network routing (Boyan and Littman 1994) and power network control (Schneider et al. 1999). However, the algorithms involved in these works are heuristic, without performance guarantees. Recent empirical results of deep multi-agent collaborative RL algorithms can also be found in (Gupta, Egorov, and Kochenderfer 2017; Lowe et al. 2017; Omidshafiei et al. 2017). One of the earliest distributed RL algorithms with convergence guarantees was reported in (Lauer and Riedmiller 2000), which is tailored to the tabular multi-agent Markov decision process (MDP) setting; another is (Nguyen et al. 2014). Later, a distributed Q-learning algorithm was developed that provably learns the desired value function and the optimal stationary control policy at each network agent through a consensus network, where each agent can only communicate with its neighbors (Kar, Moura, and Poor 2013). In the same setup, fully decentralized actor-critic algorithms with function approximation were developed in (Zhang et al. 2018) to handle large or even continuous state-action spaces. However, the convergence in (Zhang et al. 2018) was again established only in an asymptotic sense. For a fixed policy, decentralized policy evaluation (value function approximation) approaches for MARL have been studied in (Wai et al. 2018; Doan, Maguluri, and Romberg 2019; Qu et al. 2019). (Please see also the recent surveys (Zhang, Yang, and Başar 2019; Lee et al. 2020) and the references therein.)

Related Work

Decentralized and distributed algorithms with quantifiable convergence rate guarantees have been developed in the optimization community for decades (Nedic, Ozdaglar, and Parrilo 2010) in various scenarios, including (strongly) convex and non-convex cases. Recent advances in distributed non-convex optimization show that decentralized stochastic gradient descent or tracking (DSGD/DSGT) is able to train neural networks much faster than centralized algorithms in terms of running time (Lian et al. 2017; Lu et al. 2019). It has also been shown in theory that decentralized optimization enjoys a linear speed-up over its centralized counterpart in terms of the number of nodes (Lian et al. 2017; Tang et al. 2018; Lu and Wu 2020).

Algorithm                         | Rate      | Decentralized | Implementation
PGSMD (Rafique et al. 2018)       | O(ε^-6)   | No            | double-loop
GDA (Lin, Jin, and Jordan 2020)   | O(ε^-8)   | No            | single-loop
Safe Dec-PG (this work)           | O(ε^-4)   | Yes           | single-loop

Table 1: A comparison of stochastic non-convex concave min-max algorithms with convergence to first-order stationary points (FOSPs).

Moreover, in practice, the data would be collected through sensors over a network, so distributed learning becomes one of the most powerful signal, data, and information processing tools. (Please see the survey (Chang et al. 2020) on recent distributed non-convex optimization algorithms and their applications.) However, the safe RL problem does not only maximize rewards but also takes practical issues into account, or introduces some prior knowledge of the model in advance, by incorporating multiple cumulative long-term reward functions as constraints (Paternain et al. 2019a; Wachi and Sui 2020). Unfortunately, none of the existing works deal with safety constraints that are also non-convex, not to mention their distributed implementation over a network.

Through the primal-dual optimization framework, the safe RL problem can be formulated in a min-max form by the method of Lagrange multipliers, i.e., by dualizing the constraints (Boyd and Vandenberghe 2004). However, different from classical machine learning models, e.g., support vector machines and least squares regression, the policy in RL is mostly parametrized by a (deep) neural network, so the cumulative reward functions are non-convex. Hence, the duality gap in this case is not zero in general, which makes the optimization process much more difficult than the traditional convex-concave min-max problem, even in the centralized setting. Interestingly, some recent exciting results illustrate that the duality gap in safe RL problems could be zero (Paternain et al. 2019b), by assuming some oracle that can find the global optimal solution of the Lagrangian with respect to the policy. It is inspiring that safe RL might be solved efficiently to high-quality solutions by non-convex min-max solvers.

During the last few years, solving non-convex min-max saddle-point problems has gained huge popularity and shown significant power in many machine learning and/or artificial intelligence problems, including adversarial learning, robust neural network or generative adversarial network (GAN) training, and fair resource allocation (Razaviyayn et al. 2020). The main idea behind these algorithms is to perform gradient descent and ascent with respect to the objective function, as in the gradient descent ascent (GDA) algorithm (Lin, Jin, and Jordan 2020), multi-GDA (Nouiehed et al. 2019), the proximally guided stochastic mirror descent method (PGSMD) (Rafique et al. 2018), and hybrid block successive approximation (HiBSA) (Lu et al. 2020). The difference between GDA and multi-GDA is that the latter performs multiple steps of gradient ascent updates instead of one. Among these algorithms, HiBSA achieves the fastest convergence rate with only a single-loop update rule for the optimization variables in the deterministic non-convex case. However, there is no theoretical guarantee that HiBSA can handle the stochasticity of the samples in non-convex (strongly) concave min-max problems. Further, all these algorithms are centralized, so it is not clear whether they can be used for a multi-agent system. Recently, there have been some interesting works regarding distributed training for a class of GANs (Liu et al. 2020a,b), where the problem is formulated as a decentralized non-convex saddle-point problem. But both of them require that the objective function satisfy the Minty variational inequality (MVI); otherwise, these methods cannot converge to an ε-first-order stationary point (FOSP) of the considered problem even if the number of iterations is infinite. In RL/MARL, there is no evidence indicating that the discounted cumulative reward function satisfies the MVI, again due to the nonconvexity of the objective when the policy at each node is parametrized by a neural network.

Main Contributions

In this work, by leveraging the min-max saddle-point formulation, we propose the first safe decentralized policy gradient (PG) descent and ascent algorithm, Safe Dec-PG, which is able to deal with a class of multi-agent safe RL problems over a graph. Importantly, we provide theoretical results that quantify the convergence rate of Safe Dec-PG to an ε-first-order stationary point (FOSP) of the considered non-convex min-max problem in the order of 1/ε^4 (or equivalently, the optimality gap shrinks in the order of 1/√N, where N denotes the total number of iterations). When the graph is fully connected in the sense that there is no consensus error (each agent can know all the other agents' policies at each iteration), Safe Dec-PG reduces to a centralized algorithm. Even in this case, the obtained convergence rate is still the state-of-the-art result to the best of our knowledge. A more detailed comparison between the proposed Safe Dec-PG and other existing stochastic non-convex concave min-max algorithms in the centralized setting is shown in Table 1. The main advantages of Safe Dec-PG are highlighted as follows:

• (Simplicity) The structure of the algorithm is single-loop, where the only parameters that need to be tuned are the stepsizes in the minimization and maximization subproblems.
• (Theoretical Guarantees) It is theoretically provable that Safe Dec-PG finds an ε-FOSP of the formulated non-convex min-max problem within O(1/ε^4) iterations, matching the standard convergence rate of centralized stochastic gradient descent (SGD) and decentralized SGD to ε-FOSPs in non-convex scenarios.
• (Applicability) Safe Dec-PG is also a general optimization problem solver, which can be applied to many non-convex min-max problems beyond RL/MARL, and it can be implemented either in a decentralized way over a network or on a single machine.

Multiple numerical results showcase the superiority of the algorithm applied to safe decentralized RL problems compared with classic decentralized methods without safety considerations. Due to the page limitation, proofs of all the lemmas, the main theorem, and additional numerical results are included in the supplemental materials.
Safe MARL with Decentralized Agents

In this section, we introduce the background and formulation of the safe MARL problem with decentralized agents.

Multi-Agent Constrained Markov Decision Process (M-CMDP)

Consider a team of n agents operating in a common environment, denoted by N = [n]. No central controller exists to either make the decisions or collect any information for the agents. Agents are instead allowed to communicate with each other over a communication network G = (N, E), with E being the set of communication links that connect the agents. Such a decentralized model with networked agents finds broad applications in distributed cooperative control problems (Fax and Murray 2004; Corke, Peterson, and Rus 2005; Dall'Anese, Zhu, and Giannakis 2013), and has been advocated as one of the most popular paradigms in decentralized MARL (Zhang et al. 2018; Wai et al. 2018; Doan, Maguluri, and Romberg 2019; Qu et al. 2019; Zhang, Yang, and Başar 2019; Lee et al. 2020). More importantly, each agent has some safety constraints, in the form of bounds on some long-term costs, that involve the joint policy of all agents. We formally introduce the following model of networked multi-agent constrained MDP (M-CMDP) to characterize this setting.

Definition 1 (Networked Multi-agent CMDP (M-CMDP)). A networked multi-agent CMDP is described by a tuple (S, {A_i}_{i∈N}, P, {R_i}_{i∈N}, G, {C_i}_{i∈N}, γ), where S is the state space shared by all the agents, A_i is the action space of agent i, and G is a communication network (a well-connected graph). Let A = ∏_{i=1}^n A_i be the joint action space of all agents; then, R_i : S × A → R and C_i : S × A → R are the local reward and cost functions of agent i, and P : S × A × S → [0, 1] is the state transition probability of the MDP. γ ∈ (0, 1) denotes the discount factor. The states s and actions a are globally observable, while the rewards and costs are observed locally/privately at each agent.

The networked M-CMDP proceeds as follows. At time t, each agent i chooses its own action a_i^t given s^t, according to its local policy π_i : S → Δ(A_i), which is usually parametrized as π_{w_i} by some parameter w_i ∈ Θ_i with dimension d_i. The networked agents try to learn a joint policy π_θ : S → Δ(A) given by π_θ(s, a) = ∏_{i∈N} π_{w_i}(s, a_i) with θ = [w_1^⊤, ..., w_n^⊤]^⊤ ∈ R^d, where d = ∑_{i=1}^n d_i denotes the whole problem dimension. As a team, the objective of all agents is to collaboratively maximize the globally averaged return over the network (equivalently, to minimize its opposite), dictated by R(s, a) = n^{-1} ∑_{i∈N} R_i(s, a), with only local observations of the rewards, subject to the safety constraints dictated by C_i(s, a). At each node, there can be multiple safety constraints. These constraint rewards describe different objectives that the agent is required to achieve, such as remaining within a region of the state space, or not running out of memory/battery. Here, we assume that each agent is associated with m cost functions, so C_i(s, a) is a mapping from S × A to R^m.

Specifically, the team aims to find the joint policy π_θ that solves

  min_{θ∈Θ}  J_0^R(θ) ≜ E[ −∑_{t≥0} γ^t (1/n) ∑_{i∈N} R_i(s^t, a^t) | s^0, π_θ ]   (1a)
  s.t.  J_i^C(θ) ≜ E[ ∑_{t≥0} γ^t C_i(s^t, a^t) | s^0, π_θ ] ≥ c_i,  ∀i ∈ N,   (1b)

where Θ = ∏_{i=1}^N Θ_i is the joint policy parameter space, J_0^R(θ) corresponds to the negative team-average discounted long-term return, J_i^C(θ) : R^d → R^m denotes the long-term costs of agent i, c_i ∈ R^m, ∀i, are the lower bounds of J_i^C(θ), ∀i, that impose the safety constraints, and the expectation E is taken over all randomness, including the policy and the underlying Markov chain. Each agent i only has access to its own reward and cost R_i and C_i, and the desired bound c_i. Note that our ensuing results can be straightforwardly generalized to the setting where each agent has a different number of costs, at the expense of unnecessarily complicated notation. In general, the long-term return J_0^R(θ) is non-convex with respect to the policy parameter θ (Zhang et al. 2020; Liu et al. 2019; Agarwal et al. 2020), and so are the functions J_i^C(θ), ∀i, which makes the problem challenging to solve using first-order PG methods.
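To make the joint-policy factorization π_θ(s, a) = ∏_{i∈N} π_{w_i}(s, a_i) concrete, the following minimal sketch shows how independent local policies combine into a joint policy. It is an illustration only, not the authors' implementation; the linear-softmax parametrization and the 5-action space are assumptions.

```python
import numpy as np

def local_policy(w_i, s):
    # Softmax policy over the agent's own actions, linear in the global state.
    logits = w_i @ s                      # w_i has shape (n_actions, dim(s))
    p = np.exp(logits - logits.max())
    return p / p.sum()

def joint_policy_prob(ws, s, joint_action):
    # pi_theta(s, a) = prod_i pi_{w_i}(s, a_i): agents act independently
    # given the globally observable state s.
    prob = 1.0
    for w_i, a_i in zip(ws, joint_action):
        prob *= local_policy(w_i, s)[a_i]
    return prob

# Example: 3 agents, 5 actions each, 4-dimensional state.
rng = np.random.default_rng(0)
ws = [rng.normal(size=(5, 4)) for _ in range(3)]
s = rng.normal(size=4)
print(joint_policy_prob(ws, s, joint_action=(0, 2, 4)))
```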
Primal-Dual for Safe M-CMDP

Viewing the team as a single agent, the problem above falls into the regime of the standard constrained MDP (Altman 1999), which has been widely studied in single-agent safe RL. Nonetheless, in a decentralized paradigm, standard RL algorithms for solving CMDPs are not applicable, as they require instantaneous access to the team-average reward and all cost functions {C_i}_{i∈N} (Borkar 2005; Prashanth and Ghavamzadeh 2016; Achiam et al. 2017; Yu et al. 2019; Paternain et al. 2019b). Instead, we re-formulate the problem as a decentralized non-convex optimization problem with non-convex constraints, in order to develop decentralized policy optimization algorithms. In particular, letting J_i^R(θ) ≜ E(−∑_{t≥0} γ^t R_i(s^t, a^t) | s^0, π_θ), we write the networked M-CMDP as

  min_{θ_i∈Θ}  (1/n) ∑_{i∈N} J_i^R(θ_i)   (2)
  s.t.  θ_i = θ_j, j ∈ N_i,   c_i − J_i^C(θ_i) ≤ 0,  ∀i ∈ N,

where N_i ⊆ N denotes the set of neighboring agents of agent i over the network, and θ_i is the local copy of the policy parameter θ (i.e., the concatenation of all the agents' parameters). By the Lagrangian method (Boyd and Vandenberghe 2004), problem (2) can be written as

  min_{θ_i∈Θ} max_{λ≥0}  L(θ_1, ..., θ_n, λ_1, ..., λ_n)   (3a)
  s.t.  θ_i = θ_j, j ∈ N_i, ∀i,   (3b)

where

  L(θ_1, ..., θ_n, λ_1, ..., λ_n) ≜ (1/n) ∑_{i∈N} J_i^R(θ_i) + ⟨g_i(θ_i), λ_i⟩,

g_i(θ_i) ≜ c_i − J_i^C(θ_i), and λ_1, ..., λ_n denote the dual variables.

Main Challenges of Solving Safe Decentralized RL

To this end, the multi-agent safe RL problem has been formulated as (3). Unfortunately, there is no existing work that can solve this problem to its FOSPs with any theoretical guarantees. The main difficulties are four-fold:

• There are two types of constraints in this problem: the consensus equality constraints and the long-term cumulative reward related inequality constraints.
• The constraints and loss functions are both in an expected discounted cumulative reward form and possibly non-convex, while most classical non-convex algorithms, e.g., those for neural network training, are designed for the case where only the loss functions are non-convex.
• The problem is stochastic in nature and the PG estimate is biased instead of unbiased due to the finite-horizon approximation, so extra effort is needed to quantify how the biased estimates affect the convergence results.
• From a min-max saddle-point perspective, the minimization problem is non-convex and the maximization problem is concave (linear), while a consensus error is coupled with both the minimization and maximization variables. Disentangling this error from the minimization and maximization processes requires significantly different proof techniques compared with the existing theoretical works.

Therefore, solving this family of stochastic non-convex problems over a graph is much more challenging than the classical ones, e.g., centralized min-max saddle-point problems, decentralized consensus problems, stochastic non-convex problems, and so on. Next, we propose a new gradient tracking based single-loop primal-dual algorithm to deal with this M-CMDP problem.
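As a concrete reading of (3), the sketch below evaluates one agent's contribution to the Lagrangian, J_i^R(θ_i) + ⟨c_i − J_i^C(θ_i), λ_i⟩, from Monte Carlo estimates of the two discounted quantities. The helper names are hypothetical and the single-constraint (m = 1) setting is an assumption made only for illustration.

```python
import numpy as np

def discounted_return(values, gamma=0.99):
    # sum_t gamma^t v_t along one sampled trajectory.
    return sum((gamma ** t) * v for t, v in enumerate(values))

def local_lagrangian(J_R_hat, J_C_hat, c_i, lam_i):
    # One agent's term in (3): J_i^R + <c_i - J_i^C, lambda_i>.
    # J_R_hat is the estimated *negative* discounted reward (a loss),
    # J_C_hat is the estimated discounted constraint reward.
    return J_R_hat + np.dot(c_i - J_C_hat, lam_i)

# Example with one trajectory of rewards and constraint rewards.
rewards = [1.0, 0.5, 0.2]
costs = [0.9, 0.8, 0.7]
J_R_hat = -discounted_return(rewards)            # negative return, as in (1a)
J_C_hat = np.array([discounted_return(costs)])
print(local_lagrangian(J_R_hat, J_C_hat, c_i=np.array([0.8]), lam_i=np.array([0.3])))
```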
Safe M-CMDP Algorithm

First, we introduce the safe policy gradient used in Safe Dec-PG.

Safe Policy Gradient

The search for an optimal policy can be performed by applying gradient descent-type iterative methods to the parametrized optimization problem (3). The gradient of each agent's cumulative loss J_i^R(θ_i) in (3) can be written as (Baxter and Bartlett 2001)

  ∇_{θ_i} J_i^R(θ_i) = E[ ∑_{t=0}^∞ ( ∑_{τ=0}^t ∇ log π_i(a_i^τ | s^τ; θ_i) ) γ^t R_i(s^t, a^t) ],

where {a^t, s^t} are obtained from each trajectory under the joint policy (parametrized by {θ_i, ∀i}). When the MDP model is unknown, a stochastic estimate of the PG is often used, that is,

  ∇̂_{θ_i} J_i^R(θ_i) = ∑_{t=0}^∞ ( ∑_{τ=0}^t ∇ log π_i(a_i^τ | s^τ; θ_i) ) γ^t R_i(s^t, a^t),   (4)

which was proposed in (Baxter and Bartlett 2001) and is called the gradient of a partially observable MDP (abbreviated as G(PO)MDP PG). The G(PO)MDP gradient is an unbiased estimator of the PG (Papini et al. 2018; Xu, Gao, and Gu 2020).

Likewise, the stochastic PG estimate of each agent's J_i^C(θ_i) in (3) can be written as

  ∇̂_{θ_i} J_i^C(θ_i) = ∑_{t=0}^∞ ( ∑_{τ=0}^t ∇ log π_i(a_i^τ | s^τ; θ_i) ) γ^t C_i(s^t, a^t).

Let f_i(θ_i, λ_i) ≜ J_i^R(θ_i) + ⟨c_i − J_i^C(θ_i), λ_i⟩, ∀i, for notational simplicity. Then, the stochastic policy gradients with respect to the primal variables are

  ∇̂_{θ_i} f_i(θ_i, λ_i) = ∇̂_{θ_i} J_i^R(θ_i) − ⟨∇̂_{θ_i} J_i^C(θ_i), λ_i⟩, ∀i,   (5)

and the policy gradients with respect to the dual variables are

  ∇̂_{λ_i} f_i(θ_i, λ_i) = c_i − Ĵ_i^C(θ_i), ∀i,   (6)

where Ĵ_i^C(θ_i) ≜ ∑_{t=0}^∞ γ^t C_i(s^t, a^t | s^0, π_θ). Note that the stochastic gradients in (5) and (6) use only one trajectory of the Markov chain, which may incur large variance. Akin to mini-batching in SGD, a natural solution is to average over K trajectories to obtain the policy gradients with respect to the primal variables, denoted by ∇̂^K_{θ_i} f_i(θ_i, λ_i), ∀i, and with respect to the dual variables, denoted by ∇̂^K_{λ_i} f_i(θ_i, λ_i), ∀i.

In simulations, sampling an infinite trajectory may not be tractable, and a finite-horizon approximation of the PGs (5) and (6) is usually used (Chen et al. 2018), denoted by ∇̂^{T,K}_{θ_i} f_i(θ_i, λ_i) and ∇̂^{T,K}_{λ_i} f_i(θ_i, λ_i). Also, we have a set of globally observable states and actions denoted by {a_k^τ, s_k^τ}, where k denotes the index of the trajectory and τ denotes the time index. Consequently, the stochastic estimate of the PG with K trajectories (samples) and a finite-horizon truncation of length T can be expressed as

  ∇̂^{T,K}_{θ_i} f_i(θ_i, λ_i) = ∇̂^{T,K}_{θ_i} J_i^R(θ_i) − ⟨∇̂^{T,K}_{θ_i} J_i^C(θ_i), λ_i⟩,   (7a)
  ∇̂^{T,K}_{λ_i} f_i(θ_i, λ_i) = c_i − (Ĵ_i^C)^{T,K}(θ_i) ≜ ĝ_i(θ_i),   (7b)

where

  ∇̂^{T,K}_{θ_i} J_i^R(θ_i) ≜ (1/K) ∑_{k=1}^K ∑_{t=0}^T ( ∑_{τ=0}^t ∇ log π_i(a_{i,k}^τ | s_k^τ; θ_i) ) γ^t R_i(s_k^t, a_k^t),   (8)

  ∇̂^{T,K}_{θ_i} J_i^C(θ_i) ≜ (1/K) ∑_{k=1}^K ∑_{t=0}^T ( ∑_{τ=0}^t ∇ log π_i(a_{i,k}^τ | s_k^τ; θ_i) ) γ^t C_i(s_k^t, a_k^t),   (9)

and (Ĵ_i^C)^{T,K}(θ_i) ≜ K^{-1} ∑_{k=1}^K ∑_{t=0}^T γ^t C_i(s_k^t, a_k^t). Note that the finite-horizon truncation makes the stochastic PG estimate biased.
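A minimal sketch of the truncated G(PO)MDP estimator in (8) follows; the same loop with C_i in place of R_i gives (9). The score-function routine and the trajectory format are assumptions for illustration, not the authors' code.

```python
import numpy as np

def gpomdp_gradient(score_fn, trajectories, gamma=0.99):
    """Estimate (8): average over K trajectories of
    sum_{t<=T} (sum_{tau<=t} grad log pi(a_tau|s_tau)) * gamma^t * r_t.

    score_fn(s, a_i) must return grad_{theta_i} log pi_i(a_i | s; theta_i).
    Each trajectory is a list of (s, a_i, r) tuples of length T + 1.
    """
    grad = None
    for traj in trajectories:
        running_score = None
        for t, (s, a_i, r) in enumerate(traj):
            g = score_fn(s, a_i)
            running_score = g if running_score is None else running_score + g
            contrib = (gamma ** t) * r * running_score
            grad = contrib if grad is None else grad + contrib
    return grad / len(trajectories)

# Example with a dummy (placeholder) score function.
def score_fn(s, a_i):
    return np.ones(4) * (a_i - 1.0)   # stands in for grad log pi

trajs = [[(np.zeros(4), 1, 0.5), (np.zeros(4), 0, 0.2)] for _ in range(3)]
print(gpomdp_gradient(score_fn, trajs))
```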
Safe Dec-PG: Safe Decentralized Policy Gradient

After obtaining the PG estimates, the proposed Safe Dec-PG algorithm is given as follows. For notational simplicity, in the following we assume the problem dimension is 1.

Algorithm 1: Safe Dec-PG
  Input: θ_i^0; ϑ_i^0 = λ_i^0 = 0, ∀i
  for r = 1, ... do
    for each agent i do
      Update θ_i^{r+1} by (10)
      Perform a rollout to obtain ∇̂^{T,K}_{θ_i} f_i(θ_i^r, λ_i^r)
      Update ϑ_i^{r+1} by (11)
      Calculate (Ĵ_i^C)^{T,K}(θ_i^{r+1})
      Update λ_i^{r+1} by (13)
    end for
  end for

We first update the parameters of the parametrized policy at each node by

  θ_i^{r+1} = ∑_{j∈N_i} W_{ij} θ_j^r − β^r ϑ_i^r,   (10)

where r denotes the iteration index, β^r is the stepsize of the PG descent, ϑ_i^r is an auxiliary (tracking) variable (introduced in more detail in (11)), and W_{ij} is a weight matrix that characterizes the relations among the nodes of graph G.

Next, we provide detailed descriptions of W and ϑ: 1) The weight matrix is doubly stochastic (i.e., the graph is well-connected) and is defined as follows: if there exists a link between node i and node j, then W_{ij} > 0; otherwise W_{ij} = 0, and W satisfies W1 = 1 and 1^⊤W = 1^⊤. There are many ways of designing the weight matrix based on the connectivity of the graph; standard choices include the Metropolis-Hastings weights, maximum-degree weights, and Laplacian weights (Xiao and Boyd 2004; Boyd, Diaconis, and Xiao 2004). 2) Due to the partial observability of each agent, the variable ϑ_i^r is proposed to approximate the full PG of the network (i.e., n^{-1} ∑_{i=1}^n ∇̂_{θ_i} f_i(θ_i, λ_i)), and is updated locally as

  ϑ_i^{r+1} = ∑_{j∈N_i} W_{ij} ϑ_j^r + ∇̂^{T,K}_{θ_i} f_i(θ_i^{r+1}, λ_i^r) − ∇̂^{T,K}_{θ_i} f_i(θ_i^r, λ_i^r), ∀i,   (11)

with ϑ_i^0 ≜ 0, ∀i. This update rule is similar to the (stochastic) gradient tracking technique proposed for classical consensus-based (deterministic or stochastic) distributed optimization problems (Di Lorenzo and Scutari 2016; Sun, Daneshmand, and Scutari 2019). However, since we also have dual variable updates, the evaluated gradient at each iteration also depends on λ_i^r, so it is not clear whether the full PG tracked by ϑ_i^r is still accurate enough for the resulting sequence to converge to the stationary points of problem (3). In our performance analysis, we will show the conditions that ensure the convergence of Safe Dec-PG for solving problem (3).

In this work, instead of performing a vanilla dual update, we propose to add a (quadratic) perturbation term (a.k.a. a smoothing technique) to the maximization procedure as follows:

  λ_i^{r+1} = argmax_{λ_i≥0} ⟨∇̂^{T,K}_{λ_i} f_i(θ_i^{r+1}, λ_i^r), λ_i − λ_i^r⟩ − (1/(2ρ)) ‖λ_i − λ_i^r‖^2 − (γ^r/2) ‖λ_i‖^2, ∀i,   (12)

where ρ > 0 is the stepsize of the PG ascent in updating λ_i^r, and γ^r (to be defined later) is a diminishing parameter. The perturbation term (γ^r/2)‖λ_i‖^2 plays one of the key roles in ensuring the convergence of Safe Dec-PG. It adds some (desired) curvature to subproblem (12) in such a way that it becomes possible to quantify the maximum ascent of our constructed potential function (a Lyapunov-like function used to measure the progress of the proposed algorithm) after the update of λ_i^r. This parameter then gradually reduces the added curvature so that the subproblem resembles the original one, and the obtained solution is an FOSP of problem (3) rather than a deviated one. Note that (12) can also be easily implemented locally by

  λ_i^{r+1} = P_Λ( (1 − ργ^r) λ_i^r + ρ ∇̂^{T,K}_{λ_i} f_i(θ_i^{r+1}, λ_i^r) ), ∀i,   (13)

where P_Λ denotes the projection operator and Λ = {λ_i | λ_i ≥ 0}, ∀i, stands for the feasible set.

It can be seen that one of the major advantages of Safe Dec-PG is the simplicity of its update rules: 1) it is a single-loop algorithm; 2) each variable can be updated locally by exchanging parameters over the communication channel. The following convergence analysis shows that, under some mild conditions, Safe Dec-PG is guaranteed to find the FOSPs of problem (3) by properly controlling the stepsizes used in the minimization and maximization procedures.
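The sketch below strings the updates (10)-(13) together for one iteration at agent i, using placeholder gradient estimates; the function signature and data layout are illustrative assumptions, not the released implementation.

```python
import numpy as np

def safe_dec_pg_step(i, theta, vartheta, lam, W, neighbors,
                     grad_theta_f_new, grad_theta_f_old, grad_lam_f_new,
                     beta_r, rho, gamma_r):
    """One Safe Dec-PG iteration at agent i (scalar-parameter case).

    theta, vartheta, lam : arrays holding all agents' current variables.
    grad_theta_f_old : stochastic PG (7a) at (theta_i^r, lambda_i^r).
    grad_theta_f_new : the same PG re-evaluated at (theta_i^{r+1}, lambda_i^r).
    grad_lam_f_new   : stochastic PG (7b) at (theta_i^{r+1}, lambda_i^r).
    neighbors[i] is assumed to contain i itself so that W[i, i] enters the mix.
    """
    # (10): consensus on theta plus a descent step along the tracked gradient.
    theta_next = sum(W[i, j] * theta[j] for j in neighbors[i]) - beta_r * vartheta[i]

    # (11): gradient tracking -- mix the neighbors' trackers, then correct by
    # the difference of the two most recent local gradient estimates.
    vartheta_next = (sum(W[i, j] * vartheta[j] for j in neighbors[i])
                     + grad_theta_f_new - grad_theta_f_old)

    # (13): projected, perturbed dual ascent (the closed form of (12)).
    lam_next = np.maximum(0.0, (1.0 - rho * gamma_r) * lam[i]
                          + rho * grad_lam_f_new)
    return theta_next, vartheta_next, lam_next
```

In a full run, the two primal gradient arguments would come from the truncated G(PO)MDP estimates (7a)-(9), and β^r and γ^r would decay according to the schedule in (15).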
Performance Analysis of Safe Dec-PG

Before presenting our theoretical results, we first state the standard assumptions.

Assumptions

To begin with, we assume that f_i, g_i, ∀i, satisfy Lipschitz continuity conditions. More specifically, we have:

Assumption 1. Assume that the functions ∇f_i(θ_i, λ_i), ∀i, are L-Lipschitz continuous with respect to θ_i, ∀i, and the functions g_i(θ_i), ∀i, are L'-Lipschitz continuous with respect to θ_i, ∀i.

Next, we assume the connectivity of the graph, which specifies the topology of the communication channel so that the consensus step can be performed in a decentralized way.

Assumption 2. Assume the network is well-connected (a.k.a. strongly connected), i.e., W is a doubly stochastic matrix. Also, λ̄_max(W) ≜ η < 1, where λ̄_max(W) denotes the second largest eigenvalue of the weight matrix W.

Assumption 3. We assume that the rewards in both the objective and the constraints are upper bounded by G, i.e., max{R_i(s^t, a^t), C_i(s^t, a^t), ∀i} ≤ G, and the true PG is upper bounded by G', i.e., ‖∇ log π_i(a_i^τ | s^τ; θ_i)‖ ≤ G', ∀i, τ.

The first part of Assumption 3 requires the boundedness of the instantaneous rewards, which makes sense in practice since physical systems commonly output responses of finite magnitude. The second part requires the gradient of the log of the policies, i.e., ∇ log π_i(a_i^τ | s^τ; θ_i), to be bounded, which can be satisfied by, e.g., parametrized Gaussian policies.

Assumption 4. Assume that the Slater condition is satisfied and the size of Λ is upper bounded by σ_λ, i.e., Λ = {λ_i | λ_i ≥ 0, ‖λ_i‖ ≤ σ_λ}, ∀i.

Convergence Rate

Since the functions J_i^R and J_i^C, ∀i, are possibly non-convex, finding the global optimal solution of this min-max problem is NP-hard in general (Nouiehed, Lee, and Razaviyayn 2018). It is therefore of interest to obtain the FOSPs of problem (1). First, we define the optimality gap as

  G({θ_i, λ_i, ∀i}) = ‖ (1/n) ∑_{i=1}^n ∇f_i(θ_i, λ_i) ‖ + (1/n) ∑_{i=1}^n ‖ λ_i − P_Λ[λ_i + g_i(θ_i)] ‖ + (1/n) ∑_{i=1}^n ‖ θ_i − θ̄ ‖,   (14)

where the first and second terms on the right-hand side of (14) are the standard optimality gaps of non-convex min/max problems, while the third term is the consensus violation gap that characterizes the disagreement among the local parameters over the network, with θ̄ ≜ n^{-1} ∑_{i=1}^n θ_i.

Definition 2. If a point ({θ_i*, λ_i*, ∀i}) satisfies ‖G({θ_i*, λ_i*, ∀i})‖ ≤ ε, then we call this point an ε-approximate first-order stationary point of (3), abbreviated as ε-FOSP.

Remark 1. Note that points ({θ_i*, λ_i*, ∀i}) satisfying the condition G({θ_i*, λ_i*, ∀i}) = 0 are also known as "quasi-Nash equilibrium" points (Pang and Scutari 2011) or "first-order Nash equilibrium" points (Nouiehed et al. 2019).
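For evaluation purposes, the gap (14) can be computed directly from the local quantities. The following is a minimal sketch under the assumption of Euclidean norms and Λ being the nonnegative orthant; the argument layout is hypothetical.

```python
import numpy as np

def stationarity_gap(grads_theta, lams, gs, thetas):
    """Optimality gap (14).

    grads_theta : list of grad_{theta_i} f_i(theta_i, lambda_i)  (arrays)
    lams        : list of dual variables lambda_i  (arrays, >= 0)
    gs          : list of constraint values g_i(theta_i) = c_i - J_i^C(theta_i)
    thetas      : list of local policy parameters theta_i
    """
    n = len(thetas)
    term1 = np.linalg.norm(sum(grads_theta) / n)                  # primal stationarity
    term2 = np.mean([np.linalg.norm(l - np.maximum(0.0, l + g))   # dual stationarity
                     for l, g in zip(lams, gs)])
    theta_bar = sum(thetas) / n
    term3 = np.mean([np.linalg.norm(th - theta_bar) for th in thetas])  # consensus violation
    return term1 + term2 + term3
```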
The convergence results of Safe Dec-PG are given below.

Theorem 1. Suppose Assumptions 1 to 4 hold and the iterates {θ_i^r, ϑ_i^r, λ_i^r, ∀i} are generated by Safe Dec-PG. If the total number of iterations of the algorithm is N and

  T ∼ Ω(log(N)),  γ^r ∼ O(1/√r),  β^r ∼ O(1/√r),   (15)

then we have

  E[G^2({θ_i^{r'}, λ_i^{r'}, ∀i})] ≤ O(log(N)/√N) + O(σ_g^2(T, K)),   (16)

where the constant σ_g^2(T, K) denotes the variance of the PG estimate with respect to the function g(·), and r' is picked randomly from 1, ..., N.

Theorem 1 says that Safe Dec-PG is able to find a solution of (1) at a rate of at least O(log(N)/N^{1/2}) to a neighborhood of an ε-FOSP of this problem, where the radius of this neighborhood is determined by the number of trajectories and the length of the horizon approximation: the more trajectories or the longer the horizon, the smaller the radius will be.

Corollary 1. Suppose Assumptions 1 to 4 hold and the iterates {θ_i^r, ϑ_i^r, λ_i^r, ∀i} are generated by Safe Dec-PG. When T, γ^r, β^r satisfy (15) and K ∼ O(√N), then we have

  E[G^2({θ_i^{r'}, λ_i^{r'}, ∀i})] ≤ O(log(N)/√N),   (17)

where the total number of iterations of the algorithm is N, and r' is picked randomly from 1, ..., N.

Note that the proposed Safe Dec-PG is not only applicable to constrained MDP problems, but is also amenable to solving a wide class of stochastic non-convex concave min-max optimization problems.

Remark 2. To the best of our knowledge, our results are new in both the RL and optimization communities.
• When T is infinitely large, i.e., f(T) = g(T) = 0, Safe Dec-PG reduces to a decentralized stochastic non-convex min-max optimization algorithm. In this regime, Safe Dec-PG also provides the state-of-the-art convergence rate to a neighborhood of FOSPs.
• When K and T are both infinitely large, i.e., f(T) = g(T) = σ_f^2(T, K) = σ_g^2(T, K) = 0, Safe Dec-PG reduces to a deterministic decentralized non-convex min-max algorithm. The convergence rate of Safe Dec-PG is still O(log(N)/N^{1/2}), but with guarantees to the ε-FOSPs, matching the convergence rate of HiBSA in the centralized case.

Remark 3. The number of nodes, n, does not appear in the numerator of the convergence rate result, indicating that the achievable rate in (17) is not slowed down by increasing the number of agents, and the radius of the neighborhood in (16) is not magnified either.

Numerical Results

Problem setting  To show the performance of safe decentralized RL, we test our algorithm on the Cooperative Navigation task in (Lowe et al. 2017), which is built on the popular OpenAI Gym paradigm (Brockman et al. 2016). The experiments were run on an NVIDIA Tesla V100 GPU with 32GB of memory. In the first experiment, we have n = 5 agents aiming at finding their own landmarks, and all agents are connected by a well-connected graph as shown in Figure 1(a), where every agent can only exchange its parameters θ_i with its neighbors through the communication channel (denoted by the green lines in Figure 1(a)).
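One standard way to build the doubly stochastic mixing matrix W required by Assumption 2 is the Metropolis-Hastings rule. The sketch below constructs such a W from an adjacency list; the 5-node ring is a hypothetical example, not necessarily the exact graph of Figure 1(a).

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings weights: W_ij = 1/(1 + max(d_i, d_j)) for edges,
    W_ii = 1 - sum_{j != i} W_ij, and W_ij = 0 otherwise. The result is
    symmetric and doubly stochastic for any connected undirected graph."""
    n = len(adj)
    deg = [len(adj[i]) for i in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in adj[i]:
            W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Hypothetical ring of 5 agents.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
W = metropolis_weights(adj)
print(np.allclose(W.sum(axis=0), 1.0), np.allclose(W.sum(axis=1), 1.0))
```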

agent 1

agent 4

agent 3 agent 2 number of iterations number of iterations (a) Network structure (b) Constrained reward (averaged) (c) Objective reward

Figure 1: (a) Diagram of a decentralized safe RL system, where the green lines denote the communication graph G, the red star represents the landmark, and the blue circles stand for the agents; (b) long-term cumulative reward of the constraints vs. the number of iterations; (c) long-term cumulative reward of the objective functions vs. the number of iterations. The initial stepsizes of Safe Dec-PG and DSGT are both 0.1 and c_i = 0.8, ∀i.

Furthermore, each agent has 5 action options: stay, left, right, up, and down. We assume the states and actions of all the agents to be globally observable. The goal of the teamed agents is to find an optimal policy such that the long-term discounted cumulative reward averaged over the network is maximized while keeping the number of collisions with other agents to a minimum over the long run.

Environment  Different from the existing simulation environment, we create a new one based on the cooperative navigation task, where we set the agents and landmarks as pairs and require that each agent only target its own corresponding landmark. The rewards considered in the objective function include two parts: i) the first is based on the distance between the location of the node and its desired landmark, and is a monotonically decreasing function of that distance (i.e., the smaller the distance, the higher the reward); ii) the second is determined by the minimum distance between two agents. If the distance between two agents falls below a threshold, we consider that a collision happens, and both agents are penalized by a large negative reward value, i.e., -1. Finally, the reward at each agent is further scaled by a different positive coefficient, representing the heterogeneity, e.g., priority levels, of the different agents. The rewards considered in the constraints of (3) are monotonically increasing functions of the minimum distance between two agents, i.e., the closer the two agents are, the lower the reward. Here, since only the minimum distance is taken into account at each node, m = 1.

Parameters  The policy at each agent is parametrized by a neural network with two hidden layers: 30 neurons in the first and 10 in the second. The states of each agent include its position and velocity. Thus, the dimension of the input layer is 20, and that of the output layer is 5. The discount factor γ in the cumulative loss is 0.99 in all the tests, and for each episode, the length of the horizon approximation of the PG is T = 20. Also, we run K = 10 Monte Carlo trials independently to compute the approximate PG at each iteration.
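For reference, a minimal NumPy sketch of the policy architecture described above (20-dimensional input, hidden layers of 30 and 10 units, 5 output actions) is given below; the tanh activations and the initialization scale are assumptions, since the paper does not state them.

```python
import numpy as np

def init_policy(rng, dims=(20, 30, 10, 5)):
    # Two hidden layers (30 and 10 units) between a 20-d input and 5 actions.
    return [(rng.normal(scale=0.1, size=(dims[k], dims[k + 1])),
             np.zeros(dims[k + 1])) for k in range(len(dims) - 1)]

def policy_probs(params, state):
    h = state
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)          # hidden activations (assumed tanh)
    W, b = params[-1]
    logits = h @ W + b
    p = np.exp(logits - logits.max())
    return p / p.sum()                  # distribution over the 5 actions

rng = np.random.default_rng(0)
params = init_policy(rng)
print(policy_probs(params, rng.normal(size=20)))
```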
In this section, we only show the results comparing Safe Dec-PG with DSGT (which has no safety considerations) in Figure 1(b) and Figure 1(c); additional results with more problem settings, e.g., larger networks, are included in the supplemental materials. From Figure 1(b), it can be observed that the averaged network constrained rewards obtained by Safe Dec-PG are much higher than the ones achieved by DSGT, and Safe Dec-PG converges faster than DSGT as well. From a statistical perspective, the long-term cumulative rewards in the constraints could be interpreted as prior knowledge accounted for in the MDP. From Figure 1(c), we can see that the objective rewards achieved by Safe Dec-PG and DSGT are similar, implying that the added constraints do not degrade the objective rewards.

Concluding Remarks

In this work, we have proposed the first algorithm that is able to solve multi-agent CMDP problems in which cumulative rewards appear in both the loss function and the constraints. By leveraging the primal-dual optimization framework, the proposed Safe Dec-PG maximizes the averaged network long-term cumulative rewards while accounting for the safety-related constraints. Theoretically, we provide the first convergence rate guarantees for a decentralized stochastic gradient descent ascent method to an ε-FOSP of a class of non-convex min-max problems at a rate of O(1/ε^4). Numerical results show that the constraint rewards obtained by Safe Dec-PG are indeed much higher than in the case where the safety consideration is not incorporated, with no loss in either convergence rate or final objective rewards.

Acknowledgement

The research of K.Z. and T.B. was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710, and in part by the Air Force Office of Scientific Research (AFOSR) Grant FA9550-19-1-0353. The work of T. Chen was supported by the RPI-IBM Artificial Intelligence Research Collaboration (AIRC).

Ethical Impact

Our main contributions are the theoretical results for solving a non-convex min-max optimization problem over a graph/network. The algorithm design and convergence analysis are both new. Although Safe Dec-PG is developed in this paper for dealing with a constrained Markov decision process related problem, it can also be applied to solve other min-max optimization problems. Our theoretical analysis includes multiple new proof techniques that can be used for the convergence analysis of other algorithms. This work would be beneficial for students, scientists, and professors who are conducting research in the areas of reinforcement learning, optimization, data science, finance, etc. We have not found any negative impact of this work on either ethical aspects or future societal consequences.

References

Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In Proc. of International Conference on Machine Learning, 22–31.
Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2020. Optimality and approximation with policy gradient methods in Markov decision processes. In Proc. of Conference on Learning Theory, 64–66.
Altman, E. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.
Baxter, J.; and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15: 319–350.
Borkar, V. S. 2005. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters 54(3): 207–213.
Boyan, J. A.; and Littman, M. L. 1994. Packet routing in dynamically changing networks: A reinforcement learning approach. In Proc. Advances in Neural Information Processing Systems, 671–678. Denver, CO.
Boyd, S.; Diaconis, P.; and Xiao, L. 2004. Fastest mixing Markov chain on a graph. SIAM Review 46(4): 667–689.
Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.
Chang, T.-H.; Hong, M.; Wai, H.-T.; Zhang, X.; and Lu, S. 2020. Distributed learning in the non-convex world: From batch to streaming data, and beyond. IEEE Signal Processing Magazine 37(3): 26–38.
Chen, T.; Zhang, K.; Giannakis, G. B.; and Başar, T. 2018. Communication-efficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239.
Claus, C.; and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proc. of the Assoc. for the Advanc. of Artificial Intell., 746–752. Orlando, FL.
Corke, P.; Peterson, R.; and Rus, D. 2005. Networked robots: Flying robot navigation using a sensor net. Robotics Research 234–243.
Dall'Anese, E.; Zhu, H.; and Giannakis, G. B. 2013. Distributed optimal power flow for smart microgrids. IEEE Transactions on Smart Grid 4(3): 1464–1475.
Di Lorenzo, P.; and Scutari, G. 2016. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks 2(2): 120–136.
Doan, T.; Maguluri, S.; and Romberg, J. 2019. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In Proc. of International Conference on Machine Learning, 1626–1635.
Fax, J. A.; and Murray, R. M. 2004. Information flow and cooperative control of vehicle formations. IEEE Transactions on Automatic Control 49(9): 1465–1476.
Gupta, J. K.; Egorov, M.; and Kochenderfer, M. 2017. Cooperative multi-agent control using deep reinforcement learning. In Intl. Conf. Auto. Agents and Multi-agent Systems, 66–83.
Kar, S.; Moura, J. M.; and Poor, H. V. 2013. QD-Learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Trans. Sig. Proc. 61(7): 1848–1862.
Lauer, M.; and Riedmiller, M. 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proc. Intl. Conf. Machine Learning. Stanford, CA.
Lee, D.; He, N.; Kamalaruban, P.; and Cevher, V. 2020. Optimization for reinforcement learning: From single agent to cooperative agents. IEEE Signal Processing Magazine 37(3): 123–135.
Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proc. of Advances in Neural Information Processing Systems, 5330–5340.
Lin, T.; Jin, C.; and Jordan, M. 2020. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 6083–6093. PMLR.
Liu, B.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural trust region/proximal policy optimization attains globally optimal policy. In Proc. of Advances in Neural Information Processing Systems, 10564–10575.
Liu, M.; Zhang, W.; Mroueh, Y.; Cui, X.; Ross, J.; Yang, T.; and Das, P. 2020a. A decentralized parallel algorithm for training generative adversarial nets. In Proc. of Advances in Neural Information Processing Systems.
Liu, W.; Mokhtari, A.; Ozdaglar, A.; Pattathil, S.; Shen, Z.; and Zheng, N. 2020b. A decentralized proximal point-type method for saddle point problems. arXiv preprint arXiv:1910.14380.
Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proc. Advances in Neural Information Processing Systems. Long Beach, CA.
Lu, S.; Tsaknakis, I.; Hong, M.; and Chen, Y. 2020. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. IEEE Transactions on Signal Processing 68: 3676–3691.
Lu, S.; and Wu, C. W. 2020. Decentralized stochastic non-convex optimization over weakly connected time-varying digraphs. In Proc. of International Conference on Acoustics, Speech and Signal Processing, 5770–5774.
Lu, S.; Zhang, X.; Sun, H.; and Hong, M. 2019. GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization. In Proc. of IEEE Data Science Workshop, 315–321.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529.
Nedic, A.; Ozdaglar, A.; and Parrilo, P. A. 2010. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control 55(4): 922–938.
Nguyen, D. T.; Yeoh, W.; Lau, H. C.; Zilberstein, S.; and Zhang, C. 2014. Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In Proc. of the Assoc. for the Advanc. of Artificial Intell.
Nouiehed, M.; Lee, J. D.; and Razaviyayn, M. 2018. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024.
Nouiehed, M.; Sanjabi, M.; Huang, T.; Lee, J. D.; and Razaviyayn, M. 2019. Solving a class of non-convex min-max games using iterative first order methods. In Proc. of Advances in Neural Information Processing Systems, 14905–14916.
Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; and Vian, J. 2017. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proc. Intl. Conf. Machine Learning, 2681–2690. Sydney, Australia.
Pang, J.-S.; and Scutari, G. 2011. Nonconvex games with side constraints. SIAM Journal on Optimization 21(4): 1491–1522.
Papini, M.; Binaghi, D.; Canonaco, G.; Pirotta, M.; and Restelli, M. 2018. Stochastic variance-reduced policy gradient. In Proc. of International Conference on Machine Learning, 4026–4035.
Paternain, S.; Calvo-Fullana, M.; Chamon, L. F.; and Ribeiro, A. 2019a. Safe policies for reinforcement learning via primal-dual methods. arXiv preprint arXiv:1911.09101.
Paternain, S.; Chamon, L.; Calvo-Fullana, M.; and Ribeiro, A. 2019b. Constrained reinforcement learning has zero duality gap. In Proc. of Advances in Neural Information Processing Systems, 7553–7563.
Prashanth, L.; and Ghavamzadeh, M. 2016. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning 105(3): 367–417.
Qu, C.; Mannor, S.; Xu, H.; Qi, Y.; Song, L.; and Xiong, J. 2019. Value propagation for decentralized networked deep multi-agent reinforcement learning. In Proc. of Advances in Neural Information Processing Systems, 1182–1191.
Rafique, H.; Liu, M.; Lin, Q.; and Yang, T. 2018. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060.
Razaviyayn, M.; Huang, T.; Lu, S.; Nouiehed, M.; Sanjabi, M.; and Hong, M. 2020. Nonconvex min-max optimization: Applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine 37(5): 55–66.
Schneider, J.; Wong, W.-K.; Moore, A.; and Riedmiller, M. 1999. Distributed value functions. In Proc. Intl. Conf. Machine Learning, 371–378. Bled, Slovenia.
Sun, Y.; Daneshmand, A.; and Scutari, G. 2019. Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction.
Tang, H.; Lian, X.; Yan, M.; Zhang, C.; and Liu, J. 2018. D2: Decentralized training over decentralized data. In Proc. of International Conference on Machine Learning, 4848–4856.
Wachi, A.; and Sui, Y. 2020. Safe reinforcement learning in constrained Markov decision processes. In Proc. of International Conference on Machine Learning.
Wai, H.-T.; Yang, Z.; Wang, Z.; and Hong, M. 2018. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Proc. of Advances in Neural Information Processing Systems, 9649–9660.
Wolpert, D. H.; Wheeler, K. R.; and Tumer, K. 1999. General principles of learning-based multi-agent systems. In Proc. of the Annual Conf. on Autonomous Agents, 77–83. Seattle, WA.
Xiao, L.; and Boyd, S. 2004. Fast linear iterations for distributed averaging. Systems & Control Letters 53(1): 65–78.
Xu, P.; Gao, F.; and Gu, Q. 2020. Sample efficient policy gradient methods with recursive variance reduction. In Proc. of International Conference on Learning Representations.
Yu, M.; Yang, Z.; Kolar, M.; and Wang, Z. 2019. Convergent policy optimization for safe reinforcement learning. In Proc. of Advances in Neural Information Processing Systems, 3121–3133.
Zhang, K.; Koppel, A.; Zhu, H.; and Başar, T. 2020. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization 58(6): 3586–3612.
Zhang, K.; Yang, Z.; and Başar, T. 2019. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635.
Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; and Başar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In Proc. of International Conference on Machine Learning, 5872–5881.