Gradient Play in Multi-Agent Markov Stochastic Games: Stationary Points and Convergence
Runyu Zhang, Zhaolin Ren, and Na Li
School of Engineering and Applied Science, Harvard University
[email protected], [email protected], [email protected]

Abstract

We study the performance of the gradient play algorithm for multi-agent tabular Markov decision processes (MDPs), also known as stochastic games (SGs), in which each agent seeks to maximize its own total discounted reward by making decisions independently based on the current state, which is shared among agents. Policies are directly parameterized by the probability of choosing a given action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and we give a non-asymptotic global convergence rate to an $\epsilon$-NE for a subclass of multi-agent MDPs called Markov potential games, which includes the cooperative setting with identical rewards among agents as an important special case. Our result shows that the number of iterations needed to reach an $\epsilon$-NE scales linearly, rather than exponentially, with the number of agents. Local geometry and local stability are also considered: for Markov potential games, we prove that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points; we also give a local convergence rate around strict NEs for more general settings.

1 Introduction

The past decade has witnessed significant development in reinforcement learning (RL), which has achieved success in various tasks such as playing Go and video games. It is natural to extend RL techniques to real-life societal systems such as traffic control, autonomous driving, buildings, and energy systems. Since most such large-scale infrastructures are multi-agent in nature, multi-agent reinforcement learning (MARL) has gained increasing attention in recent years [Daneshfar and Bevrani, 2010, Shalev-Shwartz et al., 2016, Vidhate and Kulkarni, 2017, Xu et al., 2020]. Among RL algorithms, policy gradient methods are particularly attractive due to their flexibility and their capability to incorporate structured state and action spaces. This makes them appealing for multi-agent learning, where agents typically need to update their policies through interactions with other agents, either collaboratively or competitively. For instance, many recent works [Zhang et al., 2018, Chen et al., 2018, Wai et al., 2018, Li et al., 2019, Qu et al., 2020] have studied the convergence rate and sample complexity of gradient-based methods for collaborative multi-agent RL problems. In these problems, agents seek to maximize a global reward function collaboratively while the agents' policies and the learning procedure are subject to information constraints, i.e., each agent can only choose its own local actions based on the information it observes. However, due to a limited understanding of the optimization landscape of these multi-agent learning problems, most such works can only show convergence to a first-order stationary point. A deeper understanding of the quality of these stationary points is missing even in the simple identical-reward multi-agent RL setting. In contrast, there has been exciting recent theoretical progress on the analysis of the optimization landscape in centralized single-agent RL settings.
Recent works have shown that the landscape of single-agent policy optimization enjoys the gradient domination property in both linear control [Fazel et al., 2018] and Markov decision processes (MDPs) [Agarwal et al., 2020], which guarantees that gradient descent/ascent finds the global optimum despite the nonconvex landscape. Motivated by this theoretical progress in single-agent RL, we study the landscape of multi-agent RL problems to see whether similar results hold. In this paper, we center our study on the multi-agent tabular MDP problem. Apart from the identical-reward case mentioned earlier, we also make an attempt to generalize our analysis to game settings where rewards may vary among agents.

The multi-agent tabular MDP problem is also known as a stochastic game (SG) in the field of game theory. The study of SGs dates back to as early as the 1950s with Shapley [1953], where the notion of SGs as well as the existence of Nash equilibria (NEs) were first established. A series of works has since been developed on designing NE-finding algorithms, especially in the RL setting (e.g., [Littman, 1994, Bowling and Veloso, 2000, Shoham et al., 2003, Buşoniu et al., 2010, Lanctot et al., 2017, Zhang et al., 2019a] and citations therein). While well-known classical algorithms for solving SGs are mostly value-based, such as Nash-Q learning [Hu and Wellman, 2003], Hyper-Q learning [Tesauro, 2003], and WoLF-PHC [Bowling and Veloso, 2001], gradient-based algorithms have also started to gain popularity in recent years due to the advantages mentioned earlier (e.g., [Abdallah and Lesser, 2008, Foerster et al., 2017, Zhang and Lesser, 2010]). In this work, our aim is to gain a deeper understanding of the structure and quality of first-order stationary points for these gradient-based methods. Specifically, taking a game-theoretic perspective, we strive to shed light on the following questions: 1) How do the first-order stationary points relate to the NEs of the underlying game? 2) Do gradient-based algorithms guarantee convergence to an NE? 3) What is the stability of individual NEs? For simpler finite-action static (stateless) game settings, these questions have already been widely discussed [Shapley, 1964, Crawford, 1985, Jordan, 1993, Krishna and Sjöström, 1998, Shamma and Arslan, 2005, Kohlberg and Mertens, 1986, Van Damme, 1991]. For static continuous games, a recent paper [Mazumdar et al., 2020] in fact proved a negative result showing that gradient flow has stationary points (even local maxima) that are not necessarily NEs. Conversely, Zhang et al. [2019b] designed projected nested-gradient methods that provably converge to NEs in zero-sum linear quadratic games with continuous state-action spaces, linear dynamics, and quadratic rewards. However, much less is known in the setting of SGs with finite state-action spaces and general Markov transition probabilities.

Contributions. In this paper, we consider the gradient play algorithm for the infinite-horizon discounted-reward SG, where each agent's local policy is directly parameterized by the probability of choosing an action from the agent's own action space at a given state. We first show that first-order stationary policies and Nash equilibria are equivalent for these directly parameterized local policies. We derive this by generalizing the gradient domination property of [Agarwal et al., 2020] to the multi-agent setting. Our result does not contradict Mazumdar et al.
[2020], which constructs examples of stable first-order stationary points that are non-NEs: their counterexamples consider general continuous games whose utility functions may not satisfy gradient domination. Additionally, we provide a global convergence rate analysis for a special class of SGs called Markov potential games [González-Sánchez and Hernández-Lerma, 2013, Macua et al., 2018], which includes identical-reward multi-agent RL [Tan, 1993, Claus and Boutilier, 1998, Panait and Luke, 2005] as an important special case. We show that gradient play (equivalent to projected gradient ascent for Markov potential games) reaches an $\epsilon$-Nash equilibrium within $O\big(\frac{|\mathcal{S}|\sum_i |\mathcal{A}_i|}{\epsilon^2}\big)$ steps, where $|\mathcal{S}|$ and $|\mathcal{A}_i|$ denote the size of the state space and of the action space of agent $i$, respectively. This rate shows that the number of iterations needed to reach an $\epsilon$-NE scales linearly with the number of agents, instead of exponentially with rate $O\big(\frac{|\mathcal{S}|\prod_i |\mathcal{A}_i|}{\epsilon^2}\big)$ as in the work of Agarwal et al. [2020]. Although the convergence of various learning algorithms to NEs in static potential games has been very well studied [Monderer and Shapley, 1996a,b, Monderer and Sela, 1997, Marden et al., 2009], to the best of our knowledge, our result provides the first non-asymptotic rate of convergence to an NE in stochastic games under gradient play. We also study the local geometry around specific types of equilibrium points. For Markov potential games, we show that strict NEs are local maxima of the total potential function and that fully mixed NEs are saddle points. For general multi-agent MDPs, we show that strict NEs are locally stable under gradient play and provide a local convergence rate analysis.

2 Problem setting and preliminaries

An $n$-agent (tabular) Markov decision process (MDP), also called a stochastic game (SG) [Shapley, 1953],
$$\mathcal{M} = (\mathcal{S}, \; \mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_n, \; P, \; r = (r_1, r_2, \dots, r_n), \; \gamma, \; \rho) \qquad (1)$$
is specified by: a finite state space $\mathcal{S}$; a finite action space $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_n$, where $\mathcal{A}_i$ is the action space of agent $i$; a transition model $P$, where $P(s' \mid s, a) = P(s' \mid s, a_1, a_2, \dots, a_n)$ is the probability of transitioning into state $s'$ upon taking the joint action $a := (a_1, \dots, a_n)$ (each agent $i$ taking action $a_i$) in state $s$; the $i$-th agent's reward function $r_i : \mathcal{S} \times \mathcal{A} \to [0, 1]$, $i = 1, 2, \dots, n$, where $r_i(s, a)$ is the immediate reward of agent $i$ for taking action $a$ in state $s$; a discount factor $\gamma \in [0, 1)$; and an initial state distribution $\rho$ over $\mathcal{S}$.

A stochastic policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ (where $\Delta(\mathcal{A})$ is the probability simplex over $\mathcal{A}$) specifies a decision-making strategy in which the agents choose their actions jointly based on the current state in a stochastic fashion, i.e., $\Pr(a_t \mid s_t) = \pi(a_t \mid s_t)$. A distributed stochastic policy is a special subclass of stochastic policies with $\pi = \pi_1 \times \pi_2 \times \cdots \times \pi_n$, where $\pi_i : \mathcal{S} \to \Delta(\mathcal{A}_i)$.
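To make the objects above concrete, the following is a minimal numerical sketch (not taken from the paper) of a two-agent tabular SG with directly parameterized distributed policies, together with one synchronous gradient-play step in which each agent performs projected gradient ascent on its own discounted return. All names (gradient_play_step, simplex_proj, etc.) are illustrative choices, and the exact-gradient computation assumes full knowledge of the model $(P, r, \gamma, \rho)$.

```python
import numpy as np

def simplex_proj(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[idx] - 1.0) / (idx + 1.0)
    return np.maximum(v - theta, 0.0)

def value_and_q(P, r_i, gamma, pi):
    """Exact V_i and Q_i of agent i under the joint policy pi(a1, a2 | s)."""
    S = pi.shape[0]
    P_pi = np.einsum('sab,sabt->st', pi, P)          # state transitions under pi
    r_pi = np.einsum('sab,sab->s', pi, r_i)          # expected one-step reward
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r_i + gamma * np.einsum('sabt,t->sab', P, V)
    return V, Q

def gradient_play_step(policies, P, rewards, gamma, rho, eta):
    """One synchronous gradient-play update for two agents: each agent ascends
    its own discounted return and projects every state's action distribution
    back onto the simplex (direct parameterization)."""
    pi1, pi2 = policies
    pi = np.einsum('sa,sb->sab', pi1, pi2)           # product (distributed) policy
    S = pi.shape[0]
    P_pi = np.einsum('sab,sabt->st', pi, P)
    # Unnormalized discounted state visitation d(s) = sum_t gamma^t Pr(s_t = s).
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    new_policies = []
    for i, pi_i in enumerate(policies):
        _, Q = value_and_q(P, rewards[i], gamma, pi)
        # Marginalize Q_i over the other agent's policy to obtain Q_i(s, a_i).
        Q_bar = (np.einsum('sab,sb->sa', Q, pi2) if i == 0
                 else np.einsum('sab,sa->sb', Q, pi1))
        grad = d[:, None] * Q_bar                    # exact dV_i(rho) / dpi_i(a_i | s)
        stepped = pi_i + eta * grad
        new_policies.append(np.vstack([simplex_proj(row) for row in stepped]))
    return new_policies

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A1, A2, gamma, eta = 3, 2, 2, 0.9, 0.1
    P = rng.random((S, A1, A2, S))
    P /= P.sum(axis=-1, keepdims=True)               # P(s' | s, a1, a2)
    rewards = [rng.random((S, A1, A2)) for _ in range(2)]      # r_i(s, a1, a2) in [0, 1]
    rho = np.ones(S) / S
    policies = [np.ones((S, A1)) / A1, np.ones((S, A2)) / A2]  # uniform initialization
    for _ in range(200):
        policies = gradient_play_step(policies, P, rewards, gamma, rho, eta)
```

In the identical-reward special case (rewards[0] equal to rewards[1]), this update reduces to distributed projected gradient ascent on the shared return, in line with the Markov potential game discussion above; for heterogeneous rewards it is one instance of the gradient play dynamics whose stationary points and stability the paper analyzes.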