An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective

Yaodong Yang∗1,2 and Jun Wang1,2

1University College London, 2Huawei R&D U.K.

Abstract

Following the remarkable success of the AlphaGo series, 2019 was a booming year that witnessed significant advances in multi-agent reinforcement learning (MARL) techniques. MARL corresponds to the learning problem in a multi-agent system in which multiple agents learn simultaneously. It is an interdisciplinary domain with a long history that includes game theory, machine learning, control, psychology, and optimisation. Although MARL has achieved considerable empirical success in solving real-world games, there is a lack of a self-contained overview in the literature that elaborates the game theoretical foundations of modern MARL methods and summarises the recent advances. In fact, the majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments in the research frontier.

Our work is separated into two parts. From §1 to §4, we present the self-contained fundamental knowledge of MARL, including problem formulations, basic solutions, and existing challenges. Specifically, we present the MARL formulations through two representative frameworks, namely, stochastic games and extensive-form games, along with different variations of games that can be addressed. The goal of this part is to enable the readers, even those with minimal related background, to grasp the key ideas in MARL research. From §5 to §9, we present an overview of recent developments of MARL algorithms. Starting from new taxonomies for MARL methods, we conduct a survey of previous survey papers. In later sections, we highlight several modern topics in MARL research, including Q-function factorisation, multi-agent soft learning, networked multi-agent MDP, stochastic potential games, zero-sum continuous games, online MDP, turn-based stochastic games, policy space response oracle, approximation methods in general-sum games, and mean-field type learning in games with infinite agents. Within each topic, we select both the most fundamental and cutting-edge algorithms.

The goal of our monograph is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.

∗This manuscript is under active development. We appreciate any constructive comments and suggestions, which can be sent to: .

Contents

1 Introduction
  1.1 A Short History of RL
  1.2 2019: A Booming Year for MARL

2 Single-Agent RL
  2.1 Problem Formulation: Markov Decision Process
  2.2 Justification of Reward Maximisation
  2.3 Solving Markov Decision Processes
    2.3.1 Value-Based Methods
    2.3.2 Policy-Based Methods

3 Multi-Agent RL
  3.1 Problem Formulation: Stochastic Game
  3.2 Solving Stochastic Games
    3.2.1 Value-Based MARL Methods
    3.2.2 Policy-Based MARL Methods
    3.2.3 Solution Concept of the Nash Equilibrium
    3.2.4 Special Types of Stochastic Games
    3.2.5 Partially Observable Settings
  3.3 Problem Formulation: Extensive-Form Game
    3.3.1 Normal-Form Representation
    3.3.2 Sequence-Form Representation
  3.4 Solving Extensive-Form Games
    3.4.1 Perfect-Information Games
    3.4.2 Imperfect-Information Games

4 Grand Challenges of MARL
  4.1 The Combinatorial Complexity
  4.2 The Multi-Dimensional Learning Objectives
  4.3 The Non-Stationarity Issue
  4.4 The Scalability Issue when N ≫ 2

5 A Survey of MARL Surveys
  5.1 Taxonomy of MARL Algorithms
  5.2 A Survey of Surveys

6 Learning in Identical-Interest Games
  6.1 Stochastic Team Games
    6.1.1 Solutions via Q-function Factorisation
    6.1.2 Solutions via Multi-Agent Soft Learning
  6.2 Dec-POMDPs
  6.3 Networked Multi-Agent MDPs
  6.4 Stochastic Potential Games

7 Learning in Zero-Sum Games
  7.1 Discrete State-Action Games
  7.2 Continuous State-Action Games
  7.3 Extensive-Form Games
    7.3.1 Variations of Fictitious Play
    7.3.2 Counterfactual Regret Minimisation
  7.4 Online Markov Decision Processes
  7.5 Turn-Based Stochastic Games
  7.6 Open-Ended Meta-Games

8 Learning in General-Sum Games
  8.1 Solutions by Mathematical Programming
  8.2 Solutions by Value-Based Methods
  8.3 Solutions by Two-Timescale Analysis
  8.4 Solutions by Policy-Based Methods

9 Learning in Games when N → +∞
  9.1 Non-cooperative Mean-Field Game
  9.2 Cooperative Mean-Field Control
  9.3 Mean-Field MARL

10 Future Directions of Interest

Bibliography

1 Introduction

Machine learning can be considered as the process of converting data into knowledge (Shalev-Shwartz and Ben-David, 2014). The input of a learning algorithm is training data (for example, images containing cats), and the output is some knowledge (for example, rules about how to detect cats in an image). This knowledge is usually represented as a computer program that can perform certain task(s) (for example, an automatic cat detector). In the past decade, considerable progress has been made by means of a special kind of machine learning technique: deep learning (LeCun et al., 2015). One of the critical embodiments of deep learning is different kinds of deep neural networks (DNNs) (Schmidhuber, 2015) that can find disentangled representations (Bengio, 2009) in high-dimensional data, which allows the software to train itself to perform new tasks rather than merely relying on the programmer to design hand-crafted rules. Numerous breakthroughs in real-world AI applications have been achieved through the use of DNNs, with the domains of computer vision (Krizhevsky et al., 2012) and natural language processing (Brown et al., 2020; Devlin et al., 2018) being the greatest beneficiaries.

In addition to feature recognition from existing data, modern AI applications often require computer programs to make decisions based on acquired knowledge (see Figure 1). To illustrate the key components of decision making, let us consider the real-world example of controlling a car to drive safely through an intersection. At each time step, a robot car can move by steering, accelerating and braking. The goal is to safely exit the intersection and reach the destination (with possible decisions of going straight or turning left/right into another lane). Therefore, in addition to being able to detect objects, such as traffic lights, lane markings, and other cars (by converting data to knowledge), we aim to find a steering policy that can control the car to make a sequence of manoeuvres to achieve the goal (making decisions based on the knowledge gained). In a decision-making setting such as this, two additional challenges arise:

1. First, during the decision-making process, at each time step, the robot car should consider not only the immediate value of its current action but also the consequences of its current action in the future. For example, in the case of driving through an intersection, it would be detrimental to have a policy that chooses to steer in a “safe” direction at the beginning of the process if it would eventually lead to a car crash later on.

Figure 1: Modern AI applications are being transformed from pure feature recognition (for example, detecting a cat in an image) to decision making (driving through a traffic intersection safely), where interaction among multiple agents inevitably occurs. As a result, each agent has to behave strategically. Furthermore, the problem becomes more challenging because current decisions influence future outcomes.

2. Second, to make each decision correctly and safely, the car must also consider other cars’ behaviour and act accordingly. Human drivers, for example, often predict in advance other cars’ movements and then take strategic moves in response (like giving way to an oncoming car or accelerating to merge into another lane).

The need for an adaptive decision-making framework, together with the complexity of addressing multiple interacting learners, has led to the development of multi-agent RL. Multi-agent RL tackles the sequential decision-making problem of having multiple intelligent agents that operate in a shared stochastic environment, each of which aims to maximise its long-term reward through interacting with the environment and other agents. Multi-agent RL is built on the knowledge of both multi-agent systems (MAS) and RL. In the next section, we provide a brief overview of (single-agent) RL and the research developments in recent decades.

1.1 A Short History of RL

RL is a sub-field of machine learning, where agents learn how to behave optimally based on a trial-and-error procedure during their interaction with the environment. Unlike

supervised learning, which takes labelled data as the input (for example, an image labelled with cats), RL is goal-oriented: it constructs a learning model that learns to achieve the optimal long-term goal by improvement through trial and error, with the learner having no labelled data to obtain knowledge from. The word “reinforcement” refers to the learning mechanism since the actions that lead to satisfactory outcomes are reinforced in the learner’s set of behaviours.

Historically, the RL mechanism was originally developed based on studying cats’ behaviour in a puzzle box (Thorndike, 1898). Minsky (1954) first proposed the computational model of RL in his Ph.D. thesis and named his resulting analog machine the stochastic neural-analog reinforcement calculator. Several years later, he first suggested the connection between dynamic programming (Bellman, 1952) and RL (Minsky, 1961). In 1972, Klopf (1972) integrated the trial-and-error learning process with the finding of temporal difference (TD) learning from psychology. TD learning quickly became indispensable in scaling RL for larger systems. On the basis of dynamic programming and TD learning, Watkins and Dayan (1992) laid the foundations for present-day RL by using the Markov decision process (MDP) formulation and proposing the famous Q-learning method as the solver.

As a dynamic programming method, the original Q-learning process inherits Bellman’s “curse of dimensionality” (Bellman, 1952), which strongly limits its applications when the number of state variables is large. To overcome such a bottleneck, Bertsekas and Tsitsiklis (1996) proposed approximate dynamic programming methods based on neural networks. More recently, Mnih et al. (2015) from DeepMind made a significant breakthrough by introducing the deep Q-learning (DQN) architecture, which leverages the representation power of DNNs for approximate dynamic programming methods. DQN has demonstrated human-level performance on 49 Atari games. Since then, deep RL techniques have become common in machine learning/AI and have attracted considerable attention from the research community.

RL originates from an understanding of animal behaviour, where animals use trial-and-error to reinforce beneficial behaviours, which they then perform more frequently. During its development, computational RL incorporated ideas such as optimal control theory and other findings from psychology that help mimic the way humans make decisions to maximise the long-term profit of decision-making tasks. As a result, RL methods can naturally be used to train a computer program (an agent) to a performance level comparable to that of a human on certain tasks. The earliest success of RL methods against

Figure 2: The success of the AlphaGo series marks the maturation of the single-agent decision-making process. The year 2019 was a booming year for MARL techniques; remarkable progress was achieved in solving immensely challenging multi-player real-time video games and multi-player incomplete-information poker games.

human players can be traced back to the game of backgammon (Tesauro, 1995). More recently, the advancement of applying RL to solve sequential decision-making problems was marked by the remarkable success of the AlphaGo series (Silver et al., 2016, 2018, 2017), a self-taught RL agent that beats top professional players of the game GO, a game whose search space (10^761 possible games) is even greater than the number of atoms in the universe1. In fact, the majority of successful RL applications, such as those for the game GO2, robotic control (Kober et al., 2013), and autonomous driving (Shalev-Shwartz et al., 2016), naturally involve the participation of multiple AI agents, which probe into the realm of MARL. As we would expect, the significant progress achieved by single-agent RL methods – marked by the 2016 success in GO – foreshadowed the breakthroughs of multi-agent RL techniques in the following years.

1.2 2019: A Booming Year for MARL

2019 was a booming year for MARL development as a series of breakthroughs were made in immensely challenging multi-agent tasks that people used to believe were impossible to solve via AI. Nevertheless, the progress made in the field of MARL, though remarkable, has been overshadowed to some extent by the prior success of AlphaGo (Chalmers, 2020). It is possible that the AlphaGo series (Silver et al., 2016, 2018, 2017) has largely fulfilled people’s expectations for the effectiveness of RL methods, such that there is a lack of interest in further advancements in the field. The ripples caused by the progress of MARL were relatively mild among the research community. In this section, we highlight several pieces of work that we believe are important and could profoundly impact the future development of MARL techniques.

One popular test-bed of MARL is StarCraft II (Vinyals et al., 2017), a multi-player real-time strategy computer game that has its own professional league. In this game, each player has only limited information about the game state, and the dimension of the search space is orders of magnitude larger than that of GO (10^26 possible choices for every move). The design of effective RL methods for StarCraft II was once believed to be a long-term challenge for AI (Vinyals et al., 2017). However, a breakthrough was accomplished by AlphaStar in 2019 (Vinyals et al., 2019a), which has exhibited grandmaster-level skills by ranking above 99.8% of human players.

Another prominent video game-based test-bed for MARL is Dota2, a zero-sum game played by two teams, each composed of five players. From each agent’s perspective, in addition to the difficulty of incomplete information (similar to StarCraft II), Dota2 is more challenging, in the sense that both cooperation among team members and competition against the opponents must be considered. The OpenAI Five AI system (Pachocki et al., 2018) demonstrated superhuman performance in Dota2 by defeating world champions in a public e-sports competition.

In addition to StarCraft II and Dota2, Jaderberg et al. (2019) and Baker et al. (2019a) showed human-level performance in capture-the-flag and hide-and-seek games, respectively. Although the games themselves are less sophisticated than either StarCraft II or Dota2, it is still non-trivial for AI agents to master their tactics, so the agents’ impressive performance again demonstrates the efficacy of MARL. Interestingly, both authors reported emergent behaviours induced by their proposed MARL methods that humans can understand and are grounded in physical theory.

1There are an estimated 10^82 atoms in the universe. If one had one trillion computers, each processing one trillion states per second for one trillion years, one could only reach 10^43 states.
2Arguably, AlphaGo can also be treated as a multi-agent technique if we consider the opponent in self-play as another agent.

Figure 3: Diagram of a single-agent MDP (left) and a multi-agent MDP (right).

One last remarkable achievement of MARL worth mentioning is its application to the poker game Texas hold’em, which is a multi-player extensive-form game with incomplete information accessible to the player. Heads-up (namely, two-player) no-limit hold’em has more than 6 × 10^161 information states. Only recently have ground-breaking achievements in the game been made, thanks to MARL. Two independent programs, DeepStack (Moravčík et al., 2017) and Libratus (Brown and Sandholm, 2018), are able to beat professional human players. Even more recently, Libratus was upgraded to Pluribus (Brown and Sandholm, 2019) and showed remarkable performance by winning over one million dollars from five elite human professionals in a no-limit setting.

For a deeper understanding of RL and MARL, mathematical notation and deconstruction of the concepts are needed. In the next section, we provide mathematical formulations for these concepts, starting from single-agent RL and progressing to multi-agent RL methods.

2 Single-Agent RL

Through trial and error, an RL agent attempts to find the optimal policy to maximise its long-term reward. This process is formulated by Markov Decision Processes.

2.1 Problem Formulation: Markov Decision Process

Definition 1 (Markov Decision Process) An MDP can be described by a tuple of key elements ⟨S, A, P, R, γ⟩:

• S: the set of environmental states.

• A: the set of agent’s possible actions.

• P : S × A → ∆(S): for each time step t ∈ N, given the agent’s action a ∈ A, the transition probability from a state s ∈ S to the state s′ ∈ S in the next time step.

• R : S × A × S → R: the reward function that returns a scalar value to the agent for a transition from s to s′ as a result of action a. The rewards have absolute values uniformly bounded by R_max.

• γ ∈ [0, 1] is the discount factor that represents the value of time.

At each time step t, the environment has a state s_t. The learning agent observes this state3 and executes an action a_t. The action makes the environment transition into the next state s_{t+1} ∼ P(·|s_t, a_t), and the environment returns an immediate reward R(s_t, a_t, s_{t+1}) to the agent. The reward function can also be written as R : S × A → R, which is interchangeable with R : S × A × S → R (see Van Otterlo and Wiering (2012), page 10).

The goal of the agent is to solve the MDP: to find the optimal policy that maximises the reward over time. Mathematically, one common objective is for the agent to find a Markovian (i.e., the input depends on only the current state) and stationary (i.e., the function form is time-independent) policy function4 π : S → ∆(A), with ∆(·) denoting the probability simplex, which can guide it to take sequential actions such that the discounted cumulative reward is maximised:

\[
\mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t)}\Big[\sum_{t\ge 0}\gamma^t R(s_t,a_t,s_{t+1}) \,\Big|\, a_t\sim\pi(\cdot\mid s_t),\, s_0\Big]. \tag{1}
\]

3The agent can only observe part of the full environment state. The partially observable setting is introduced in Definition 7 as a special case of Dec-POMDP.
4Such an optimal policy exists as long as the transition function and the reward function are both Markovian and stationary (Feinberg, 2010).

Another common mathematical objective of an MDP is to maximise the time-average reward:
\[
\lim_{T\to\infty}\mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t)}\Big[\frac{1}{T}\sum_{t=0}^{T-1} R(s_t,a_t,s_{t+1}) \,\Big|\, a_t\sim\pi(\cdot\mid s_t),\, s_0\Big], \tag{2}
\]
which we do not consider in this work; we refer to Mahadevan (1996) for a full analysis of the time-average reward objective. Based on the objective function of Eq. (1), under a given policy π, we can define the state-action function (namely, the Q-function, which determines the expected return from undertaking action a in state s) and the value function (which determines the return associated with the policy in state s) as:

\[
Q^\pi(s,a) = \mathbb{E}^\pi\Big[\sum_{t\ge 0}\gamma^t R(s_t,a_t,s_{t+1}) \,\Big|\, a_0=a,\, s_0=s\Big], \quad \forall s\in S,\ a\in A \tag{3}
\]
\[
V^\pi(s) = \mathbb{E}^\pi\Big[\sum_{t\ge 0}\gamma^t R(s_t,a_t,s_{t+1}) \,\Big|\, s_0=s\Big], \quad \forall s\in S \tag{4}
\]
where E^π is the expectation under the probability measure P^π over the set of infinitely long state-action trajectories τ = (s_0, a_0, s_1, a_1, ...), and where P^π is induced by the state transition probability P, the policy π, the initial state s, and the initial action a (in the case of the Q-function). The connection between the Q-function and the value function is V^π(s) = E_{a∼π(·|s)}[Q^π(s, a)] and Q^π(s, a) = E_{s′∼P(·|s,a)}[R(s, a, s′) + γV^π(s′)].
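To make Eqs. (3) and (4) concrete, the following is a minimal Python sketch (not part of the original text) that evaluates V^π and Q^π for a small, randomly generated MDP by iterating the Bellman expectation backup; the toy MDP, function names, and tolerances are illustrative assumptions only.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively compute V^pi and Q^pi for a finite MDP.

    P:  transition tensor, shape (S, A, S), P[s, a, s'] = P(s'|s, a)
    R:  reward tensor,     shape (S, A, S), R[s, a, s'] = R(s, a, s')
    pi: policy matrix,     shape (S, A),    pi[s, a]    = pi(a|s)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q^pi(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]
        Q = np.einsum("sat,sat->sa", P, R) + gamma * P @ V
        # V^pi(s) = E_{a~pi(.|s)}[Q^pi(s,a)]
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    R = rng.random((S, A, S))
    pi = np.full((S, A), 1.0 / A)          # uniform random policy
    V, Q = policy_evaluation(P, R, pi)
    print("V^pi:", np.round(V, 3))
```

The returned arrays satisfy the two connection identities above up to the iteration tolerance.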

2.2 Justification of Reward Maximisation

The current model for RL, as given by Eq. (1), suggests that the expected value of a single reward function is sufficient for any problem we want our “intelligent agents” to solve. The justification for this idea is deeply rooted in the von Neumann-Morgenstern (VNM) utility theory (Von Neumann and Morgenstern, 2007). This theory essentially proves that an agent is VNM-rational if and only if there exists a real-valued utility (or, reward) function such that every preference of the agent is characterised by maximising the single expected reward. The VNM utility theorem is the basis for the well-known expected utility theory (Schoemaker, 2013), which essentially states that rationality can be modelled as maximising an expected value. Specifically, the VNM utility theorem provides both necessary and sufficient conditions under which the expected utility hypothesis holds. In other words, rationality is equivalent to VNM-rationality, and it is safe to assume an

intelligent entity will always choose the action with the highest expected utility in any complex scenario. Admittedly, it has long been accepted that some of the assumptions on rationality could be violated by real decision-makers in practice (Gigerenzer and Selten, 2002). In fact, those conditions are rather taken as the “axioms” of rational decision making. In the case of the multi-objective MDP, we are still able to convert multiple objectives into a single-objective MDP with the help of a scalarisation function through a two-timescale process; we refer to Roijers et al. (2013) for more details.

2.3 Solving Markov Decision Processes

One commonly used notion in MDPs is the (discounted-normalised) occupancy measure µ^π(s, a), which uniquely corresponds to a given policy π and vice versa (Syed et al., 2008, Theorem 2), defined by

\[
\mu^\pi(s,a) = \mathbb{E}_{s_t\sim P,\, a_t\sim\pi}\Big[(1-\gamma)\sum_{t\ge 0}\gamma^t\,\mathbb{1}(s_t=s \wedge a_t=a)\Big] = (1-\gamma)\sum_{t\ge 0}\gamma^t\,\mathbb{P}^\pi(s_t=s, a_t=a), \tag{5}
\]
where 1(·) is an indicator function. Note that in Eq. (5), P is the state transition probability and P^π is the probability of specific state-action pairs when following the stationary policy π. The physical meaning of µ^π(s, a) is that of a probability measure that counts the expected discounted number of visits to the individual admissible state-action pairs. Correspondingly, µ^π(s) = Σ_a µ^π(s, a) is the discounted state visitation frequency, i.e., the stationary distribution of the Markov process induced by π. With the occupancy measure, we can write Eq. (4) as an inner product V^π(s) = (1/(1−γ)) ⟨µ^π(s, a), R(s, a)⟩. This implies that solving an MDP can be regarded as solving a linear program (LP) of max_µ ⟨µ(s, a), R(s, a)⟩, and the optimal policy is then

\[
\pi^*(a\mid s) = \mu^*(s,a)\,/\,\mu^*(s). \tag{6}
\]
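As an illustration of the LP view in Eq. (5) and the policy-recovery rule in Eq. (6), here is a hedged sketch that solves a tiny MDP with scipy's linprog; the Bellman-flow constraints, the toy problem, and all names are assumptions made for the example rather than anything prescribed by the text.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, d0, gamma=0.9):
    """Solve a finite MDP as the LP  max_mu <mu, R>  over occupancy measures.

    P: (S, A, S) transition tensor, R: (S, A) expected rewards, d0: (S,) initial distribution.
    Returns the optimal policy pi*(a|s) = mu*(s,a) / sum_a mu*(s,a)  (Eq. (6)).
    """
    S, A = R.shape
    # Bellman-flow constraints on the discounted-normalised occupancy measure:
    #   sum_a mu(s',a) - gamma * sum_{s,a} P(s'|s,a) mu(s,a) = (1-gamma) d0(s')
    A_eq = np.zeros((S, S * A))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = (s == s_next) - gamma * P[s, a, s_next]
    b_eq = (1.0 - gamma) * d0
    res = linprog(c=-R.reshape(-1), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (S * A), method="highs")
    mu = res.x.reshape(S, A)
    return mu / mu.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A = 3, 2
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    R = rng.random((S, A))
    pi_star = solve_mdp_lp(P, R, d0=np.full(S, 1.0 / S))
    print(pi_star)   # rows concentrate probability on the optimal actions
```

For γ < 1 and a strictly positive initial distribution, every state receives positive visitation mass, so the normalisation in Eq. (6) is well defined.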

However, this method for solving the MDP remains at a textbook level: it offers theoretical insights but is impractical for large-scale LPs with millions of variables (Papadimitriou and Tsitsiklis, 1987). When the state-action space of an MDP

is continuous, the LP formulation cannot help either. In the context of optimal control (Bertsekas, 2005), dynamic-programming approaches, such as policy iteration and value iteration, can also be applied to solve for the optimal policy that maximises Eq. (3) & Eq. (4), but these approaches require knowledge of the exact form of the model: the transition function P(·|s, a) and the reward function R(s, a, s′). On the other hand, in the setting of RL, the agent learns the optimal policy by a trial-and-error process during its interaction with the environment rather than using prior knowledge of the model. The word “learning” essentially means that the agent turns its experience gained during the interaction into knowledge about the model of the environment. Based on the solution target (either the optimal value function or the optimal policy), RL algorithms can be categorised into two types: value-based methods and policy-based methods.

2.3.1 Value-Based Methods

For all MDPs with finite states and actions, there exists at least one deterministic stationary optimal policy (Sutton and Barto, 1998; Szepesvári, 2010). Value-based methods are introduced to find the optimal Q-function Q^* that maximises Eq. (3). Correspondingly, the optimal policy can be derived from the Q-function by taking the greedy action π^* = arg max_a Q^*(s, a). The classic Q-learning algorithm (Watkins and Dayan, 1992) approximates Q^* by Q̂ and updates its value via temporal-difference learning (Sutton, 1988):

\[
\underbrace{\hat{Q}(s_t,a_t)}_{\text{new value}} \leftarrow \underbrace{\hat{Q}(s_t,a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}}\cdot\underbrace{\Big(\underbrace{R_t + \gamma\cdot\max_{a\in A}\hat{Q}(s_{t+1},a)}_{\text{temporal difference target}} - \underbrace{\hat{Q}(s_t,a_t)}_{\text{old value}}\Big)}_{\text{temporal difference error}} \tag{7}
\]

Theoretically, given the Bellman optimality operator H^*, defined by

\[
(H^*Q)(s,a) = \sum_{s'} P(s'\mid s,a)\Big[R(s,a,s') + \gamma\max_{b\in A} Q(s',b)\Big], \tag{8}
\]

we know it is a contraction mapping and the optimal Q-function is the unique5 fixed point, i.e., H^*(Q^*) = Q^*. The Q-learning algorithm draws random samples of (s, a, R, s′) in Eq. (7) to approximate Eq. (8), but is still guaranteed to converge to the optimal Q-function (Szepesvári and Littman, 1999) under the assumptions that the state-action sets are discrete and finite and are visited an infinite number of times. Munos and Szepesvári (2008) extended the convergence result to a more realistic setting by deriving the high-probability error bound for an infinite state space with a finite number of samples. Recently, Mnih et al. (2015) applied neural networks as a function approximator for the Q-function in updating Eq. (7). Specifically, DQN optimises the following equation:

\[
\min_{\theta}\ \mathbb{E}_{(s_t,a_t,R_t,s_{t+1})\sim\mathcal{D}}\Big[\Big(R_t + \gamma\max_{a\in A} Q_{\theta^-}(s_{t+1},a) - Q_{\theta}(s_t,a_t)\Big)^{2}\Big]. \tag{9}
\]

The neural network parameter θ is fitted by drawing i.i.d. samples from the replay buffer D and then updating in a supervised learning fashion. Q_{θ−} is a slowly updated target network that helps stabilise training. The convergence property and finite-sample analysis of DQN have been studied by Yang et al. (2019c).
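The snippet below is a minimal sketch of the tabular update in Eq. (7); the random MDP that plays the role of the environment, the ε-greedy exploration scheme, and all hyper-parameters are illustrative assumptions. DQN replaces the table with a neural network trained on the loss in Eq. (9), which is not reproduced here.

```python
import numpy as np

def q_learning(P, R, gamma=0.95, alpha=0.1, eps=0.1, episodes=2000, horizon=50, seed=0):
    """Tabular Q-learning (Eq. (7)) on a finite MDP, learning from sampled transitions."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(episodes):
        s = rng.integers(S)
        for _ in range(horizon):
            # epsilon-greedy behaviour policy
            a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next = rng.choice(S, p=P[s, a])
            r = R[s, a, s_next]
            # TD target and TD error, as annotated in Eq. (7)
            td_target = r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    S, A = 5, 3
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    R = rng.random((S, A, S))
    Q_hat = q_learning(P, R)
    print(Q_hat.argmax(axis=1))   # greedy policy: pi*(s) = argmax_a Q(s, a)
```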

2.3.2 Policy-Based Methods

Policy-based methods are designed to directly search over the policy space to find the optimal policy π^*. One can parameterise the policy as π^* ≈ π_θ(·|s) and update the parameter θ in the direction that maximises the cumulative reward, θ ← θ + α∇_θ V^{π_θ}(s), to find the optimal policy. However, the gradient will depend on the unknown effects of policy changes on the state distribution. The famous policy gradient (PG) theorem (Sutton et al., 2000) derives an analytical solution that does not involve the state distribution, that is:
\[
\nabla_\theta V^{\pi_\theta}(s) = \mathbb{E}_{s\sim\mu^{\pi_\theta}(\cdot),\, a\sim\pi_\theta(\cdot\mid s)}\Big[\nabla_\theta\log\pi_\theta(a\mid s)\cdot Q^{\pi_\theta}(s,a)\Big], \tag{10}
\]
where µ^{π_θ} is the state occupancy measure under policy π_θ and ∇_θ log π_θ(a|s) is the updating score of the policy. When the policy is deterministic and the action set is continuous, one obtains the deterministic policy gradient (DPG) theorem (Silver et al., 2014) as

\[
\nabla_\theta V^{\pi_\theta}(s) = \mathbb{E}_{s\sim\mu^{\pi_\theta}(\cdot)}\Big[\nabla_\theta\pi_\theta(a\mid s)\cdot\nabla_{a} Q^{\pi_\theta}(s,a)\big|_{a=\pi_\theta(s)}\Big]. \tag{11}
\]

5Note that although the optimal Q-function is unique, its corresponding optimal policies may have multiple candidates.

             Yield     Rush
  Yield     (0, 0)    (1, 2)
  Rush      (2, 1)    (0, 0)

Figure 4: A snapshot of the stochastic game at one time step in the intersection example (left panel: traffic intersection game scenario; right panel: normal-form game). The scenario is abstracted such that there are two cars, with each car taking one of two possible actions: to yield or to rush. The outcome of each joint action pair is represented by a normal-form game, with the reward value for the row player denoted in red and that for the column player denoted in black. The Nash equilibria (NE) of this game are (rush, yield) and (yield, rush). If both cars maximise their own reward selfishly without considering the others, they will end up in an accident.

A classic implementation of the PG theorem is REINFORCE (Williams, 1992), which uses a sample return R_t = Σ_{i=t}^{T} γ^{i−t} r_i to estimate Q^{π_θ}. Alternatively, one can use a model Q_ω (also called the critic) to approximate the true Q^{π_θ} and update the parameter ω via TD learning. This approach gives rise to the famous actor-critic methods (Konda and Tsitsiklis, 2000; Peters and Schaal, 2008). Important variants of actor-critic methods include trust-region methods (Schulman et al., 2015, 2017), PG with optimal baselines (Weaver and Tao, 2001; Zhao et al., 2011), soft actor-critic methods (Haarnoja et al., 2018), and deep deterministic policy gradient (DDPG) methods (Lillicrap et al., 2015).
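As a concrete (and deliberately simplified) instance of Eq. (10), the sketch below implements REINFORCE with a tabular softmax policy on a randomly generated MDP; the environment, parameterisation, and learning rate are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(P, R, gamma=0.95, lr=0.05, episodes=3000, horizon=20, seed=0):
    """REINFORCE: theta <- theta + lr * grad log pi(a_t|s_t) * R_t, per Eq. (10)."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    theta = np.zeros((S, A))                    # tabular softmax policy parameters
    for _ in range(episodes):
        s = rng.integers(S)
        traj = []
        for _ in range(horizon):                # roll out one episode
            probs = softmax(theta[s])
            a = rng.choice(A, p=probs)
            s_next = rng.choice(S, p=P[s, a])
            traj.append((s, a, R[s, a, s_next]))
            s = s_next
        G = 0.0
        for s, a, r in reversed(traj):          # R_t = sum_{i >= t} gamma^{i-t} r_i
            G = r + gamma * G
            probs = softmax(theta[s])
            grad_log = -probs                   # d log pi(a|s) / d theta[s, :]
            grad_log[a] += 1.0
            theta[s] += lr * G * grad_log       # score-function update
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    S, A = 4, 2
    P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
    R = rng.random((S, A, S))
    theta = reinforce(P, R)
    print(np.round(softmax(theta[0]), 2))       # learnt action distribution in state 0
```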

3 Multi-Agent RL

In the multi-agent scenario, much like in the single-agent scenario, each agent is still trying to solve the sequential decision-making problem through a trial-and-error procedure. The difference is that the evolution of the environmental state and the reward function that each agent receives are now determined by all agents’ joint actions (see Figure 3). As a result, agents need to take into account and interact with not only the environment but also other learning agents. A decision-making process that involves multiple agents is usually modelled through a stochastic game (Shapley, 1953), also known as a Markov game (Littman, 1994).

15 3.1 Problem Formulation: Stochastic Game

Definition 2 (Stochastic Game) A stochastic game can be regarded as a multi-player6 extension to the MDP in Definition 1. Therefore, it is also defined by a set of key elements ⟨N, S, {A^i}_{i∈{1,...,N}}, P, {R^i}_{i∈{1,...,N}}, γ⟩.

• N: the number of agents; N = 1 degenerates to a single-agent MDP, and N ≫ 2 is referred to as the many-agent case in this paper.

• S: the set of environmental states shared by all agents.

• A^i: the set of actions of agent i. We denote A := A^1 × ··· × A^N.

• P : S × A → ∆(S): for each time step t ∈ N, given the agents’ joint action a ∈ A, the transition probability from state s ∈ S to state s′ ∈ S in the next time step.

• R^i : S × A × S → R: the reward function that returns a scalar value to the i-th agent for a transition from (s, a) to s′. The rewards have absolute values uniformly bounded by R_max.

• γ ∈ [0, 1] is the discount factor that represents the value of time.

We use the superscript (·^i, ·^{−i}) (for example, a = (a^i, a^{−i})) when it is necessary to distinguish between agent i and all other N − 1 opponents.

Ultimately, the stochastic game (SG) acts as a framework that allows simultaneous moves from agents in a decision-making scenario7. The game can be described sequentially, as follows: at each time step t, the environment has a state s_t, and given s_t, each agent i executes its action a^i_t, simultaneously with all other agents. The joint action from all agents makes the environment transition into the next state s_{t+1} ∼ P(·|s_t, a_t); then, the environment determines an immediate reward R^i(s_t, a_t, s_{t+1}) for each agent. As seen in the single-agent MDP scenario, the goal of each agent i is to solve the SG. In other words, each agent aims to find a behavioural policy (or, a mixed strategy8 in game theory

6Player is a common word used in game theory; agent is more commonly used in machine learning. We do not discriminate between their usages in this work. The same holds for strategy vs policy and utility/payoff vs reward. Each pair refers to the game theory usage vs the machine learning usage.
7Extensive-form games allow agents to take sequential moves; the full description can be found in Shoham and Leyton-Brown (2008, Chapter 5).
8A behavioural policy refers to a function map from the history (s_0, a^i_0, s_1, a^i_1, ..., s_{t−1}) to an action. The policy is typically assumed to be Markovian such that it depends on only the current state s_t rather

terminology (Osborne and Rubinstein, 1994)), that is, π^i ∈ Π^i : S → ∆(A^i), that can guide the agent to take sequential actions such that the discounted cumulative reward9 in Eq. (12) is maximised. Here, ∆(·) is the probability simplex on a set. In game theory, π^i is also called a pure strategy (vs a mixed strategy) if ∆(·) is replaced by a Dirac measure.

\[
V^{\pi^i,\pi^{-i}}(s) = \mathbb{E}_{s_{t+1}\sim P(\cdot\mid s_t,a_t),\ a^{-i}_t\sim\pi^{-i}(\cdot\mid s_t)}\Big[\sum_{t\ge 0}\gamma^t R^i(s_t,a_t,s_{t+1}) \,\Big|\, a^i_t\sim\pi^i(\cdot\mid s_t),\, s_0\Big]. \tag{12}
\]

Comparison of Eq. (12) with Eq. (4) indicates that the optimal policy of each agent is influenced by not only its own policy but also the policies of the other agents in the game. This scenario leads to fundamental differences in the solution concept between single-agent RL and multi-agent RL.

3.2 Solving Stochastic Games

An SG can be considered as a sequence of normal-form games, which are games that can be represented in a matrix. Take the original intersection scenario as an example (see Figure 4). A snapshot of the SG at time t (stage game) can be represented as a normal-form game in a matrix format. The rows correspond to the action set A^1 for agent 1, and the columns correspond to the action set A^2 for agent 2. The values of the matrix are the rewards given for each of the joint action pairs. In this scenario, if both agents care only about maximising their own possible reward with no consideration of other agents (the solution concept in a single-agent RL problem) and choose the action to rush, they will reach the outcome of crashing into each other. Clearly, this state is unsafe and is thus sub-optimal for each agent, despite the fact that the possible reward was the highest for each agent when rushing. Therefore, to solve an SG and truly maximise the cumulative reward, each agent must take strategic actions with consideration of others when determining their policies.

Unfortunately, in contrast to MDPs, which have polynomial time-solvable linear-programming formulations, solving SGs usually involves applying Newton’s method for solving nonlinear programs. However, there are two special cases of two-player general-

than the entire history. A mixed strategy refers to a randomisation over pure strategies (for example, the actions). In SGs, the behavioural policy and mixed policy are exactly the same. In extensive-form games, they are different, but if the agent retains the history of previous actions and states (has perfect recall), each behavioural strategy has a realisation-equivalent mixed strategy, and vice versa (Kuhn, 1950a).
9Similar to single-agent MDP, we can adopt the objective of time-average rewards.

sum discounted-reward SGs that can still be written as LPs (Shoham and Leyton-Brown, 2008, Chapter 6.2)10. They are as follows:

• single-controller SG: the transition dynamics are determined by a single player, i.e., P(·|a, s) = P(·|a^i, s) if the i-th index in the vector a is a[i] = a^i, ∀s ∈ S, ∀a ∈ A.

• separable reward state independent transition (SR-SIT) SG: the states and the actions have independent effects on the reward function, and the transition function depends on only the joint actions, i.e., ∃ α : S → R, β : A → R such that these two conditions hold: 1) R^i(s, a) = α(s) + β(a), ∀i ∈ {1, ..., N}, ∀s ∈ S, ∀a ∈ A, and 2) P(·|s′, a) = P(·|s, a), ∀a ∈ A, ∀s, s′ ∈ S.

3.2.1 Value-Based MARL Methods

The single-agent Q-learning update in Eq. (7) still holds in the multi-agent case. In the t-th iteration, for each agent i, given the transition data {(s_t, a_t, R^i, s_{t+1})}_{t≥0} sampled from the replay buffer, it updates only the value of Q(s_t, a_t) and keeps the other entries of the Q-function unchanged. Specifically, we have

\[
Q^i(s_t,a_t) \leftarrow Q^i(s_t,a_t) + \alpha\cdot\Big(R^i + \gamma\cdot \mathrm{eval}^i\big[\{Q^i(s_{t+1},\cdot)\}_{i\in\{1,...,N\}}\big] - Q^i(s_t,a_t)\Big). \tag{13}
\]

Compared to Eq. (7), the max operator is changed to eval^i[{Q^i(s_{t+1}, ·)}_{i∈{1,...,N}}] in Eq. (13) to reflect the fact that each agent can no longer consider only itself but must evaluate the situation of the stage game at time step t + 1 by considering all agents’ interests, as represented by the set of their Q-functions. Then, the optimal policy can be solved by solve^i[{Q^i(s_{t+1}, ·)}_{i∈{1,...,N}}] = π^{i,*}. Therefore, we can further write the evaluation operator as

\[
\mathrm{eval}^i\big[\{Q^i(s_{t+1},\cdot)\}_{i\in\{1,...,N\}}\big] = V^i\Big(s_{t+1},\ \mathrm{solve}^i\big[\{Q^i(s_{t+1},\cdot)\}_{i\in\{1,...,N\}}\big]\Big). \tag{14}
\]

In summary, solve^i returns agent i’s part of the optimal policy at some equilibrium point (not necessarily corresponding to its largest possible reward), and eval^i gives agent i’s expected long-term reward under this equilibrium, assuming all other agents agree to

10According to Filar and Vrieze (2012) [Section 3.5], single-controller SG is solvable in polynomial time only under zero-sum cases rather than general-sum cases, which contradicts the result in Shoham and Leyton-Brown (2008) [Chapter 6.2], and we believe Shoham and Leyton-Brown (2008) made a typo.

18 play the same equilibrium.
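A hedged sketch of the update in Eq. (13) is given below; the eval^i operator is passed in as a function, here instantiated for the fully cooperative (team) case, where it reduces to a max over joint actions. A Nash-Q style stage-game solver (Section 3.2.3) could be plugged in instead. All data structures and names are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def team_eval(q_tables, s_next):
    """eval^i for a team game: agents share rewards, so evaluate the best joint action."""
    # q_tables[i][s, a_joint] stores agent i's Q-value for the flattened joint action a_joint.
    return [np.max(q[s_next]) for q in q_tables]

def marl_q_update(q_tables, s, a_joint, rewards, s_next, eval_op, alpha=0.1, gamma=0.95):
    """One step of the update in Eq. (13) for every agent, given joint transition data."""
    values = eval_op(q_tables, s_next)           # eval^i[{Q^i(s_{t+1}, .)}]
    for i, q in enumerate(q_tables):
        td_target = rewards[i] + gamma * values[i]
        q[s, a_joint] += alpha * (td_target - q[s, a_joint])

if __name__ == "__main__":
    n_agents, n_states, n_joint_actions = 2, 3, 4   # e.g. two agents with 2 actions each
    q_tables = [np.zeros((n_states, n_joint_actions)) for _ in range(n_agents)]
    # one fictitious transition: both agents receive the shared team reward
    marl_q_update(q_tables, s=0, a_joint=2, rewards=[1.0, 1.0], s_next=1, eval_op=team_eval)
    print(q_tables[0][0])
```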

3.2.2 Policy-Based MARL Methods

The value-based approach suffers from the curse of dimensionality due to the combinatorial nature of multi-agent systems (for further discussion, see Section 4.1). This characteristic necessitates the development of policy-based algorithms with function approximations. Specifically, each agent learns its own optimal policy π^i_{θ^i} : S → ∆(A^i) by updating the parameter θ^i of, for example, a neural network. Let θ = (θ^i)_{i∈{1,...,N}} represent the collection of policy parameters for all agents, and let π_θ := ∏_{i∈{1,...,N}} π^i_{θ^i}(a^i|s) be the joint policy. To optimise the parameter θ^i, the policy gradient theorem in Section 2.3.2 can be extended to the multi-agent context. Given agent i’s objective function J^i(θ) = E_{s∼P, a∼π_θ}[Σ_{t≥0} γ^t R^i_t], we have:
\[
\nabla_{\theta^i} J^i(\theta) = \mathbb{E}_{s\sim\mu^{\pi_\theta}(\cdot),\ a\sim\pi_\theta(\cdot\mid s)}\Big[\nabla_{\theta^i}\log\pi^i_{\theta^i}(a^i\mid s)\cdot Q^{i,\pi_\theta}(s,a)\Big]. \tag{15}
\]
Considering a continuous action set with a deterministic policy, we have the multi-agent deterministic policy gradient (MADDPG) (Lowe et al., 2017) written as

\[
\nabla_{\theta^i} J^i(\theta) = \mathbb{E}_{s\sim\mu^{\pi_\theta}(\cdot)}\Big[\nabla_{\theta^i}\pi^i_{\theta^i}(a^i\mid s)\cdot\nabla_{a^i} Q^{i,\pi_\theta}(s,a)\big|_{a=\pi_\theta(s)}\Big]. \tag{16}
\]

Note that in both Eqs. (15)&(16), the expectation over the joint policy πθ implies that other agents’ policies must be observed; this is often a strong assumption for many real-world applications.
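The following is a minimal Monte-Carlo sketch of the estimator in Eq. (15) for a single-state (one-shot) game with softmax policies, using the intersection game of Figure 4 as the payoff; in this degenerate case Q^{i,π_θ}(s, a) is just the immediate reward. Note that sampling the joint action requires knowing the other agent's policy, which is exactly the assumption discussed above. Names and sample sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def estimate_ma_policy_gradients(thetas, payoff, n_samples=5000, seed=0):
    """Monte-Carlo estimate of Eq. (15) for N softmax policies in a one-shot game.

    thetas: list of parameter vectors, one per agent (softmax logits over own actions)
    payoff: payoff[i][a_1, ..., a_N] is agent i's reward for the joint action
    """
    rng = np.random.default_rng(seed)
    n = len(thetas)
    grads = [np.zeros_like(th) for th in thetas]
    for _ in range(n_samples):
        probs = [softmax(th) for th in thetas]
        a = [rng.choice(len(p), p=p) for p in probs]   # sample the joint action a ~ pi_theta
        for i in range(n):
            score = -probs[i]                          # grad_{theta^i} log pi^i(a^i | s)
            score[a[i]] += 1.0
            grads[i] += score * payoff[i][tuple(a)]    # score times the sampled return
    return [g / n_samples for g in grads]

if __name__ == "__main__":
    # the intersection game of Figure 4: actions are (yield, rush)
    payoff = [np.array([[0, 1], [2, 0]]),              # row player's rewards
              np.array([[0, 2], [1, 0]])]              # column player's rewards
    thetas = [np.zeros(2), np.zeros(2)]
    g1, g2 = estimate_ma_policy_gradients(thetas, payoff)
    print(np.round(g1, 2), np.round(g2, 2))
```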

3.2.3 Solution Concept of the Nash Equilibrium

Game theory plays an essential role in multi-agent learning by offering so-called solution concepts that describe the outcomes of a game by showing which strategies will finally be adopted by players. Many types of solution concepts exist for MARL (see Section 4.2), among which the most famous is probably the Nash equilibrium (NE) in non-cooperative game theory (Nash, 1951). The word “non-cooperative” does not mean agents cannot collaborate or have to fight against each other all the time; it merely means that each agent maximises its own reward independently and that agents cannot group into coalitions to make collective decisions.

In a normal-form game, the NE characterises an equilibrium point of the joint strategy profile (π^{1,*}, ..., π^{N,*}), where each agent acts according to their best response to the others. The best response produces the optimal outcome for the player once all other players’ strategies have been considered. Player i’s best response11 to π^{−i} is a set of policies in which the following condition is satisfied.

\[
\pi^{i,*} \in \mathrm{Br}(\pi^{-i}) := \underset{\hat{\pi}^i\in\Delta(A^i)}{\arg\max}\ \Big\{\mathbb{E}_{\hat{\pi}^i,\pi^{-i}}\big[R^i(a^i, a^{-i})\big]\Big\}. \tag{17}
\]

NE states that if all players are perfectly rational, none of them will have a motivation to deviate from their best response π^{i,*} given that the others are playing π^{−i,*}. Note that NE is defined in terms of the best response, which relies on relative reward values, suggesting that the exact values of rewards are not required for identifying NE. In fact, NE is invariant under positive affine transformations of a player’s reward function. By applying Brouwer’s fixed point theorem, Nash (1951) proved that a mixed-strategy NE always exists for any game with a finite set of actions. In the example of driving through an intersection in Figure 4, the NE are (yield, rush) and (rush, yield). For an SG, one commonly used equilibrium is a stronger version of the NE, called the Markov perfect NE (Maskin and Tirole, 2001), which is defined by:

Definition 3 (Nash Equilibrium for Stochastic Game) A Markovian strategy profile π^* = (π^{i,*}, π^{−i,*}) is a Markov perfect NE of an SG defined in Definition 2 if the following condition holds:

\[
V^{\pi^{i,*},\,\pi^{-i,*}}(s) \ \ge\ V^{\pi^i,\,\pi^{-i,*}}(s), \quad \forall s\in S,\ \forall \pi^i\in\Pi^i,\ \forall i\in\{1,...,N\}. \tag{18}
\]

“Markovian” means the Nash policies are measurable with respect to a particular partition of possible histories (usually referring to the last state). The word “perfect” means that the equilibrium is also subgame-perfect (Selten, 1965) regardless of the starting state. Considering the sequential nature of SGs, these assumptions are necessary, while still maintaining generality. Hereafter, the Markov perfect NE will be referred to as NE. A mixed-strategy NE12 always exists for both discounted and average-reward13 SGs

11Best responses may not be unique; if a mixed-strategy best response exists, there must be at least one best response that is also a pure strategy.
12Note that this is different from a single-agent MDP, where a single, “pure” strategy optimal policy always exists. A simple example is the rock-paper-scissors game, where none of the pure strategies is the NE and the only NE is to mix between the three equally.
13Average-reward SGs entail more subtleties because the limit of Eq. (2) in the multi-agent setting

(Filar and Vrieze, 2012), though they may not be unique. In fact, checking for uniqueness is NP-hard (Conitzer and Sandholm, 2002). With the NE as the solution concept of optimality, we can re-write Eq. (14) as:

\[
\mathrm{eval}^i_{\mathrm{Nash}}\big[\{Q^i(s_{t+1},\cdot)\}_{i\in\{1,...,N\}}\big] = V^i\Big(s_{t+1},\ \mathrm{Nash}^i\big[\{Q^i(s_{t+1},\cdot)\}_{i\in\{1,...,N\}}\big]\Big). \tag{19}
\]

In the above equation, Nash^i(·) = π^{i,*} computes the NE of agent i’s strategy, and V^i(s, {Nash^i}_{i∈{1,...,N}}) is the expected payoff for agent i from state s onwards under this equilibrium. Eq. (19) and Eq. (13) form the learning steps of Nash Q-learning (Hu et al., 1998). This process essentially leads to the outcome of a learnt set of optimal policies that reach an NE for every stage game encountered. In the case when the NE is not unique, Nash-Q adopts hand-crafted rules for equilibrium selection (e.g., all players choose the first NE). Furthermore, similar to normal Q-learning, the Nash-Q operator defined in Eq. (20) is also proved to be a contraction mapping, and the stochastic updating rule provably converges to the NE for all states when the NE is unique:

\[
(H^{\mathrm{Nash}}Q)(s,a) = \sum_{s'} P(s'\mid s,a)\Big[R(s,a,s') + \gamma\cdot \mathrm{eval}^i_{\mathrm{Nash}}\big[\{Q^i(s',\cdot)\}_{i\in\{1,...,N\}}\big]\Big]. \tag{20}
\]

The process of finding an NE in a two-player general-sum game can be formulated as a linear complementarity problem (LCP), which can then be solved using the Lemke-Howson algorithm (Shapley, 1974). However, the exact solution for games with more than three players is unknown. In fact, the process of finding the NE is computationally demanding. Even in the case of two-player games, the complexity of solving the NE is PPAD-hard (polynomial parity arguments on directed graphs) (Chen and Deng, 2006; Daskalakis et al., 2009); therefore, in the worst-case scenario, the solution could take time that is exponential in relation to the game size. This complexity14 prohibits any brute-force or exhaustive search solutions unless P = NP (see Figure 5). As we would expect, the NE is much more difficult to solve for general SGs, where determining whether a pure-strategy NE exists is PSPACE-hard. Even if the SG has a finite time horizon, the calculation remains NP-hard (Conitzer and Sandholm, 2008). When it comes to approximation methods for ε-NE, the best known polynomially computable algorithm can achieve ε = 0.3393 on bimatrix games (Tsaknakis and Spirakis, 2007); its approach is to turn the problem of finding an NE into an optimisation problem that searches for a stationary point.

may be a cycle and thus not exist. Instead, NE are proved to exist on a special class of irreducible SGs, where every stage game can be reached regardless of the adopted policy.
14The class of NP-complete is not suitable to describe the complexity of solving the NE because the NE is proven to always exist (Nash, 1951), while a typical NP-complete problem – the travelling salesman problem (TSP), for example – searches for the solution to the question: “Given a distance matrix and a budget B, find a tour that is cheaper than B, or report that none exists (Daskalakis et al., 2009).”

Figure 5: The landscape of different complexity classes. Relevant examples are 1) solving the NE in a two-player zero-sum game, P-complete (Neumann, 1928), 2) solving the NE in a general-sum game, PPAD-hard (Daskalakis et al., 2009), 3) checking the uniqueness of the NE, NP-hard (Conitzer and Sandholm, 2002), 4) checking whether a pure-strategy NE exists in a stochastic game, PSPACE-hard (Conitzer and Sandholm, 2008), and 5) solving Dec-POMDP, NEXPTIME-hard (Bernstein et al., 2002).
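For the tiny normal-form game of Figure 4, the NE can be found by exhaustively checking mutual best responses, as in the hedged sketch below; the complexity results above are precisely the reason this brute-force approach does not extend beyond toy games. The payoff matrices are taken from Figure 4; the function names are illustrative.

```python
import numpy as np
from itertools import product

def pure_nash_equilibria(R1, R2):
    """Enumerate pure-strategy NE of a bimatrix game by checking mutual best responses."""
    equilibria = []
    for a1, a2 in product(range(R1.shape[0]), range(R1.shape[1])):
        best_for_1 = R1[a1, a2] >= R1[:, a2].max()   # row player cannot improve unilaterally
        best_for_2 = R2[a1, a2] >= R2[a1, :].max()   # column player cannot improve unilaterally
        if best_for_1 and best_for_2:
            equilibria.append((a1, a2))
    return equilibria

if __name__ == "__main__":
    actions = ["yield", "rush"]
    R1 = np.array([[0, 1], [2, 0]])   # row player's rewards in the Figure 4 game
    R2 = np.array([[0, 2], [1, 0]])   # column player's rewards
    for a1, a2 in pure_nash_equilibria(R1, R2):
        print((actions[a1], actions[a2]))   # -> ('yield', 'rush') and ('rush', 'yield')
```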

3.2.4 Special Types of Stochastic Games

To summarise the solutions to SGs, one can think of the “master” equation

Normal-form game solver + MDP solver = Stochastic game solver,

which was first summarised by Bowling and Veloso (2000) (in Table 4). The first term refers to solving an equilibrium (NE) for the stage game encountered at every time step; it assumes the transition and reward functions are known. The second term refers to applying an RL technique (such as Q-learning) to model the temporal structure in the sequential decision-making process; it assumes access only to observations of the transitions and rewards. The combination of the two gives a solution to SGs, where agents reach a certain type of equilibrium at each and every time step during the game.

Since solving general SGs with NE as the solution concept for the normal-form game is computationally challenging, researchers instead aim to study special types of SGs that have tractable solution concepts. In this section, we provide a brief summary of these special types of games.

Figure 6: Venn diagram of different types of games in the context of POSGs. The intersection of SG and Dec-POMDP is the team game. In the upper-half SG, we have MDP ⊂ team games ⊂ potential games ⊂ identical-interest games ⊂ SGs, and zero-sum games ⊂ SGs. In the bottom-half Dec-POMDP, we have MDP ⊂ team games ⊂ Dec-MDP ⊂ Dec-POMDPs, and MDP ⊂ POMDP ⊂ Dec-POMDP. We refer to Sections 3.2.4 & 3.2.5 for detailed definitions of these games.

Definition 4 (Special Types of Stochastic Games) Given the general form of an SG in Definition 2, we have the following special cases:

• normal-form game/repeated game: |S| = 1; see the example in Figure 4. These games have only a single state. Though not theoretically grounded, it is practically easier to solve a small-scale SG.

• identical-interest setting15: agents share the same learning objective, which we denote as R. Since all agents are treated independently, each agent can safely choose the action that maximises its own reward. As a result, single-agent RL algorithms can be applied safely, and a decentralised method developed. Several types of SGs fall into this category.

– team games/fully cooperative games/multi-agent MDP (MMDP): agents are assumed to be homogeneous and interchangeable, so importantly, they share the same reward function16, R = R^1 = R^2 = ··· = R^N.

– team-average reward games/networked multi-agent MDP (M-MDP): agents can have different reward functions, but they share the same objective, R = (1/N) Σ_{i=1}^{N} R^i.

– stochastic potential games: agents can have different reward functions, but their mutual interests are described by a shared potential function R = φ, defined as φ : S × A → R such that ∀(a^i, a^{−i}), (b^i, a^{−i}) ∈ A, ∀i ∈ {1, ..., N}, ∀s ∈ S, the following equation holds:

\[
R^i\big(s, a^i, a^{-i}\big) - R^i\big(s, b^i, a^{-i}\big) = \phi\big(s, a^i, a^{-i}\big) - \phi\big(s, b^i, a^{-i}\big). \tag{21}
\]

Games of this type are guaranteed to have a pure-strategy NE (Mguni, 2020). Moreover, potential games degenerate to team games if one chooses the reward function to be a potential function. (A numerical check of the property in Eq. (21) is sketched after this definition.)

• zero-sum setting: agents share opposite interests and act competitively, and each agent optimises against the worst-case scenario. The NE in a zero-sum setting can be solved using a linear program (LP) in polynomial time because of the minimax theorem developed by Neumann (1928). The idea of min-max values is also related to robustness in machine learning. We can subdivide the zero-sum setting as follows:

15In some of the literature on this topic, identical-interest games are equivalent to team games. Here, we refer to this type of game as a more general class of games that involve a shared objective function that all agents collectively optimise, although their individual reward functions can still be different.
16In some of the literature on this topic (for example, Wang and Sandholm (2003)), agents are assumed to receive the same expected reward in a team game, which means in the presence of noise, different agents may receive different reward values at a particular moment.

– two-player constant-sum games: R^1(s, a, s′) + R^2(s, a, s′) = c, ∀(s, a, s′), where c is a constant and usually c = 0. For cases when c ≠ 0, one can always subtract the constant c for all payoff entries to make the game zero-sum.

– two-team competitive games: two teams compete against each other, with team sizes N_1 and N_2. Their reward functions are {R^{1,1}, ..., R^{1,N_1}, R^{2,1}, ..., R^{2,N_2}}. Team members within a team share the same objective of either
\[
R^1 = \sum_{i\in\{1,...,N_1\}} R^{1,i}/N_1 \quad \text{or} \quad R^2 = \sum_{j\in\{1,...,N_2\}} R^{2,j}/N_2,
\]
and R^1 + R^2 = 0.

– harmonic games: Any normal-form game can be decomposed into a potential game plus a harmonic game (Candogan et al., 2011). A harmonic game (for example, rock-paper-scissors) can be regarded as a general class of zero-sum games with a harmonic property. Let p ∈ A be a joint pure-strategy profile, and let A_{[−i]} = {q ∈ A : q^i ≠ p^i, q^{−i} = p^{−i}} be the set of strategies that differ from p on agent i; then, the harmonic property is:
\[
\sum_{i\in\{1,...,N\}}\ \sum_{q\in A_{[-i]}} \big(R^i(p) - R^i(q)\big) = 0, \quad \forall p\in A.
\]

• linear-quadratic (LQ) setting: the transition model follows linear dynamics, and the reward function is quadratic with respect to the states and actions. Compared to a black-box reward function, LQ games offer a simple setting. For example, actor-critic methods are known to facilitate convergence to the NE of zero-sum LQ games (Al-Tamimi et al., 2007). Again, the LQ setting can be subdivided as follows:

– two-player zero-sum LQ games: Q ∈ R^{|S|}, U^1 ∈ R^{|A^1|}, and W^2 ∈ R^{|A^2|} are the known cost matrices for the state and action spaces, respectively, while the matrices A ∈ R^{|S|×|S|}, B ∈ R^{|S|×|A^1|}, C ∈ R^{|S|×|A^2|} are usually unknown to the agent:

\[
s_{t+1} = A s_t + B a^1_t + C a^2_t, \quad s_0 \sim P_0,
\]
\[
R^1(a^1_t, a^2_t) = -R^2(a^1_t, a^2_t) = -\mathbb{E}_{s_0\sim P_0}\Big[\sum_{t\ge 0} s_t^{\mathsf{T}} Q s_t + a^{1\mathsf{T}}_t U^1 a^1_t - a^{2\mathsf{T}}_t W^2 a^2_t\Big]. \tag{22}
\]

– multi-player general-sum LQ games: the difference with respect to a two-player game is that the summation of the agents’ rewards does not necessarily equal zero:

\[
s_{t+1} = A s_t + B a_t, \quad s_0 \sim P_0,
\]
\[
R^i(a) = -\mathbb{E}_{s_0\sim P_0}\Big[\sum_{t\ge 0} s_t^{\mathsf{T}} Q^i s_t + a_t^{i\mathsf{T}} U^i a^i_t\Big]. \tag{23}
\]
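The sketch below (referenced from the stochastic potential game item above) numerically checks the condition in Eq. (21) on a small game that is constructed, by assumption, from a potential function plus agent-specific terms that do not depend on each agent's own action; all names and the construction itself are illustrative.

```python
import numpy as np
from itertools import product

def is_potential_game(R, phi, n_actions):
    """Check Eq. (21): R^i(s,a^i,a^-i) - R^i(s,b^i,a^-i) == phi(s,a^i,a^-i) - phi(s,b^i,a^-i)."""
    n_agents = len(R)
    states = range(phi.shape[0])
    for s, joint in product(states, product(*[range(n) for n in n_actions])):
        for i in range(n_agents):
            for b in range(n_actions[i]):
                alt = list(joint); alt[i] = b          # deviate agent i from a^i to b^i
                lhs = R[i][(s, *joint)] - R[i][(s, *alt)]
                rhs = phi[(s, *joint)] - phi[(s, *alt)]
                if not np.isclose(lhs, rhs):
                    return False
    return True

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n_states, n_actions = 2, (2, 3)                    # two agents with 2 and 3 actions
    phi = rng.random((n_states, *n_actions))           # shared potential function
    # rewards = potential + a term that ignores the agent's own action,
    # which leaves the differences in Eq. (21) unchanged
    R = [phi + rng.random((n_states, 1, n_actions[1])),
         phi + rng.random((n_states, n_actions[0], 1))]
    print(is_potential_game(R, phi, n_actions))        # True
```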

3.2.5 Partially Observable Settings

A partially observable stochastic game (POSG) assumes that agents have no access to the exact environmental state but only an observation of the true state through an observation function. Formally, this scenario is defined by:

Definition 5 (partially-observable stochastic games) A POSG is defined by the set ⟨N, S, {A^i}_{i∈{1,...,N}}, P, {R^i}_{i∈{1,...,N}}, γ, {O^i}_{i∈{1,...,N}}, O⟩, where the last two elements are the newly added ones. In addition to the SG defined in Definition 2, POSGs add the following terms:

• O^i: an observation set for each agent i. The joint observation set is defined as O := O^1 × ··· × O^N.

• O : S × A → ∆(O): an observation function O(o | a, s′) denotes the probability of observing o ∈ O given the action a ∈ A and the new state s′ ∈ S from the environment transition.

Each agent’s policy now changes to π^i ∈ Π^i : O → ∆(A^i).

Although the added partial-observability constraint is common in practice for many real-world applications, theoretically it exacerbates the difficulty of solving SGs. Even

in the simplest setting of a two-player fully cooperative finite-horizon game, solving a POSG is NEXP-hard (see Figure 5), which means it requires super-exponential time to solve in the worst-case scenario (Bernstein et al., 2002). However, the benefits of studying games in the partially observable setting come from the algorithmic advantages. Centralised-training-with-decentralised-execution methods (Foerster et al., 2017a; Lowe et al., 2017; Oliehoek et al., 2016; Rashid et al., 2018; Yang et al., 2020) have achieved many empirical successes, and together with DNNs, they hold great promise.

A POSG is one of the most general classes of games. An important subclass of POSGs is decentralised partially observable MDP (Dec-POMDP), where all agents share the same reward. Formally, this scenario is defined as follows:

Definition 6 (Dec-POMDP) A Dec-POMDP is a special type of POSG defined in Definition 5 with R^1 = R^2 = ··· = R^N.

Dec-POMDPs are related to single-agent MDPs through the partial observability condition, and they are also related to stochastic team games through the assumption of identical rewards. In other words, versions of both single-agent MDPs and stochastic team games are particular types of Dec-POMDPs (see Figure 6).

Definition 7 (Special types of Dec-POMDPs) The following games are special types of Dec-POMDPs.

• partially observable MDP (POMDP): there is only one agent of interest, N = 1. This scenario is equivalent to a single-agent MDP in Definition 1 with a partial-observability constraint.

• decentralised MDP (Dec-MDP): the agents in a Dec-MDP have joint full observability. That is, if all agents share their observations, they can recover the state of the Dec-MDP unanimously. Mathematically, we have ∀o ∈ O, ∃s ∈ S such that P(S_t = s | O_t = o) = 1.

• fully cooperative stochastic games: assuming each agent has full observability, ∀i ∈ {1, ..., N}, ∀o^i ∈ O^i, ∃s ∈ S such that P(S_t = s | O^i_t = o^i) = 1. The fully cooperative SG from Definition 4 is a type of Dec-POMDP.

We conclude this subsection by presenting the relationships between the many different types of POSGs through a Venn diagram in Figure 6.

27 3.3 Problem Formulation: Extensive-Form Game

An SG assumes that a game is represented as a large table in each stage where the rows and columns of the table correspond to the actions of the two players17. Based on the big table, SGs model the situations in which agents act simultaneously and then receive their rewards. Nonetheless, for many real-world games, players take actions alternately. Poker is one class of games in which who plays first has a critical role in the players’ decision-making process. Games with alternating actions are naturally described by an extensive-form game (EFG) (Osborne and Rubinstein, 1994; Von Neumann and Morgenstern, 1945) through a tree structure. Recently, Kovařík et al. (2019) have made a significant contribution in unifying the framework of EFGs and the framework of POSGs.

Figure 7 shows the game tree of two-player Kuhn poker (Kuhn, 1950b). In Kuhn poker, the dealer has three cards, a King, Queen, and Jack (King > Queen > Jack), each player is dealt one card (the orange nodes in Figure 7), and the third card is put aside unseen. The game then develops as follows (a minimal sketch that turns these rules into a payoff function appears after the list).

• Player one acts first; he/she can check or bet.

• If player one checks, then player two decides to check or bet.

• If player two checks, then the higher card wins 1$ from the other player.

• If player two bets, then player one can fold or call.

• If player one folds, then player two wins 1$ from player one.

• If player one calls, then the higher card wins 2$ from the other player.

• If player one bets, then player two decides to fold or call.

• If player two folds, then player one wins 1$ from player two.

• If player two calls, then the higher card wins 2$ from the other player.
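To make the rules above concrete, here is a minimal sketch (my own illustration, not code from the text) that scores a terminal Kuhn poker history from player one's perspective, given both private cards and the public action history.

```python
CARD_RANK = {"J": 0, "Q": 1, "K": 2}

def kuhn_payoff_p1(card_p1: str, card_p2: str, history: tuple) -> int:
    """Player one's payoff at a terminal history of Kuhn poker.

    `history` is the sequence of public actions, e.g. ("check", "bet", "call").
    Only terminal histories are valid inputs.
    """
    higher = 1 if CARD_RANK[card_p1] > CARD_RANK[card_p2] else -1
    if history == ("check", "check"):
        return higher                      # showdown for 1$
    if history == ("check", "bet", "fold"):
        return -1                          # player one folds after player two's bet
    if history == ("check", "bet", "call"):
        return 2 * higher                  # showdown for 2$
    if history == ("bet", "fold"):
        return 1                           # player two folds
    if history == ("bet", "call"):
        return 2 * higher                  # showdown for 2$
    raise ValueError("non-terminal history")

# e.g. player one bets with a Jack and player two calls with a King:
assert kuhn_payoff_p1("J", "K", ("bet", "call")) == -2
```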

An important feature of EFGs is that they can handle imperfect information for multi-player decision making. In the example of Kuhn poker, the players do not know which card the opponent holds. However, unlike a Dec-POMDP, which also models imperfect information in the SG setting but is intractable to solve, an EFG, represented in an equivalent sequence form, can be solved by an LP in time polynomial in the number of game states (Koller and Megiddo, 1992). In the next section, we first introduce the EFG and then consider its sequence form.

17 A multi-player game is represented as a high-dimensional tensor in an SG.


Figure 7: Game tree of two-player Kuhn poker. Each node (i.e., circles, squares and rectangles) represents the choice of one player, each edge represents a possible action, and the leaves (i.e., diamonds) represent final outcomes over which each player has a reward function (only player one's reward is shown in the graph since Kuhn poker is a zero-sum game). Each player can observe only their own card; for example, when player one holds a Jack, it cannot tell whether player two is holding a Queen or a King, so the choice nodes of player one in each of the two scenarios stay within the same information set.

Definition 8 (Extensive-form Game) An (imperfect-information) EFG can be described by a tuple of key elements (N, A, H, T, {R^i}_{i∈{1,...,N}}, χ, ρ, P, {S^i}_{i∈{1,...,N}}).

• N: the number of players. Some EFGs involve a special player called "chance", which has a fixed stochastic policy that represents the randomness of the environment. For example, the chance player in Kuhn poker is the dealer, who distributes cards to the players at the beginning.

• A: the (finite) set of all agents’ possible actions.

• H: the (finite) set of non-terminal choice nodes.

• T: the (finite) set of terminal choice nodes, disjoint from H.

• χ : H → 2^|A| is the action function that assigns a set of valid actions to each choice node.

• ρ : H → {1, ..., N} is the player indicating function that assigns, to each non-terminal node, a player who is due to choose an action at that node.

• P : H × A → H ∪ T is the transition function that maps a choice node and an action to a new choice/terminal node, such that ∀h_1, h_2 ∈ H and ∀a_1, a_2 ∈ A, if P(h_1, a_1) = P(h_2, a_2), then h_1 = h_2 and a_1 = a_2.

• R^i : T → ℝ is a real-valued reward function for player i on the terminal nodes. Kuhn poker is a zero-sum game since R^1 + R^2 = 0.

• S^i: a set of equivalence classes/partitions S^i = (S^i_1, ..., S^i_{k^i}) for agent i on {h ∈ H : ρ(h) = i}, with the property that ∀j ∈ {1, ..., k^i} and ∀h, h' ∈ S^i_j, we have χ(h) = χ(h') and ρ(h) = ρ(h'). The set S^i_j is also called an information state. The physical meaning of an information state is that its choice nodes are indistinguishable. In other words, the set of valid actions and the acting agent are the same for all choice nodes within an information state; one can thus use χ(S^i_j), ρ(S^i_j) to denote χ(h), ρ(h), ∀h ∈ S^i_j.

Inclusion of the information sets in an EFG helps to model the imperfect-information cases in which players have only partial or no knowledge about their opponents. In the case of Kuhn poker, each player can only observe their own card. For example, when player one holds a Jack, it cannot tell whether player two is holding a Queen or a King, so the choice nodes of player one under each of the two scenarios (Queen or King) stay within the same information set. Perfect-information EFGs (e.g., GO or chess) are a special case in which every information set is a singleton, i.e., |S^i_j| = 1, ∀j, so a choice node can be equated with the unique history that leads to it.

Imperfect-information EFGs (e.g., Kuhn poker or Texas hold'em) are those in which there exist i, j such that |S^i_j| > 1, so an information state can represent more than one possible history. However, with the assumption of perfect recall (described later), the history that leads to an information state is still unique.

3.3.1 Normal-Form Representation

A (simultaneous-move) NFG can be equivalently transformed into an imperfect-information EFG18 (Shoham and Leyton-Brown, 2008) [Chapter 5]. Specifically, since the choices of actions by the other agents are unknown to the central agent, the different histories triggered by those choices can be aggregated into one information state for the central agent. In the other direction, an imperfect-information EFG can also be transformed into an equivalent NFG in which the pure strategies of each agent i are defined by the Cartesian product ∏_{S^i_j ∈ S^i} χ(S^i_j)19, which is a complete specification of which action to take at every information state of that agent. In the Kuhn poker example, one pure strategy for player one can be check-bet-check-fold-call-fold; altogether, player one has 2^6 = 64 pure strategies, and correspondingly there are 3 × 2^3 = 24 pure strategies for the chance node and 2^6 = 64 pure strategies for player two. The mixed strategy of each player is then a distribution over all its pure strategies. In this way, the NE in NFG in Eq. (17) can still be applied to the EFG, and the NE of an EFG can be solved in two steps: first, convert the EFG into an NFG; second, solve the NE of the induced NFG by means of the Lemke-Howson algorithm (Shapley, 1974). If one further restricts the action space to be state-dependent and adopts the discounted accumulated reward at the terminal node, then the EFG recovers an SG. While the NE of an EFG can be solved through its equivalent normal form, a computational benefit can be achieved by dealing with the extensive form directly; this motivates the adoption of the sequence-form representation of EFGs.

18 Note that this transformation is not unique, but the resulting EFGs share the same equilibria as the original game. Moreover, this transformation from NFG to EFG does not hold for perfect-information EFGs.
19 One subtlety of a pure strategy is that it designates a decision at each choice node, regardless of whether it is possible to reach that node given the decisions at the other choice nodes.
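As a sanity check on the counting argument above, the sketch below (illustrative only; the information-state labels are hypothetical) enumerates player one's pure strategies as the Cartesian product of the action sets over his/her six information states.

```python
from itertools import product

# Player one's six information states in Kuhn poker: one per private card at the
# first decision, and one per private card after the history check-bet.
infosets_p1 = {
    "J:first": ["check", "bet"], "Q:first": ["check", "bet"], "K:first": ["check", "bet"],
    "J:check-bet": ["fold", "call"], "Q:check-bet": ["fold", "call"], "K:check-bet": ["fold", "call"],
}

# A pure strategy fixes one action at every information state.
pure_strategies = list(product(*infosets_p1.values()))
assert len(pure_strategies) == 2 ** 6 == 64
print(pure_strategies[0])  # e.g. ('check', 'check', 'check', 'fold', 'fold', 'fold')
```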

3.3.2 Sequence-Form Representation

Solving EFGs via the NFG representation, though universal, is inefficient because the size of the induced NFG is exponential in the number of information states. In addition, the NFG representation does not consider the temporal structure of games. One way to address these problems is to operate on the sequence form of the EFG, also known as the realisation-plan representation, whose size is only linear in the number of game states and is thus exponentially smaller than that of the NFG. Importantly, this approach enables polynomial-time solutions to EFGs (Koller and Megiddo, 1992). In the sequence form of EFGs, the main focus shifts from mixed strategies to behavioural strategies in which, rather than randomising over complete pure strategies, the agents randomise independently at each information state in S^i, i.e., π^i : S^i → ∆(χ(S^i)). With the help of behavioural strategies, the key insight of the sequence form is that, rather than building a player's strategy around the notion of pure strategies, of which there can be exponentially many, one can build the strategy based on the paths in the game tree from the root to each node. In general, the expressive powers of behavioural strategies and mixed strategies are incomparable. However, if the game has perfect recall, which intuitively20 means that each agent precisely remembers all of his historical moves in different information states, then behavioural strategies and mixed strategies are equivalent in the following sense. Specifically, suppose all choice nodes in an information state share the same history that led to them (otherwise the agent could distinguish between the choice nodes). In that case, the well-known Kuhn's theorem (Kuhn, 1950a) guarantees that the expressive power of behavioural strategies and that of mixed strategies coincide, in the sense that they induce the same probabilities over outcomes for games of perfect recall. As a result, the set of NE does not change if one considers only behavioural strategies. In fact, the sequence-form representation is primarily useful for describing imperfect-information EFGs of perfect recall, written as:

Definition 9 (Sequence-form Representation) The sequence-form representation of an imperfect-information EFG of perfect recall (defined in Definition 8) is described by (N, Σ, {G^i}_{i∈{1,...,N}}, {π^i}_{i∈{1,...,N}}, {µ^{π^i}}_{i∈{1,...,N}}, {C^i}_{i∈{1,...,N}}), where

• N: the number of agents, including the chance node, if any, denoted by c.

• Σ = ∏_{i=1}^N Σ^i, where Σ^i is the set of sequences available to agent i. A sequence of actions of player i, σ^i ∈ Σ^i, defined by a choice node h ∈ H ∪ T, is the ordered set of player i's actions that have been taken from the root to node h. Let ∅ be the sequence that corresponds to the root node.

Note that other players' actions are not part of agent i's sequence. In the example of Kuhn poker, Σ^c = {∅, Jack, Queen, King, Jack-Queen, Jack-King, Queen-Jack, Queen-King, King-Jack, King-Queen}, Σ^1 = {∅, check, bet, check-fold, check-call}, and Σ^2 = {∅, check, bet, fold, call}.

• π^i : S^i → ∆(χ(S^i)) is the behavioural policy that assigns a probability of taking a valid action a^i ∈ χ(S^i) at each information state in S^i. This policy randomises independently over different information states. In the example of Kuhn poker, each player has six information states; their behavioural strategy is therefore a list of six independent probability distributions.

• µ^{π^i} : Σ^i → [0, 1] is the realisation plan that provides the realisation probability, i.e., µ^{π^i}(σ^i) = ∏_{c∈σ^i} π^i(c), that a sequence σ^i ∈ Σ^i would arise under a given behavioural policy π^i of player i. In the Kuhn poker case, the realisation probability that player one chooses the sequence of check and then fold is µ^{π^1}(check-fold) = π^1(check) × π^1(fold).

Based on the realisation plan, one can recover the underlying behavioural strategy21 (an idea similar to Eq. (6)). To do so, we need three additional pieces of notation. Let Seq : S^i → Σ^i return the sequence σ^i ∈ Σ^i that leads to a given information state S^i ∈ S^i. Since the game assumes perfect recall, Seq(S^i) is known to be unique. Let σ^i a^i denote the sequence that consists of the sequence σ^i followed by the single action a^i. Since there are many possible actions a^i to choose from, let Ext : Σ^i → 2^{Σ^i} denote the set of all possible sequences that extend the given sequence by taking one additional action. It is trivial to see that sequences that include a terminal node cannot be extended, i.e., Ext(T) = ∅. Finally, we can write the behavioural policy π^i for an information state S^i as

π^i(a^i ∈ χ(S^i)) = µ^{π^i}(Seq(S^i)a^i) / µ^{π^i}(Seq(S^i)),   ∀S^i ∈ S^i, ∀Seq(S^i)a^i ∈ Ext(Seq(S^i)).   (24)

• G^i : Σ → ℝ is the reward function for agent i, given by G^i(σ) = R^i(T) if a terminal node T ∈ T is reached when each player plays their part of the sequence in σ ∈ Σ, and G^i(σ) = 0 if non-terminal nodes are reached.

20 More formally, on the path from the root node to a decision node h ∈ S^i_t of player i, list in chronological order which information sets of i were encountered, i.e., S^i_t ∈ S^i, and what action player i took at that information set, i.e., a^i_t ∈ χ(S^i_t). If one calls this list (S^i_0, a^i_0, ..., S^i_{t-1}, a^i_{t-1}, S^i_t) the experience of player i in reaching node h ∈ S^i_t, then the game has perfect recall if and only if, for all players, any nodes in the same information set have the same experience. In other words, there exists one and only one experience that leads to each information state and the decision nodes in that information state; because of this, all perfect-information EFGs are games of perfect recall.

21 Empirically, it is often the case that working with the realisation plan of a behavioural strategy is more computationally friendly than working with the behavioural strategy directly.

Note that since each payoff that corresponds to a terminal node is stored only once in the sequence-form representation (due to perfect recall, each terminal node has only one sequence that leads to it), the sequence form is only linear in the size of the EFG, whereas the normal-form representation, which is a Cartesian product over all information sets for each agent, is exponential in size. In the example of Kuhn poker, the normal-form representation is a tensor with 64 × 64 × 32 elements, while in the sequence-form representation, since there are 30 terminal nodes and each node has only one unique sequence leading to it, the payoff tensor has only 30 elements (plus ∅ for each player).

• C^i: a set of linear constraints on the realisation probabilities µ^{π^i}. Under the notation of Seq and Ext defined in the bullet point of µ^{π^i}, the realisation plan must meet the conditions

µ^{π^i}(∅) = 1,   µ^{π^i}(σ^i) ≥ 0, ∀σ^i ∈ Σ^i,
µ^{π^i}(Seq(S^i)) = ∑_{σ^i ∈ Ext(Seq(S^i))} µ^{π^i}(σ^i),   ∀S^i ∈ S^i.   (25)

The first constraint requires that µ^{π^i} be a proper probability distribution. The second constraint in Eq. (25) states that, for a realisation plan to be valid (i.e., for a behavioural strategy to be recoverable from it), the probability of reaching each information state of agent i must equal the sum of the realisation probabilities of all the extended sequences. In the example of Kuhn poker, C^1 for player one contains µ^{π^1}(check) = µ^{π^1}(check-fold) + µ^{π^1}(check-call).
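The sketch below (an illustration under the notation just introduced, with hypothetical dictionary keys and for one fixed private card) builds player one's realisation plan µ^{π^1} from a behavioural strategy in Kuhn poker, checks the flow constraints of Eq. (25), and recovers the behavioural policy as in Eq. (24).

```python
import numpy as np

# Behavioural strategy of player one at his/her two information states
# (here, for one fixed private card): probabilities over valid actions.
policy_p1 = {
    "first":     {"check": 0.7, "bet": 0.3},
    "check-bet": {"fold": 0.4, "call": 0.6},
}

# Realisation plan mu(sigma) = product of the behavioural probabilities of the
# actions along the sequence sigma (Definition 9).
mu = {
    (): 1.0,
    ("check",): policy_p1["first"]["check"],
    ("bet",):   policy_p1["first"]["bet"],
    ("check", "fold"): policy_p1["first"]["check"] * policy_p1["check-bet"]["fold"],
    ("check", "call"): policy_p1["first"]["check"] * policy_p1["check-bet"]["call"],
}

# Constraints of Eq. (25): mu(empty) = 1 and, at each information state, the
# probability of reaching it equals the sum over its extended sequences.
assert mu[()] == 1.0
assert np.isclose(mu[()], mu[("check",)] + mu[("bet",)])
assert np.isclose(mu[("check",)], mu[("check", "fold")] + mu[("check", "call")])

# Recovering the behavioural policy as in Eq. (24):
assert np.isclose(mu[("check", "call")] / mu[("check",)],
                  policy_p1["check-bet"]["call"])
```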

3.4 Solving Extensive-Form Games

In the sequence-form EFG, given a joint (behavioural) policy π = (π^1, ..., π^N), we can write the realisation probability of the agents reaching a terminal node T ∈ T, assuming the sequence that leads to the node T is σ_T, in which each player, including the chance player, follows its own path σ^i_T, as

µ^π(σ_T) = ∏_{i∈{1,...,N}} µ^{π^i}(σ^i_T).   (26)

The expected reward for agent i, which covers all possible terminal nodes following the joint policy π, is thus given by Eq. (27).

R^i(π) = ∑_{T∈T} µ^π(σ_T) · G^i(σ_T) = ∑_{T∈T} µ^π(σ_T) · R^i(T).   (27)

If we denote the expected reward by R^i(π) for simplicity, then the solution concept of NE for the EFG can be written as

R^i(π^{i,*}, π^{-i,*}) ≥ R^i(π^i, π^{-i,*}),  for any policy π^i of agent i and for all i.   (28)
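As a tiny numerical illustration of Eqs. (26)-(27) (with made-up realisation probabilities, not a full Kuhn poker evaluation), each terminal node contributes the product of every player's realisation probability, including the chance player's, times the terminal reward.

```python
# Each terminal node T contributes prod_i mu^i(sigma_T^i) * R^1(T) to R^1(pi).
# The entries below are placeholders for a tiny two-terminal example.
terminals = [
    # (mu^chance, mu^player1, mu^player2, reward of player one at T)
    (1.0 / 6.0, 0.7, 0.5, +1.0),
    (1.0 / 6.0, 0.3, 1.0, -2.0),
]

expected_reward_p1 = sum(mu_c * mu_1 * mu_2 * r for mu_c, mu_1, mu_2, r in terminals)
print(expected_reward_p1)
```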

3.4.1 Perfect-Information Games

Every finite perfect-information EFG has a pure-strategy NE (Zermelo and Borel, 1913). Since players take turns and every agent sees everything that has occurred thus far, it is unnecessary to introduce randomness or mixed strategies into the action selection. However, the NE can be too weak a solution concept for the EFG. In contrast to NFGs, the NE in EFGs can contain non-credible threats, which describe the situation where the Nash strategy would not actually be executed as claimed if the agents truly reached that decision node. A refinement of the NE in the perfect-information EFG is the subgame-perfect equilibrium (SPE). The SPE rules out non-credible threats by picking only the NE that is a best response at every subgame of the original game. The fundamental principle in solving for the SPE is backward induction, which identifies the NE of the bottom-most subgames and assumes those NE will be played as one considers increasingly larger subgames. Specifically, backward induction can be implemented through a depth-first search on the game tree, which requires time that is only linear in the size of the EFG. In contrast, finding a NE in an NFG is known to be PPAD-hard, let alone that the NFG representation is exponential in the size of an EFG.

In the case of two-player zero-sum EFGs, backward induction needs to propagate only a single payoff from the terminal nodes to the root of the game tree. Furthermore, due to the strictly opposing interests between the players, one can further prune the backward induction process by recognising that certain subtrees will never be reached in a NE, even without examining those subtrees' nodes22, which leads to the well-known Alpha-Beta-Pruning algorithm (Shoham and Leyton-Brown, 2008, Chapter 5.1). For games with very deep game trees, such as Chess or GO, a common approach is to search only nodes up to a certain depth and use an approximate value function to estimate the value of those nodes without rolling out to the end (Silver et al., 2016). Finally, backward induction can identify one NE in linear time; yet, it does not provide an effective way to find all NE. A theoretical result suggests that finding all NE in a two-player perfect-information EFG (not necessarily zero-sum) requires O(|T|^3) time, which is still tractable (Shoham and Leyton-Brown, 2008, Theorem 5.1.6).
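A minimal backward-induction sketch for a two-player zero-sum perfect-information game tree; the tree below is a made-up example, and the function simply propagates a single payoff from the leaves to the root, as described above.

```python
from typing import List, Union

# A node is either a terminal payoff (player one's reward) or a list of child
# nodes; the players alternate by depth, with player one (maximiser) at the root.
Tree = Union[float, List["Tree"]]

def backward_induction(node: Tree, maximising: bool = True) -> float:
    """Propagate a single payoff from the leaves to the root (zero-sum case)."""
    if isinstance(node, (int, float)):
        return float(node)
    values = [backward_induction(child, not maximising) for child in node]
    return max(values) if maximising else min(values)

# A depth-2 example: player one picks a subtree, player two then picks a leaf.
game = [[3.0, -1.0], [2.0, 5.0]]
assert backward_induction(game) == 2.0   # player one's subgame-perfect value
```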

3.4.2 Imperfect-Information Games

By means of the sequence-form representation, one can write the solution of a two-player EFG as an LP. Given a fixed behavioural strategy of player two, in the form of the realisation plan µ^{π^2}, the best response for player one can be written as

max_{µ^{π^1}}  ∑_{σ^1∈Σ^1} µ^{π^1}(σ^1) ( ∑_{σ^2∈Σ^2} g^1(σ^1, σ^2) µ^{π^2}(σ^2) ),

subject to the constraints in Eq. (25). In a NE, player one and player two form a mutual best response. However, if we treat both µ^{π^1} and µ^{π^2} as variables, then the objective becomes nonlinear. The key to addressing this issue is to adopt the dual form of the LP (Koller and Megiddo, 1996), which is written as

min  v_0

s.t.  v_{I(σ^1)} − ∑_{I'∈I(Ext(σ^1))} v_{I'}  ≥  ∑_{σ^2∈Σ^2} g^1(σ^1, σ^2) µ^{π^2}(σ^2),   ∀σ^1 ∈ Σ^1   (29)

22 This occurs, for example, in the case that the worst case of one player in one subgame is better than the best case of that player in another subgame.

where I : Σ^i → S^i is a mapping function that returns the information set23 encountered when the final action in σ^i was taken. With a slight abuse of notation, we let I(Ext(σ^1))24 denote the set of final information states encountered in the set of extensions of σ^1. The variable v_0 represents, given µ^{π^2}, player one's expected reward under its own realisation plan µ^{π^1}, and v_{I'} can be considered as the part of this expected utility in the subgame starting from information state I'. Note that the constraint needs to hold for every sequence of player one. In the dual form of the best response in Eq. (29), if one treats µ^{π^2} as an optimising variable rather than a constant, which then must meet the requirements in Eq. (25) to be a proper realisation plan, the LP formulation for a two-player zero-sum EFG can be written as follows.

min  v_0   (30)

s.t.  v_{I(σ^1)} − ∑_{I'∈I(Ext(σ^1))} v_{I'}  ≥  ∑_{σ^2∈Σ^2} g^1(σ^1, σ^2) µ^{π^2}(σ^2),   ∀σ^1 ∈ Σ^1   (31)

µ^{π^2}(∅) = 1,   µ^{π^2}(σ^2) ≥ 0, ∀σ^2 ∈ Σ^2   (32)

µ^{π^2}(Seq(S^2)) = ∑_{σ^2∈Ext(Seq(S^2))} µ^{π^2}(σ^2),   ∀S^2 ∈ S^2.   (33)

Player two's realisation plan is now selected to minimise player one's expected utility. Based on the minimax theorem (Von Neumann and Morgenstern, 1945), we know this process leads to a NE. Notably, although the zero-sum EFG and the zero-sum SG (see the formulation in Eq. (51)) both adopt an LP formulation to solve the NE and can both be solved in polynomial time, the sizes of the representations of the games themselves are very different. If one chooses first to transform the EFG into an NFG representation and then to solve it by LP, the time complexity would in fact become exponential in the size of the original EFG.

The solution to a two-player general-sum EFG can also be formulated using an approach similar to that used for the zero-sum EFG. The difference is that there will be no objective function such as Eq. (30), since in the general-sum context one agent's reward can no longer be determined from the other player's reward. The LP with only Eqs. (31)-(33) thus becomes a constraint satisfaction problem.

23 Recall that this information set is unique under the assumption of perfect recall.
24 Recall that Ext(σ^1) is the set of all possible sequences that extend σ^1 one step ahead.

Specifically, one would need to repeat Eqs. (31)-(33) twice to consider each player independently. One final subtlety required in solving the two-player general-sum EFG is that, to ensure v^1 and v^2 are bounded25, a complementary slackness condition must be further imposed; we have, ∀σ^1 ∈ Σ^1 (and, vice versa, ∀σ^2 ∈ Σ^2 for player two):

µ^{π^1}(σ^1) · ( v_{I(σ^1)} − ∑_{I'∈I(Ext(σ^1))} v_{I'} − ∑_{σ^2∈Σ^2} g^1(σ^1, σ^2) µ^{π^2}(σ^2) ) = 0.   (34)

The above condition indicates that, for each player, either the sequence σ^i is never played, i.e., µ^{π^i}(σ^i) = 0, or all sequences that are played by that player with positive probability must induce the same expected payoff (i.e., make the corresponding constraint in Eq. (31) tight), which prevents v^i from taking arbitrarily large values and thus keeps it bounded. Eqs. (31)-(33), together with Eq. (34), turn the computation of the NE into an LCP problem that can be solved by the generalised Lemke-Howson method (Lemke and Howson, 1964). Although in the worst case polynomial time complexity cannot be achieved, as it can for zero-sum games, this approach is still exponentially faster than running the Lemke-Howson method to solve the NE in a normal-form representation. For a perfect-information EFG, recall that the SPE is a more informative solution concept than the NE. Extending the SPE to the imperfect-information scenario is therefore valuable. However, such an extension is non-trivial because a well-defined notion of a subgame is lacking. Nevertheless, for EFGs with perfect recall, the intuition of subgame perfection can be effectively extended to a new solution concept, named the sequential equilibrium (SE) (Kreps and Wilson, 1982), which is guaranteed to exist and coincides with the SPE if all players in the game have perfect information.
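To make Eqs. (30)-(33) concrete, the sketch below solves the sequence-form LP with scipy for the smallest possible case: a one-shot zero-sum game (matching pennies) written in sequence form. The matrices E, e and F, f encode the two players' realisation-plan constraints, and A holds g^1 over pairs of sequences; all variable names are mine, not the text's.

```python
import numpy as np
from scipy.optimize import linprog

# Matching pennies in sequence form.  Sequences: player one {∅, H, T},
# player two {∅, h, t}.  Realisation-plan constraints: E x = e, F y = f.
E = np.array([[1.0, 0.0, 0.0],      # mu1(∅) = 1
              [-1.0, 1.0, 1.0]])    # mu1(H) + mu1(T) = mu1(∅)
e = np.array([1.0, 0.0])
F, f = E.copy(), e.copy()           # player two's constraints have the same form

# Player one's payoff g1(sigma1, sigma2) on pairs of sequences (0 if no leaf).
A = np.array([[0.0, 0.0, 0.0],
              [0.0, 1.0, -1.0],     # H vs h / t
              [0.0, -1.0, 1.0]])    # T vs h / t

# LP of Eqs. (30)-(33): variables z = [y, v]; minimise e^T v = v0
# subject to A y - E^T v <= 0 (Eq. 31), F y = f (Eqs. 32-33), y >= 0.
n_y, n_v = F.shape[1], E.shape[0]
c = np.concatenate([np.zeros(n_y), e])                 # objective e^T v
A_ub = np.hstack([A, -E.T])                            # A y - E^T v <= 0
b_ub = np.zeros(A.shape[0])
A_eq = np.hstack([F, np.zeros((F.shape[0], n_v))])     # F y = f
b_eq = f
bounds = [(0, None)] * n_y + [(None, None)] * n_v

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y, v = res.x[:n_y], res.x[n_y:]
print("game value for player one:", v @ e)     # 0 for matching pennies
print("player two's realisation plan:", y)     # [1, 0.5, 0.5]
```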

4 Grand Challenges of MARL

Compared to single-agent RL, multi-agent RL is a general framework that better matches the broad scope of real-world AI applications. However, due to the existence of multiple agents that learn simultaneously, MARL methods pose more theoretical challenges, in addition to those already present in single-agent RL. Compared to classic MARL settings where there are usually two agents, solving a many-agent RL problem is even more challenging. As a matter of fact, (1) the combinatorial complexity, (2) the multi-dimensional learning objectives, and (3) the issue of non-stationarity all result in the majority of MARL algorithms being capable of solving games with (4) only two players, in particular, two-player zero-sum games. In this section, I will elaborate on each of these grand challenges in many-agent RL.

25 Since the constraints are linear, they remain satisfied when both v^1 and v^2 are increased by the same constant to arbitrarily large values.

4.1 The Combinatorial Complexity

In the context of multi-agent learning, each agent has to consider the other agents' actions when determining its best response; this characteristic is deeply rooted in each agent's reward function and is reflected, for example, by the joint action a in the Q-function Q^i(s, a) in Eq. (13). The size of the joint action space, |A|^N, grows exponentially with the number of agents and thus largely constrains the scalability of MARL methods. Furthermore, the combinatorial complexity is worsened by the fact that solving for a NE is PPAD-hard in game theory, even for two-player games. Therefore, for multi-player general-sum games (neither team games nor zero-sum games), it is non-trivial to find an applicable solution concept. One common way to address this issue is to assume specific factorised structures on the action dependency such that the reward function or Q-function can be significantly simplified. For example, a graphical game assumes an agent's reward is affected by only its neighbouring agents, as defined by the underlying graph (Kearns, 2007). This assumption leads directly to a polynomial-time solution for the computation of a NE in specific tree graphs (Kearns et al., 2013), though the scope of applications is somewhat limited beyond this specific scenario. Recent progress has also been made toward leveraging particular neural network architectures for Q-function decomposition (Rashid et al., 2018; Sunehag et al., 2018; Yang et al., 2020). In addition to the fact that these methods work only for the team-game setting, the majority of them lack theoretical backing. There remain open questions to answer, such as understanding the representational power (i.e., the approximation error) of the factorised Q-functions in a multi-agent task and how the factorisation itself can be learnt from scratch.

4.2 The Multi-Dimensional Learning Objectives

Compared to single-agent RL, where the only goal is to maximise the learning agent's long-term reward, the learning goals in MARL are naturally multi-dimensional, as the objectives of all agents are not necessarily aligned under one metric. Bowling and Veloso (2001, 2002) proposed to classify the goals of the learning task into two types: rationality and convergence. Rationality ensures an agent takes the best possible response to its opponents when they are stationary, and convergence ensures that the learning dynamics eventually lead to a stable policy against a given class of opponents. Achieving both rationality and convergence gives rise to the NE. In terms of rationality, the NE characterises a fixed point of a joint optimal strategy profile from which no agent would be motivated to deviate as long as they are all perfectly rational. However, in practice, an agent's rationality can easily be bounded by cognitive limitations and/or the tractability of the decision problem. In these scenarios, the rationality assumption can be relaxed to include other types of solution concepts, such as the recursive reasoning equilibrium, which results from modelling the reasoning process recursively among agents with finite levels of hierarchical thinking (for example, an agent may reason in the following way: I believe that you believe that I believe ...) (Wen et al., 2019, 2018); the best response against a target type of opponent (Powers and Shoham, 2005b); the mean-field game equilibrium, which describes multi-agent interactions as a two-agent interaction between each agent itself and the population mean (Guo et al., 2019; Yang et al., 2018a,b); evolutionary stable strategies, which describe an equilibrium strategy based on its evolutionary advantage of resisting invasion by rare emerging mutant strategies (Bloembergen et al., 2015; Maynard Smith, 1972; Tuyls and Nowé, 2005; Tuyls and Parsons, 2007); the Stackelberg equilibrium (Zhang et al., 2019a), which assumes a specific sequential order when agents make decisions; and the robust equilibrium (also called the trembling-hand perfect equilibrium in game theory), which is stable against adversarial disturbance (Goodfellow et al., 2014b; Li et al., 2019b; Yabu et al., 2007).

In terms of convergence, although most MARL algorithms are contrived to converge to the NE, the majority either lack a rigorous convergence guarantee (Zhang et al., 2019b), potentially converge only under strong assumptions such as the existence of a unique NE (Hu and Wellman, 2003; Littman, 2001b), or are provably non-convergent in all cases (Mazumdar et al., 2019a). Zinkevich et al. (2006) identified the non-convergent behaviour of value-iteration methods in general-sum SGs and instead proposed an alternative solution concept to the NE – cyclic equilibria – to which value-based methods do converge. The concept of no regret (also called the Hannan consistency in game theory (Hansen et al., 2003)) measures convergence by comparison against the best possible strategy in hindsight. This was also proposed as a new criterion to evaluate convergence in zero-sum self-plays (Bowling, 2005; Hart and Mas-Colell, 2001; Zinkevich et al., 2008). In two-player zero-sum games with a non-convex non-concave loss landscape (e.g., when training GANs (Goodfellow et al., 2014a)), gradient-descent-ascent methods are found to reach a Stackelberg equilibrium (Fiez et al., 2019; Lin et al., 2019) or a local differential NE (Mazumdar et al., 2019b) rather than the general NE. Finally, although the above solution concepts account for convergence, building a convergent objective for MARL methods with DNNs remains an uncharted area. This is partly because the global convergence of single-agent deep RL algorithms, for example, neural policy gradient methods (Liu et al., 2019; Wang et al., 2019) and neural TD learning algorithms (Cai et al., 2019b), has not been extensively studied yet.

4.3 The Non-Stationarity Issue

The most well-known challenge of multi-agent learning versus single-agent learning is probably the non-stationarity issue. Since multiple agents concurrently improve their policies according to their own interests, from each agent's perspective the environmental dynamics become non-stationary and challenging to interpret during learning. This problem occurs because an agent cannot tell whether a state transition – or a change in reward – is an actual outcome of its own action or is due to its opponents' explorations. Although learning independently by completely ignoring the other agents can sometimes yield surprisingly powerful empirical performance (Matignon et al., 2012; Papoudakis et al., 2020), this approach violates the stationarity assumption that underpins the theoretical convergence guarantees of single-agent learning methods (Tan, 1993). As a result, the Markovian property of the environment is lost, and the state occupancy measure of the stationary policy in Eq. (5) no longer exists. For example, single-agent policy gradient methods are provably non-convergent even in simple linear-quadratic games when applied in MARL (Mazumdar et al., 2019b). The non-stationarity issue can be further aggravated when TD learning is combined with the replay buffer that most deep RL methods currently adopt (Foerster et al., 2017b). In single-agent TD learning (see Eq. (9)), the agent bootstraps the current estimate of the

Figure 8: The scope of multi-agent intelligence, as described here, consists of three pillars. Deep learning serves as a powerful function approximation tool for the learning process. Game theory provides an effective approach to describe learning outcomes. RL offers a valid approach to describe agents’ incentives in multi-agent systems.

TD error, saves it in the replay buffer, and samples the data in the replay buffer to update the value function. In the context of multi-agent learning, since the value function of one agent also depends on the other agents' actions, the bootstrap step in TD learning also requires sampling the other agents' actions, which leads to two problems. First, the sampled actions barely represent the full behaviour of the other agents' underlying policies across different states. Second, an agent's policy can change during training, so the samples in the replay buffer can quickly become outdated. Therefore, the dynamics that yielded the data in an agent's replay buffer must be constantly updated to reflect the current dynamics in which it is learning. This process further exacerbates the non-stationarity issue. In general, the non-stationarity issue forbids reusing the same mathematical tools for analysing single-agent algorithms in the multi-agent context. However, one exception exists: the identical-interest game in Definition 4. In such settings, each agent can safely behave selfishly without considering the other agents' policies, since it knows the other agents will also act in their own interest. Stationarity is thus maintained, so single-agent RL algorithms can still be applied.

4.4 The Scalability Issue when N ≫ 2

Combinatorial complexity, multi-dimensional learning objectives, and the issue of non-stationarity all result in the majority of MARL algorithms being capable of solving games with only two players, in particular two-player zero-sum games (Zhang et al., 2019b). As a result, solutions to general-sum settings with more than two agents (for example, the many-agent problem) remain an open challenge. This challenge must be addressed from all three perspectives of multi-agent intelligence (see Figure 8): game theory, which provides realistic and tractable solution concepts to describe the learning outcomes of a many-agent system; RL algorithms, which offer provably convergent learning procedures that can reach stable and rational equilibria in the sequential decision-making process; and, finally, deep learning techniques, which provide the learning algorithms with expressive function approximators.

5 A Survey of MARL Surveys

In this section, I provide a non-comprehensive review of MARL algorithms. To begin, I introduce different taxonomies that can be applied to categorise prior approaches. Given that multiple high-quality, comprehensive surveys on MARL methods already exist, a survey of those surveys is provided. Based on the proposed taxonomy, I then review related MARL algorithms, covering works on identical-interest games, zero-sum games, and games with an infinite number of players. This section is written to be selective, focusing on algorithms that have theoretical guarantees and paying less attention to those with only empirical success or those that are purely driven by specific applications.

5.1 Taxonomy of MARL Algorithms

One significant difference between the taxonomy of single-agent RL algorithms and that of MARL algorithms is that in the single-agent setting, since the problem is unanimously defined, the taxonomy is driven mainly by the type of solution (Kaelbling et al., 1996; Li, 2017), for example, model-free vs model-based, on-policy vs off-policy, or TD learning vs Monte-Carlo methods. By contrast, in the multi-agent setting, due to the existence of multiple learning objectives (see Section 4.2), the taxonomy is driven mainly by the type of problem rather than the solution. In fact, asking the right question for MARL algorithms is itself a research problem, which is referred to as the problem problem (Balduzzi et al., 2018b; Shoham et al., 2007).

Based on Stage Game Types. Since the solution concept varies considerably according to the game type, one principal component of the MARL taxonomy is the nature of the stage games. A common division26 includes team games (more generally, potential games), zero-sum games (more generally, harmonic games), and a mixed setting of the two, namely general-sum games. Other types of "exotic" games, such as potential games (Monderer and Shapley, 1996) and mean-field games (Lasry and Lions, 2007), that originate from non-game-theoretical research domains also exist and have recently attracted tremendous attention. Based on the type of stage game, the taxonomy can be further enriched by how many times the game is played. A repeated game is one in which a single stage game is played repeatedly, without considering state transitions. An SG is a sequence of stage games, which can be infinitely long, with the order of the games to play determined by the state-transition probability. Since solving a general-sum SG is at least PSPACE-hard (Conitzer and Sandholm, 2002), MARL algorithms usually have a clear boundary on what types of game they can solve. For general-sum games, there are few MARL algorithms that have a provable convergence guarantee without strong, even unrealistic, assumptions (e.g., that the NE is unique) (Shoham et al., 2007; Zhang et al., 2019b).

26 Such a division is complementary because any multi-player normal-form game can be decomposed into a potential game plus a harmonic game (Candogan et al., 2011) (also see Definition 4); in the two-player case, this corresponds to a team game plus a zero-sum game.

Table 1: Common assumptions on the level of local knowledge made by MARL algorithms.

Levels   Assumptions
0        Each agent observes the reward of his selected action.
1        Each agent observes the rewards of all possible actions.
2        Each agent observes others' selected actions.
3        Each agent observes others' reward values.
4        Each agent knows others' exact policies.
5        Each agent knows others' exact reward functions.
6        Each agent knows the equilibrium of the stage game.

Based on Level of Local Knowledge. The assumption on the level of local knowledge, i.e., what agents can and cannot know during training and execution time, is another major component differentiating MARL algorithms. Having access to different levels of local knowledge leads to different local behaviours by agents and to various levels of difficulty in developing theoretical analysis. I list the common assumptions that most MARL methods adopt in Table (1). The seven levels of assumptions are ranked based on how strong, or unrealistic, they are in general. The two extreme cases are that an agent can observe nothing apart from itself and that an agent knows the equilibrium point, i.e., the direct answer to the game. Among the multiple levels, the nuance between level 0 and level 1, which has been mainly investigated in the online learning literature, is referred to as the bandit setting vs the full-information setting. In addition, knowing the agents' exact policy/reward function forms is a much stronger assumption than being able to observe their sampled actions/rewards. In fact, knowing the exact policy parameters of other agents is in most cases only possible in simulation. Furthermore, from an applicability perspective, observing other agents' rewards is even less realistic than observing their actions.

Figure 9: Common learning paradigms of MARL algorithms. (1) Independent learners with shared policy. (2) Independent learners with independent policies (i.e., denoted by the difference in wheels). (3) Independent learners with shared policy within a group. (4) One central controller controls all agents: agents can exchange information with any other agents at any time. (5) Centralised training with decentralised execution (CTDE): only during training, agents can exchange information with others; during execution, they act independently. (6) Decentralised training with networked agents: during training, agents can exchange information with their neighbours in the network; during execution, they act independently.

Based on Learning Paradigms. In addition to various levels of local knowledge, MARL algorithms can be classified based on the learning paradigm, as shown in Figure 9. For example, the 4th learning paradigm addresses multi-agent problems by building a single-agent controller, which takes the joint information from all agents as inputs and outputs the joint policies for all agents. In this paradigm, agents can exchange any information with any other opponent through the central controller. The information that can be exchanged depends on the assumptions about the level of local knowledge described in Table (1), e.g., private observations from each agent, the reward value, or policy parameters for each agent. The 5th learning paradigm allows agents to exchange information with other agents only during training; during execution, each agent has to act in a decentralised manner, making decisions based on its own observations only. The 6th paradigm can be regarded as a particular case of Paradigm 5 in that agents are assumed to be interconnected via a (time-varying) network such that information can still spread across the whole network if agents communicate with their neighbours. The most general case is Paradigm 2, where agents are fully decentralised, with no information exchange of any kind allowed at any time, and each agent executes its own policy.

Relaxations of Paradigm 2 yield the 1st and the 3rd paradigms, where the agents, although they cannot exchange information, share a single set of policy parameters globally (Paradigm 1) or within a pre-defined group (Paradigm 3).

Based on Five AI Agendas. In order for MARL researchers to be specific about the problem being addressed and the associated evaluation criteria, Shoham et al. (2007) identified five coherent agendas for MARL studies, each of which has a clear motivation and success criterion. Though proposed more than a decade ago, these five distinct goals are still useful in evaluating and categorising recent contributions. I, therefore, choose to incorporate them into the taxonomy of MARL algorithms.

5.2 A Survey of Surveys

A multi-agent system (MAS) is a generic concept that could refer to many different domains of research across different academic subjects; general overviews are given by Weiss (1999), Wooldridge (2009), and Shoham and Leyton-Brown (2008). Due to the many possible ways of categorising multi-agent (reinforcement) learning algorithms, it is impossible to have a single survey that includes all relevant works considering all directions of categorisation. In the past two decades, there has been no lack of survey papers that summarise the current progress of specific categories of multi-agent learning research. In fact, there are so many that these surveys themselves deserve a comprehensive review. Before proceeding to review MARL algorithms based on the proposed taxonomy in Section 5.1, in this section I provide an overview of relevant surveys that study multi-agent systems from the machine learning, in particular the RL, perspective. One of the earliest studies that surveyed MASs in the context of machine learning/AI was published by Stone and Veloso (2000): the research works up to that time were summarised into four major scenarios considering whether agents were homogeneous or heterogeneous and whether or not agents were allowed to communicate with each other. Shoham et al. (2007) considered the game theory and RL perspective and introspectively asked the question of "if multi-agent learning is the answer, what is the question?". Upon failing to find a single answer, Shoham et al. (2007) proposed the famous five AI agendas for future research work to address. Stone (2007) tried to answer Shoham's question

Table 2: Summary of the five agendas for multi-agent learning research (Shoham et al., 2007).

ID   Agenda                          Description
1    Computational                   To develop efficient methods that can compute solution concepts of the game. Examples: Berger (2007); Leyton-Brown and Tennenholtz (2005)
2    Descriptive                     To develop formal models of learning that agree with the behaviours of people/animals/organisations. Examples: Camerer et al. (2002); Erev and Roth (1998)
3    Normative                       To determine which sets of learning rules are in equilibrium with each other. For example, we can ask if fictitious play and Q-learning can reach equilibrium with each other in a repeated prisoner's dilemma game.
4    Prescriptive, cooperative       To develop distributed learning algorithms for team games. In this agenda, there is rarely a role for equilibrium analysis since the agents have no motivation to deviate from the prescribed algorithm. Examples: Claus and Boutilier (1998a)
5    Prescriptive, non-cooperative   To develop effective methods for obtaining a "high reward" in a given environment, for example, an environment with a selected class of opponents. Examples: Powers and Shoham (2005a,b)

by emphasising that MARL can be more broadly framed than through game-theoretic terms, and he noted that how to apply the MARL technique remains an open question, rather than being an answer, in contrast to the suggestion of Shoham et al. (2007). The survey of Tuyls and Weiss (2012) also reflected on Stone's viewpoint; they believed that the entanglement of only RL and game theory is too narrow in its conceptual scope, and MARL should embrace other ideas, such as transfer learning (Taylor and Stone, 2009), swarm intelligence (Kennedy, 2006), and co-evolution (Tuyls and Parsons, 2007). Panait and Luke (2005) investigated the cooperative MARL setting; instead of considering only reinforcement learners, they reviewed learning algorithms based on the division of team learning (i.e., applying a single learner to search for the optimal joint behaviour for the whole team) and concurrent learning (i.e., applying one learner per agent), which includes broader areas of evolutionary computation, complex systems, etc. Matignon et al. (2012) surveyed the solutions for fully-cooperative games only; in particular, they focused on evaluating independent RL solutions powered by Q-learning and its many variants. Jan't Hoen et al. (2005) conducted an overview with a similar scope; moreover, they extended the work to include fully competitive games in addition to fully cooperative games. Buşoniu et al. (2010), to the best of my knowledge, presented the first comprehensive survey on MARL techniques, covering both value iteration-based and policy search-based methods, together with their strengths and weaknesses. In their survey, they considered not only fully cooperative or competitive games but also the effectiveness of different algorithms in the general-sum setting. Nowé et al. (2012), in the 14th chapter, addressed the same topic as Buşoniu et al. (2010) but with a much narrower coverage of multi-agent RL algorithms. Tuyls and Nowé (2005) and Bloembergen et al. (2015) both surveyed the dynamic models that have been derived for various MARL algorithms and revealed the deep connection between evolutionary game theory and MARL methods. We refer to Table 1 in Tuyls and Nowé (2005) for a summary of this connection. Hernandez-Leal et al. (2017) provided a different perspective on the taxonomy of how existing MARL algorithms cope with the issue of non-stationarity induced by opponents. On the basis of the opponent and environment characteristics, they categorised the MARL algorithms according to the type of opponent modelling. Da Silva and Costa (2019) introduced a new perspective of reviewing MARL algorithms based on how knowledge is reused, i.e., transfer learning. Specifically, they grouped the surveyed algorithms into intra-agent and inter-agent methods, which correspond to the reuse of knowledge from experience gathered by the agent itself and that acquired from other agents, respectively. Most recently, deep MARL techniques have received considerable attention. Nguyen et al. (2020) surveyed how deep learning techniques were used to address the challenges in multi-agent learning, such as partial observability, continuous state and action spaces, and transfer learning. OroojlooyJadid and Hajinezhad (2019) reviewed the application of deep MARL techniques in fully cooperative games: the survey on this setting is thorough. Hernandez-Leal et al. (2019) summarised how the classic ideas from traditional MAS research, such as emergent behaviour, learning communication, and opponent modelling, were incorporated into deep MARL domains, based on which they proposed a new categorisation for deep MARL methods.

Zhang et al. (2019b) performed a selective survey on MARL algorithms that have theoretical convergence guarantees and complexity analysis. To the best of my knowledge, their review is the only one to cover more advanced topics such as decentralised MARL with networked agents, mean-field MARL, and MARL for stochastic potential games. On the application side, Müller and Fischer (2014) surveyed 152 real-world applications in various sectors powered by MAS techniques. Campos-Rodriguez et al. (2017) reviewed the application of multi-agent techniques for automotive industry applications, such as traffic coordination and route balancing. Derakhshan and Yousefi (2019) focused on real-world applications for wireless sensor networks, Shakshuki and Reid (2015) studied multi-agent applications for the healthcare industry, and Kober et al. (2013) investigated the application of robotic control and summarised profitable RL approaches that can be applied to robots in the real world.

6 Learning in Identical-Interest Games

The majority of MARL algorithms assume that agents collaborate with each other to achieve shared goals. In this setting, agents are usually considered homogeneous and play an interchangeable role in the environmental dynamics. In a two-player normal-form game or repeated game, for example, this means the payoff matrix is symmetrical.

6.1 Stochastic Team Games

One benefit of studying identical-interest games is that single-agent RL algorithms with a theoretical guarantee can be safely applied. For example, in the team game27 setting, since all agents' rewards are always the same, the Q-functions are identical among all agents. As a result, one can simply apply the single-agent RL algorithms over the joint action space a ∈ A; equivalently, Eq. (14) can be written as

eval^i({Q^i(s_{t+1}, ·)}_{i∈{1,...,N}}) = V^i(s_{t+1}, arg max_{a∈A} Q^i(s_{t+1}, a)).   (35)

27 The terms Markov team games, stochastic team games, and dynamic team games are used interchangeably across different domains of the literature.
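The following sketch illustrates the joint-action view behind Eq. (35) in the simplest possible setting: tabular Q-learning over the joint action space of a two-agent repeated team game with a hypothetical shared payoff matrix (similar in spirit to the coordination example discussed next). It is a minimal illustration, not an algorithm taken from the text.

```python
import numpy as np

# Tabular Q-learning over the *joint* action space of a two-agent team game.
# The shared reward matrix below is a made-up example with two symmetric optima.
rng = np.random.default_rng(0)
R = np.array([[0.0, 2.0],   # R[a1, a2]: reward shared by both agents
              [2.0, 0.0]])

n_a1, n_a2 = R.shape
Q = np.zeros((n_a1, n_a2))   # one shared Q-table over joint actions
alpha, eps = 0.1, 0.1

for t in range(5000):
    # epsilon-greedy over the joint action (a1, a2), as a single-agent learner would do
    if rng.random() < eps:
        a1, a2 = rng.integers(n_a1), rng.integers(n_a2)
    else:
        a1, a2 = np.unravel_index(Q.argmax(), Q.shape)
    r = R[a1, a2]
    # stateless repeated game: no bootstrapping term is needed
    Q[a1, a2] += alpha * (r - Q[a1, a2])

print("greedy joint action:", np.unravel_index(Q.argmax(), Q.shape))
```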

Littman (1994) first studied this approach in SGs. However, one issue with this approach is that when multiple equilibria exist (e.g., a normal-form game with reward matrix R = [(0, 0), (2, 2); (2, 2), (0, 0)]), unless the selection process is coordinated among agents, the agents' joint policy can end up in a worse outcome even though their value functions have reached the optimal values. To address this issue, Claus and Boutilier (1998b) proposed to build belief models about other agents' policies. Similar to fictitious play (Berger, 2007), each agent chooses actions in accordance with its belief about the other agents. Empirical effectiveness, as well as convergence, has been reported for repeated games; however, the convergent equilibrium may not be optimal. In solving this problem, Wang and Sandholm (2003) proposed optimal adaptive learning (OAL) methods that provably converge to the optimal NE almost surely in any team SG. The main novelty of OAL is that it learns the game structure by building so-called weakly acyclic games that eliminate all the joint actions with sub-optimal NE values, and it then applies adaptive play (Young, 1993) to address the equilibrium selection problem for weakly acyclic games specifically. Following this approach, Arslan and Yüksel (2016) proposed decentralised Q-learning algorithms that, with the help of two-timescale analysis (Leslie et al., 2003), converge to an equilibrium policy for weakly acyclic SGs. To avoid sub-optimal equilibria in weakly acyclic SGs, Yongacoglu et al. (2019) further refined the decentralised Q-learners and derived theorems with stronger almost-sure convergence guarantees for optimal policies.

6.1.1 Solutions via Q-function Factorisation

Another vital reason that team games have been repeatedly studied is that solving team games is a crucial step in building distributed AI (DAI) (Gasser and Huhns, 2014; Huhns, 2012). The logic is that if each agent only needs to maintain a Q-function Q^i(s, a^i), which depends on the state and the local action a^i rather than the joint action a, then the combinatorial nature of multi-agent problems can be avoided. Unfortunately, Tan (1993) previously noted that such independent Q-learning methods do not converge in team games. Lauer and Riedmiller (2000) reported similar negative results; however, when the state-transition dynamics are deterministic, independent learning through distributed Q-learning can still obtain a convergence guarantee. No additional expense is needed in comparison to the non-distributed case for computing the optimal policies.

Factorised MDPs (Boutilier et al., 1999) are an effective way to avoid exponential blowups. For a coordination task, if the joint Q-function can be naturally written as

Q = Q_1(a^1, a^2) + Q_2(a^2, a^4) + Q_3(a^1, a^3) + Q_4(a^3, a^4),

then the nested structure can be exploited. For example, Q_1 and Q_3 are irrelevant in finding the optimal a^4; thus, given a^4, Q_1 becomes irrelevant for optimising a^3. Given a^3 and a^4, one can then optimise a^1 and a^2. Inspired by this result, Guestrin et al. (2002a,b) and Kok and Vlassis (2004) studied the idea of coordination graphs, which combine value function approximation with a message-passing scheme by which agents can efficiently find the globally optimal joint action.
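As a minimal illustration of the variable-elimination argument above, the sketch below computes the maximal value of the factorised Q-function Q = Q1(a1, a2) + Q2(a2, a4) + Q3(a1, a3) + Q4(a3, a4) by eliminating one action variable at a time, and verifies the result against brute force; the payoff tables are random placeholders.

```python
import itertools
import numpy as np

# Coordinated action selection for the factorised Q-function
# Q = Q1(a1, a2) + Q2(a2, a4) + Q3(a1, a3) + Q4(a3, a4).
rng = np.random.default_rng(0)
n = 3  # each agent has 3 actions (hypothetical)
Q1, Q2, Q3, Q4 = (rng.standard_normal((n, n)) for _ in range(4))

def joint_q(a1, a2, a3, a4):
    return Q1[a1, a2] + Q2[a2, a4] + Q3[a1, a3] + Q4[a3, a4]

# Variable elimination (max-sum): eliminate a4, then a3, then a2, then a1.
# f4[a2, a3] = max_a4 Q2[a2, a4] + Q4[a3, a4]
f4 = (Q2[:, None, :] + Q4[None, :, :]).max(axis=2)
# f3[a1, a2] = max_a3 Q3[a1, a3] + f4[a2, a3]
f3 = (Q3[:, None, :] + f4[None, :, :]).max(axis=2)
# f2[a1] = max_a2 Q1[a1, a2] + f3[a1, a2]
f2 = (Q1 + f3).max(axis=1)
best_value = f2.max()   # the argmax joint action is recovered by backtracking (omitted)

# Brute force over the joint action space for verification.
brute = max(joint_q(*a) for a in itertools.product(range(n), repeat=4))
assert np.isclose(best_value, brute)
print(best_value)
```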

tralisable tasks in the Dec-POMDP setting, that is, Qi o O, a A, the ∃ i∈{1,...,N} ∀ ∈ ∈ following condition holds. 

arg max_a Q^π(o, a) = ( arg max_{a^1} Q^1(o^1, a^1), ..., arg max_{a^N} Q^N(o^N, a^N) ).   (36)

Eq. (36) suggests that a task is decentralisable only if the local maxima of the individual value functions, taken per agent, amount to the global maximum of the joint value function. Different structural constraints, enforced by particular neural architectures, have been proposed to satisfy this condition. For example, VDN (Sunehag et al., 2018) maintains an additivity structure by setting Q^π(o, a) := ∑_{i=1}^N Q^i(o^i, a^i). QMIX (Rashid et al., 2018) adopts a monotonic structure by means of a mixing network that ensures ∂Q^π(o, a)/∂Q^i(o^i, a^i) ≥ 0, ∀i ∈ {1, ..., N}. QTRAN (Son et al., 2019) introduces a more rigorous learning objective on top of QMIX that is proved to be a sufficient condition for Eq. (36). However, these structural constraints depend heavily on specially designed neural architectures, which makes understanding the representational power (i.e., the approximation error) of the above methods almost infeasible. Another drawback is that the structural constraints also damage agents' efficient exploration during training. To mitigate these issues, Yang et al.

Figure 10: Graphical model of the level-k reasoning model (Wen et al., 2019); the panels correspond to Level 1, Level 2, ..., Level k. The red part is the equivalent graphical model for the multi-agent learning problem. The blue part corresponds to the recursive reasoning steps. The subscript of $a$ stands for the level of thinking, not the time step. The opponent policies are approximated by $\rho^{-i}$. The omitted level-0 model considers opponents that are fully randomised. Agent $i$ rolls out the recursive reasoning about opponents in its mind (the blue area). In the recursion, agents with higher-level beliefs take the best response to the lower-level agents. The higher-level models conduct all the computations that the lower-level models have done; e.g., the level-2 model contains the level-1 model by integrating out $\pi^i_0(a^i \mid s)$.

(2020) proposed Q-DPP, which eradicates the structure constraints by approximating the Q-function through a determinantal point process (DPP) (Kulesza et al., 2012). DPP pushes agents to explore and acquire diverse behaviours; consequently, it leads to natural decomposition of the joint Q-function with no need for a priori structure constraints. In fact, VDN/QMIX/QTRAN prove to be the exceptional cases of Q-DPP.
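To make the decentralisability condition in Eq. (36) concrete, the following is a minimal sketch under a VDN-style additive factorisation, assuming a toy two-agent setting with tabular per-agent utilities and randomly generated numbers; none of the names or values mirror the cited implementations.

```python
import numpy as np

# Minimal sketch: under an additive (VDN-style) factorisation, the joint greedy
# action decomposes into per-agent greedy actions, which is exactly Eq. (36).
rng = np.random.default_rng(0)
n_actions = 4
Q1 = rng.normal(size=n_actions)                 # Q^1(o^1, .) for the current local observation
Q2 = rng.normal(size=n_actions)                 # Q^2(o^2, .)

Q_joint = Q1[:, None] + Q2[None, :]             # VDN: Q^pi(o, a) := Q^1(o^1, a^1) + Q^2(o^2, a^2)

joint_argmax = tuple(int(v) for v in np.unravel_index(Q_joint.argmax(), Q_joint.shape))
local_argmax = (int(Q1.argmax()), int(Q2.argmax()))
assert joint_argmax == local_argmax             # per-agent greedy actions recover the joint greedy action
print("joint argmax:", joint_argmax, "local argmaxes:", local_argmax)
```

The same check fails for a general (non-factorisable) joint Q-table, which is why the structural constraints discussed above are needed in the first place.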

6.1.2 Solutions via Multi-Agent Soft Learning

In single-agent RL, the process of finding the optimal policy can be equivalently transformed into a probabilistic inference problem on a graphical model (Levine, 2018). The pivotal insight is that by introducing an additional binary random variable with $P(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp(R(s_t, a_t))$, which denotes the optimality of the state-action pair at time step $t$, one can draw an equivalence between searching for the optimal policies by RL methods and computing the marginal probability $p(\mathcal{O}_t = 1)$ by probabilistic inference methods, such as message passing or variational inference (Blei et al., 2017). This equivalence between optimal control and probabilistic inference also holds in the multi-agent setting (Grau-Moya et al., 2018; Shi et al., 2019; Tian et al., 2019; Wen et al., 2019, 2018). In the context of SG (see the red part in Figure 10), the optimality variable for each agent $i$ is defined by $p(\mathcal{O}^i_t = 1 \mid \mathcal{O}^{-i}_t = 1, \tau^i_t) \propto \exp\big(r^i(s_t, a^i_t, a^{-i}_t)\big)$, which implies that the optimality of the trajectory $\tau^i_t = (s_0, a^i_0, a^{-i}_0, \dots, s_t, a^i_t, a^{-i}_t)$ depends on whether agent $i$ acts according to its best response against the other agents, and $\mathcal{O}^{-i}_t = 1$ indicates that all other agents are perfectly rational and attempt to maximise their rewards. Therefore, from each agent's perspective, its objective becomes maximising $p(\mathcal{O}^i_{1:T} = 1 \mid \mathcal{O}^{-i}_{1:T} = 1)$. As we assume no knowledge of the optimal policies or of the model of the environment, we treat states and actions as latent variables and apply variational inference (Blei et al., 2017) to approximate this objective, which leads to

$$
\max_{\theta^i} J(\pi_\theta) = \log p\big(\mathcal{O}^i_{1:T} = 1 \mid \mathcal{O}^{-i}_{1:T} = 1\big)
\geq \sum_{t=1}^{T} \mathbb{E}_{s \sim P(\cdot \mid s, a),\, a \sim \pi_\theta(s)} \Big[ r^i\big(s_t, a^i_t, a^{-i}_t\big) + \mathcal{H}\big(\pi_\theta(a^i_t, a^{-i}_t \mid s_t)\big) \Big]. \tag{37}
$$

One major difference from traditional RL is the additional entropy term28 in Eq. (37). Under this new objective, the value function is written as $V^i(s) = \mathbb{E}_{\pi_\theta}\big[Q^i(s_t, a^i_t, a^{-i}_t) - \log \pi_\theta(a^i_t, a^{-i}_t \mid s_t)\big]$, and the corresponding optimal Bellman operator is

$$
\big(H^{\text{soft}} Q\big)\big(s, a^i, a^{-i}\big) = r^i\big(s, a^i, a^{-i}\big) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \Big[ \log \sum_{a'} \exp Q\big(s', a'\big) \Big]. \tag{38}
$$
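As a quick numerical illustration of the log-sum-exp backup in Eq. (38), here is a minimal sketch on a randomly generated joint Q-table for a single state. The temperature parameter is an extra knob introduced here purely for illustration; as it shrinks, the soft value approaches the hard max.

```python
import numpy as np

# Numerically stable log-sum-exp, scaled by a temperature T:
#   T * log sum_a exp(Q(s,a)/T)  ->  max_a Q(s,a)  as  T -> 0
def soft_value(q_values, temperature=1.0):
    z = q_values / temperature
    return temperature * (np.log(np.sum(np.exp(z - z.max()))) + z.max())

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 3)).flatten()   # Q(s, a^i, a^-i) for one state, flattened over joint actions
for T in (1.0, 0.1, 0.01):
    print(f"T={T}: soft value={soft_value(Q, T):.3f}, hard max={Q.max():.3f}")
```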

This process is called soft learning because $\log \sum_{a} \exp Q(s, a) \approx \max_{a} Q(s, a)$. One substantial benefit of developing a probabilistic framework for multi-agent learning is that it can help model bounded rationality (Simon, 1972). Instead of assuming perfect rationality and agents reaching an NE, bounded rationality accounts for situations in which rationality is compromised; it can be constrained by either the difficulty of the decision problem or the agents' own cognitive limitations. One intuitive example is the

28Soft learning is also called maximum-entropy RL (Haarnoja et al., 2018).

psychological experiment of the Keynes beauty contest (Keynes, 1936), in which all players are asked to guess a number between 0 and 100, and the winner is the person whose number is closest to 1/2 of the average of all guesses. Readers are recommended to pause here and think about which number they would guess. Although the NE of this game is 0, the majority of people guess a number between 13 and 25 (Coricelli and Nagel, 2009), which suggests that human beings tend to reason with only 1-2 levels of recursion in strategic games (Camerer et al., 2004), i.e., "I believe how you believe how I believe". Wen et al. (2018) developed the first MARL-powered reasoning model that accounts for bounded rationality, which they called probabilistic recursive reasoning (PR2). The key idea of PR2 is that a dependency structure is assumed when splitting the joint policy

$\pi_\theta$, written as

$$
\pi_\theta\big(a^i, a^{-i} \mid s\big) = \pi_{\theta^i}\big(a^i \mid s\big)\, \rho_{\theta^{-i}}\big(a^{-i} \mid s, a^i\big) \quad \text{(PR2, Level-1)}, \tag{39}
$$

that is, the opponent is considering how the learning agent is going to affect its actions, i.e., a Level-1 model. The unobserved opponent model is approximated by a best-fit

model $\rho_{\theta^{-i}}$ when optimising Eq. (37). In the team game setting, since agents' objectives are fully aligned, the optimal $\rho_{\phi^{-i}}$ has a closed-form solution $\rho_{\phi^{-i}}(a^{-i} \mid s, a^i) \propto \exp\big(Q^i(s, a^i, a^{-i}) - Q^i(s, a^i)\big)$. Following the direction of recursive reasoning, Tian et al. (2019) proposed an algorithm named ROMMEO that splits the joint policy by

$$
\pi_\theta\big(a^i, a^{-i} \mid s\big) = \pi_{\theta^i}\big(a^i \mid s, a^{-i}\big)\, \rho_{\theta^{-i}}\big(a^{-i} \mid s\big) \quad \text{(ROMMEO, Level-1)}, \tag{40}
$$

in which a Level-1 model is built from the learning agent's perspective. Grau-Moya et al. (2018); Shi et al. (2019) introduced a Level-0 model in which no explicit recursive reasoning is considered:

$$
\pi_\theta\big(a^i, a^{-i} \mid s\big) = \pi_{\theta^i}\big(a^i \mid s\big)\, \rho_{\theta^{-i}}\big(a^{-i} \mid s\big) \quad \text{(Level-0)}. \tag{41}
$$

However, they generalised the multi-agent soft learning framework to include the zero-sum setting. Wen et al. (2019) recently proposed a mixture of hierarchical Level-$k$ models in which agents can reason at different recursion levels, and higher-level agents make the best response to lower-level agents (see the blue part in Figure 10). They called this

method generalised recursive reasoning (GR2).

$$
\pi^i_k\big(a^i_k \mid s\big) \propto \int_{a^{-i}_{k-1}} \pi^i_k\big(a^i_k \mid s, a^{-i}_{k-1}\big) \cdot \underbrace{\int_{a^i_{k-2}} \Big[ \rho^{-i}_{k-1}\big(a^{-i}_{k-1} \mid s, a^i_{k-2}\big)\, \pi^i_{k-2}\big(a^i_{k-2} \mid s\big) \Big] \mathrm{d}a^i_{k-2}}_{\text{opponents of level } k-1 \text{ best respond to agent } i \text{ of level } k-2} \mathrm{d}a^{-i}_{k-1}. \quad \text{(GR2, Level-}K\text{)} \tag{42}
$$

In GR2, practical multi-agent soft actor-critic methods with convergence guarantees were introduced to make large-$K$ reasoning tractable.
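As an illustration of the level-k construction behind PR2/GR2, here is a minimal sketch on a two-player matrix game, assuming known payoff matrices and pure (hard) best responses; the actual methods learn soft, probabilistic opponent models from data, so this only conveys the recursion sketched in Figure 10.

```python
import numpy as np

# Level-k reasoning on a hypothetical 2x2 matrix game: level-0 is uniform,
# and level-k best responds to the opponent's level-(k-1) model.
R1 = np.array([[3.0, 0.0], [5.0, 1.0]])   # row player's payoffs (illustrative)
R2 = R1.T                                  # symmetric payoffs for the opponent

def best_response(R, opp_policy):
    # pure best response to a mixed opponent policy, as a one-hot vector
    return np.eye(R.shape[0])[int(np.argmax(R @ opp_policy))]

def level_k_policy(R_self, R_opp, k):
    if k == 0:
        return np.ones(R_self.shape[0]) / R_self.shape[0]       # level-0: fully randomised
    return best_response(R_self, level_k_policy(R_opp, R_self, k - 1))

for k in range(4):
    print("level", k, "policy:", level_k_policy(R1, R2, k))
```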

6.2 Dec-POMDPs

Dec-POMDP is a stochastic team game with partial observability. However, optimally solving Dec-POMDPs is a challenging combinatorial problem that is NEXP-complete (Bernstein et al., 2002). As the horizon increases, the doubly exponential growth in the number of possible policies quickly makes solution methods intractable. Most of the solution algorithms for Dec-POMDPs, including the above VDN/QMIX/QTRAN/Q-DPP, are based on the learning paradigm of centralised training with decentralised execution (CTDE) (Oliehoek et al., 2016). CTDE methods assume a centralised controller that can access observations across all agents during training. A typical implementation is through a centralised critic with a decentralised actor (Lowe et al., 2017). In representing agents' local policies, stochastic finite-state controllers and a correlation device are commonly applied (Bernstein et al., 2009). Through this representation, Dec-POMDPs can be formulated as non-linear programmes (Amato et al., 2010); this allows the use of a wide range of off-the-shelf optimisation algorithms. Dibangoye and Buffet (2018); Dibangoye et al. (2016); Szer et al. (2005) introduced a transformation from the Dec-POMDP into a continuous-state MDP, named the occupancy-state MDP (oMDP). The occupancy state is essentially a distribution over hidden states and the joint histories of observation-action pairs. In contrast to the standard MDP, where the agent learns an optimal value function that maps histories (or states) to real values, the learner in oMDP learns an optimal value function that maps occupancy states and joint actions to real values (the corresponding policy is called a plan). These value functions in oMDP are piece-wise linear and convex. Importantly, the benefit of restricting attention to the occupancy state is that the resulting algorithms are guaranteed to converge to a near-

optimal plan for any finite Dec-POMDP with a probability of one, while traditional RL methods, such as REINFORCE, may only converge towards a local optimum. In addition to CTDE methods, famous approximation solutions to Dec-POMDPs include the Monte Carlo policy iteration method (Wu et al., 2010), which enjoys linear-time complexity in terms of the number of agents, planning by maximum-likelihood methods (Toussaint et al., 2008; Wu et al., 2013), which easily scales up to thousands of agents, and a method that decentralises the POMDP by maintaining shared memory among agents (Nayyar et al., 2013).
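The CTDE paradigm mentioned above can be summarised by a skeletal sketch: decentralised actors condition only on their local observations, while the centralised critic may condition on all observations and actions during training. The linear functions, dimensions, and random data below are placeholders, not the architecture of any cited method (e.g., the centralised critic of Lowe et al., 2017 is a neural network trained by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, n_actions = 2, 4, 3

# Decentralised actors: each agent maps only its *local* observation to a policy.
actor_w = [rng.normal(size=(obs_dim, n_actions)) for _ in range(n_agents)]
def act(i, obs_i):
    logits = obs_i @ actor_w[i]
    p = np.exp(logits - logits.max()); p /= p.sum()
    return int(rng.choice(n_actions, p=p))

# Centralised critic: during training it may see *all* observations and actions.
critic_w = rng.normal(size=n_agents * obs_dim + n_agents)
def central_q(all_obs, all_actions):
    return float(np.concatenate([np.concatenate(all_obs), np.asarray(all_actions, float)]) @ critic_w)

obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
actions = [act(i, obs[i]) for i in range(n_agents)]
print("joint action:", actions, "centralised Q estimate:", round(central_q(obs, actions), 3))
```

At execution time only the actors are kept, so each agent can act from its own observation without the centralised critic.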

6.3 Networked Multi-Agent MDPs

A rapidly growing area in the optimisation domain for addressing decentralised learning for cooperative tasks is the networked multi-agent MDP (M-MDP). In the context of M-MDP, agents are considered heterogeneous rather than homogeneous; they have different reward functions but still form a team to maximise the team-average reward $\bar{R} = \frac{1}{N}\sum_{i=1}^{N} R^i(s, a, s')$. Furthermore, in M-MDP, the centralised controller is assumed to be non-existent; instead, agents can only exchange information with their neighbours in a time-varying communication network defined by $G_t = ([N], E_t)$, where $E_t$ represents the set of all communicative links between any two of the $N$ neighbouring agents at time step $t$. The states and joint actions are assumed to be globally observable, but each agent's reward is only locally observable to itself. Compared to stochastic team games, this setting is believed to be more realistic for real-world applications such as smart grids (Dall'Anese et al., 2013) or transport management (Adler and Blue, 2002). The cooperative goal of the agents in M-MDP is to maximise the team-average cumulative discounted reward obtained by all agents over the network, that is,

$$
\max_{\pi} \ \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}\sum_{t \geq 0} \gamma^t R^i_t(s_t, a_t)\Big]. \tag{43}
$$

Accordingly, under the joint policy $\pi = \prod_{i \in \{1, \dots, N\}} \pi^i(a^i \mid s)$, the Q-function is defined as

$$
Q^\pi(s, a) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\, s_t \sim P(\cdot \mid s_t, a_t)}\Big[\frac{1}{N}\sum_{i=1}^{N}\sum_{t \geq 0} \gamma^t R^i_t(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big]. \tag{44}
$$

To optimise Eq. (43), the optimal Bellman operator is written as

$$
\big(H^{\text{M-MDP}} Q\big)(s, a) = \frac{1}{N}\sum_{i=1}^{N} R^i(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a' \in A} Q(s', a')\Big]. \tag{45}
$$

However, since each agent knows only its own reward, the agents do not share the estimate of the Q-function but rather maintain their own copies. Therefore, from each agent's perspective, the individual optimal Bellman operator is written as

$$
\big(H^{\text{M-MDP},i} Q^i\big)(s, a) = R^i(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\max_{a' \in A} Q^i(s', a')\Big]. \tag{46}
$$

To solve for the optimal joint policy $\pi^*$, the agents must reach consensus over the global optimal policy estimation, that is, if $Q^1 = \cdots = Q^N = Q^*$, we know

$$
\big(H^{\text{M-MDP}} Q^*\big)(s, a) = \frac{1}{N}\sum_{i=1}^{N}\big(H^{\text{M-MDP},i} Q^i\big)(s, a). \tag{47}
$$

To satisfy Eq. (47), Zhang et al. (2018b) proposed a method based on neural fitted-Q iteration (FQI) (Riedmiller, 2005) in the batch RL setting (Lange et al., 2012). Specif-

ically, let $\mathcal{F}_\theta$ denote the parametric function class of neural networks that approximate Q-functions, let $\mathcal{D} = \{(s_k, a_k, s'_k)\}$ be the replay buffer that contains all the transition data available to all agents, and let $\{R^i_k\}$ be the local rewards known only to each agent $i$. The objective of FQI can be written as

$$
\min_{f \in \mathcal{F}_\theta} \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2K}\sum_{k=1}^{K} \Big[ y^i_k - f(s_k, a_k; \theta) \Big]^2, \quad \text{with } y^i_k = R^i_k + \gamma \cdot \max_{a \in A} Q^i_k(s'_k, a). \tag{48}
$$

In each iteration, $K$ samples are drawn from $\mathcal{D}$. Since $y^i_k$ is known only to each agent $i$, Eq. (48) becomes a typical consensus optimisation problem (i.e., consensus must be reached for $\theta$) (Nedic and Ozdaglar, 2009). Multiple effective distributed optimisers can be applied to solve this problem, including the DIGing algorithm (Nedic et al., 2017). Let $g^i(\theta^i) = \frac{1}{2K}\sum_{k=1}^{K}\big[y^i_k - f(s_k, a_k; \theta^i)\big]^2$, let $\alpha$ be the learning rate, and let $G([N], E_l)$ be the topology of the network in the $l$-th iteration; the DIGing algorithm designs the gradient

updates for each agent $i$ as

$$
\theta^i_{l+1} = \sum_{j=1}^{N} E_l(i, j) \cdot \theta^j_l - \alpha \cdot \rho^i_l, \qquad \rho^i_{l+1} = \sum_{j=1}^{N} E_l(i, j) \cdot \rho^j_l + \nabla g^i\big(\theta^i_{l+1}\big) - \nabla g^i\big(\theta^i_l\big). \tag{49}
$$

Intuitively, Eq. (49) implies that if all agents aim to reach a consensus on $\theta$, they must incorporate a weighted combination of their neighbours' estimates into their own gradient updates. However, due to the usage of neural networks, the agents may not reach an exact consensus. Zhang et al. (2018b) also studied a finite-sample bound, in a high-probability sense, that quantifies the generalisation error of the proposed neural FQI algorithm. The idea of reaching consensus can be directly applied to solving Eq. (43) via policy-gradient methods. Zhang et al. (2018c) proposed an actor-critic algorithm in which the global Q-function is approximated individually by each agent. On the basis of Eq. (15), the critic of $Q^{i,\pi_\theta}(s, a)$ is modelled by another neural network parameterised by $\omega^i$, i.e., $Q^i(\cdot, \cdot\,; \omega^i)$, and the parameter $\omega^i$ is updated as

$$
\omega^i_{t+1} = \sum_{j=1}^{N} E_t(i, j) \cdot \Big( \omega^j_t + \alpha \cdot \delta^j_t \cdot \nabla_\omega Q^j_t(\omega^j_t) \Big), \tag{50}
$$

where $\delta^j_t = R^j_t + \gamma \cdot \max_{a \in A} Q^j_t(s'_t, a; \omega^j_t) - Q^j_t(s_t, a_t; \omega^j_t)$ is the TD error. Similar to Eq. (49), the update in Eq. (50) is a weighted sum of all the neighbouring gradients. The same group of authors later extended this approach to cover continuous action spaces, in which the deterministic policy gradient method of Eq. (16) is applied (Zhang et al., 2018a). Moreover, Zhang et al. (2018c) and Zhang et al. (2018a) applied a linear function approximation to achieve an almost-sure convergence guarantee. Following this thread, Suttle et al. (2019) and Zhang and Zavlanos (2019) extended the actor-critic method to an off-policy setting, rendering more data-efficient MARL algorithms.
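A minimal sketch of the gradient-tracking consensus step in Eq. (49) is given below, assuming a fixed ring network with Metropolis-style mixing weights and simple quadratic local losses standing in for the local FQI objectives; the point is only that mixing neighbours' estimates while tracking local gradients drives every local copy towards the minimiser of the average objective.

```python
import numpy as np

N, dim, alpha = 4, 3, 0.05
rng = np.random.default_rng(0)
targets = rng.normal(size=(N, dim))              # each agent's private optimum

def local_grads(theta):                          # gradient of g^i = 0.5 * ||theta^i - target^i||^2
    return theta - targets

# Doubly stochastic mixing matrix E(i, j) over a ring: self + two neighbours.
E = np.zeros((N, N))
for i in range(N):
    E[i, i], E[i, (i - 1) % N], E[i, (i + 1) % N] = 0.5, 0.25, 0.25

theta = rng.normal(size=(N, dim))                # local parameter copies theta^i
rho = local_grads(theta)                         # gradient-tracking variables rho^i
for _ in range(1000):
    theta_next = E @ theta - alpha * rho         # mix neighbours' estimates, step along tracked gradient
    rho = E @ rho + local_grads(theta_next) - local_grads(theta)
    theta = theta_next

print("agents' parameters:\n", theta.round(3))
print("minimiser of the average objective:", targets.mean(axis=0).round(3))
```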

6.4 Stochastic Potential Games

The potential game (PG) first appeared in Monderer and Shapley (1996). The physical meaning of Eq. (21) is that if any agent changes its policy unilaterally, the change in its reward is reflected in the potential function shared by all agents. A PG is guaranteed to have a pure-strategy NE, a desirable property that does not generally

hold in normal-form games. Many efforts have since been dedicated to finding the NE of (static) PGs (Lã et al., 2016), among which fictitious play (Berger, 2007) and generalised weakened fictitious play (Leslie and Collins, 2006) are probably the most common solutions. Generally, stochastic PGs (SPGs)29 can be regarded as the "single-agent component" of a multi-agent stochastic game (Candogan et al., 2011) since all agents' interests in SPGs are described by a single potential function. However, the analysis of SPGs is exceptionally sparse. Zazo et al. (2015) studied an SPG with deterministic transition dynamics in which agents consider only open-loop policies30. In fact, generalising a PG to the stochastic setting is further complicated because agents must now execute policies that depend on the state and consider the actions of other players. In this setting, González-Sánchez and Hernández-Lerma (2013) investigated a type of SPG for which they derived a sufficient condition for the NE, but it requires each agent's reward function to be a concave function of the state and the transition function to be invertible. Macua et al. (2018) studied a general form of SPG in which a closed-loop NE can be found. Although they demonstrated the equivalence between solving for the closed-loop NE and solving a single-agent optimal control problem, the agents' policies must depend only on disjoint subsets of the components of the state. Notably, both González-Sánchez and Hernández-Lerma (2013) and Macua et al. (2018) proposed centralised methods; optimisation over the joint action space surely results in combinatorial complexity when solving SPGs. In addition, they do not consider an RL setting in which the system is a priori unknown. The work of Mguni (2020) is probably the most comprehensive treatment of SPGs in a model-free setting. Similar to Macua et al. (2018), the authors revealed that the NE of the PG in pure strategies can be found by solving a dual-form MDP, and they reached this conclusion without the disjoint-state assumption; instead, the transition dynamics and the potential function must be known. Specifically, they provided an algorithm to estimate the potential function based on reward samples. To avoid combinatorial explosion, they also proposed a distributed policy-gradient method based on generalised weakened fictitious play (Leslie and Collins, 2006) that has linear-time complexity. Recently, Mazumdar and Ratliff (2018) studied the dynamics of gradient-based learn-

29As with team games, the stochastic PG is also called the dynamic PG or the Markov PG. 30Open loop means that agents' actions are a function of time only. By contrast, closed-loop policies take the state into account. In deterministic systems, these policies can be optimal and coincide in value. For a stochastic system, an open-loop strategy is unlikely to be optimal since it cannot adapt to state transitions.

ing on potential games. They found that in a general superclass of potential games named Morse-Smale games (Hirsch, 2012), the limit sets of competitive gradient-based learning with stochastic updates are attractors almost surely, and those attractors are either local Nash equilibria or non-Nash locally asymptotically stable equilibria but not saddle points.
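To make the defining property of a PG concrete, the following sketch builds a hypothetical 2x2 exact potential game and runs alternating best responses; the payoffs are illustrative only, and the construction (each reward equals the potential plus a term that does not depend on that player's own action) is one simple way to satisfy the property described above, so every unilateral deviation changes the player's reward and the potential by the same amount.

```python
import numpy as np

phi = np.array([[4.0, 0.0], [1.0, 2.0]])        # shared potential function (rows: a1, cols: a2)
R1 = phi + np.array([1.0, 0.0])[None, :]        # player 1's reward; extra term depends on a2 only
R2 = phi + np.array([0.0, 2.0])[:, None]        # player 2's reward; extra term depends on a1 only

# verify the potential property for unilateral deviations of both players
assert np.allclose(R1[0, :] - R1[1, :], phi[0, :] - phi[1, :])
assert np.allclose(R2[:, 0] - R2[:, 1], phi[:, 0] - phi[:, 1])

a1, a2 = 1, 0                                   # arbitrary starting joint action
for _ in range(10):                             # alternating best responses climb the potential
    a1 = int(np.argmax(R1[:, a2]))
    a2 = int(np.argmax(R2[a1, :]))
print("pure-strategy NE reached:", (a1, a2), "potential value:", phi[a1, a2])
```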

7 Learning in Zero-Sum Games

Zero-sum games represent a competitive relationship among the players of a game. Solving three-player zero-sum games is believed to be PPAD-hard (Daskalakis and Papadimitriou, 2005). In the two-player case, the NE $(\pi^{1,*}, \pi^{2,*})$ is essentially a saddle point, $\mathbb{E}_{\pi^1, \pi^{2,*}}[R] \leq \mathbb{E}_{\pi^{1,*}, \pi^{2,*}}[R] \leq \mathbb{E}_{\pi^{1,*}, \pi^2}[R],\ \forall \pi^1, \pi^2$, and can be formulated as the LP problem in Eq. (51).

$$
\begin{aligned}
\min \ & U_1^* \\
\text{s.t.} \ & \sum_{a^2 \in A^2} R^1(a^1, a^2) \cdot \pi^2(a^2) \leq U_1^*, \quad \forall a^1 \in A^1 \\
& \sum_{a^2 \in A^2} \pi^2(a^2) = 1 \\
& \pi^2(a^2) \geq 0, \quad \forall a^2 \in A^2
\end{aligned} \tag{51}
$$

Eq. (51) is considered from the min-player's perspective. One can also derive the dual-form LP from the max-player's perspective. In discrete games, the minimax theorem (Von Neumann and Morgenstern, 1945) is a simple consequence of the strong duality theorem of LP31 (Matousek and Gärtner, 2007),

$$
\min_{\pi^1} \max_{\pi^2} \mathbb{E}_{\pi^1, \pi^2}\big[R\big] = \max_{\pi^2} \min_{\pi^1} \mathbb{E}_{\pi^1, \pi^2}\big[R\big], \tag{52}
$$

which suggests that whether the min player acts first or the max player acts first does not matter. However, the minimax theorem does not hold in general for multi-player zero-sum continuous games in which the reward function is nonconvex-nonconcave. In fact, a barrier to tractability exists for multi-player zero-sum games and for two-player zero-sum games with continuous states and actions.

31Solving zero-sum games is equivalent to solving an LP; Dantzig (1951) also proved the correctness of the other direction, that is, any LP can be reduced to a zero-sum game, though some degenerate solutions need careful treatment (Adler, 2013).
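The LP in Eq. (51) can be solved directly with an off-the-shelf solver. Below is a minimal sketch for rock-paper-scissors using scipy.optimize.linprog, written from the min-player's perspective as in Eq. (51); the payoff matrix and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# R1[a1, a2] is the payoff to player 1 (rows); player 2 (columns) picks pi^2 to
# minimise the upper bound U on player 1's expected payoff, exactly as in Eq. (51).
R1 = np.array([[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]], dtype=float)        # rock-paper-scissors
n1, n2 = R1.shape

c = np.concatenate([np.zeros(n2), [1.0]])       # decision variables: [pi^2, U]; minimise U
A_ub = np.hstack([R1, -np.ones((n1, 1))])       # R1 @ pi^2 - U <= 0, one row per a^1
b_ub = np.zeros(n1)
A_eq = np.concatenate([np.ones(n2), [0.0]])[None, :]   # probabilities sum to one
b_eq = [1.0]
bounds = [(0, None)] * n2 + [(None, None)]      # pi^2 >= 0, U free

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("min-player strategy:", res.x[:n2].round(3), "game value:", round(res.x[-1], 3))
```

For rock-paper-scissors the solver returns the uniform strategy and a game value of zero, matching the symmetric NE.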

7.1 Discrete State-Action Games

Similar to the single-agent MDP setting, value-based methods aim to find an optimal value function, which, in the context of zero-sum SGs, corresponds to the minimax NE of the game. In two-player zero-sum SGs with discrete states and actions, we know $V^{1,\pi^1,\pi^2} = -V^{2,\pi^1,\pi^2}$, and by the minimax theorem (Von Neumann and Morgenstern, 1945), the optimal value function is $V^* = \max_{\pi^2}\min_{\pi^1} V^{1,\pi^1,\pi^2} = \min_{\pi^1}\max_{\pi^2} V^{1,\pi^1,\pi^2}$. In each stage game, defined by $Q^1 = -Q^2$, the optimal value can be obtained by solving a matrix zero-sum game through the linear program in Eq. (51). Shapley (1953) introduced the first value-iteration method, written as

$$
\big(H^{\text{Shapley}} V\big)(s) = \min_{\pi^1 \in \Delta(A^1)} \max_{\pi^2 \in \Delta(A^2)} \mathbb{E}_{a^1 \sim \pi^1,\, a^2 \sim \pi^2,\, s' \sim P}\Big[ R^1(s, a^1, a^2) + \gamma \cdot V(s') \Big], \tag{53}
$$

and proved that $H^{\text{Shapley}}$ is a contraction mapping (in the sense of the infinity norm) for solving two-player zero-sum SGs. In other words, assuming the transition dynamics and reward function are known, the value-iteration method generates a sequence of value functions $\{V_t\}_{t \geq 0}$ that asymptotically converges to the fixed point $V^*$, and the corresponding policies converge to the NE policies $\pi^* = (\pi^{1,*}, \pi^{2,*})$. In contrast to Shapley's model-based value-iteration method, Littman (1994) proposed a model-free Q-learning method, Minimax-Q, that extends the classic Q-learning algorithm defined in Eq. (13) to solve zero-sum SGs. Specifically, in Minimax-Q, Eq. (14) can be equivalently written as

$$
\text{eval}\big[Q^1(s_{t+1}, \cdot)\big] = -\,\text{eval}\big[Q^2(s_{t+1}, \cdot)\big] = \min_{\pi^1 \in \Delta(A^1)} \max_{\pi^2 \in \Delta(A^2)} \mathbb{E}_{a^1 \sim \pi^1,\, a^2 \sim \pi^2}\Big[ Q^1(s_{t+1}, a^1, a^2) \Big]. \tag{54}
$$

The Q-learning update rule of Minimax-Q is exactly the same as that in Eq. (13). Minimax-Q can be considered an approximation algorithm for computing the fixed point $Q^*$ of the Bellman operator in Eq. (20) through stochastic sampling. Importantly, it assumes no knowledge about the environment. Szepesvári and Littman (1999) showed that, under assumptions similar to those for Q-learning (Watkins and Dayan, 1992), the Bellman operator of Minimax-Q is a contraction mapping, and the stochastic updates made by Minimax-Q eventually lead to a unique fixed point that corresponds

to the NE value. In addition to the tabular-form Q-function in Minimax-Q, various Q-function approximators have been developed. For example, Lagoudakis and Parr (2003) studied factorised linear architectures for Q-function representation. Yang et al. (2019c) adopted deep neural networks and derived a rigorous finite-sample error bound. Zhang et al. (2018b) also derived a finite-sample bound for linear function approximators in competitive M-MDPs.
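A minimal sketch of Shapley's value iteration in Eq. (53) on a tiny, randomly generated two-state zero-sum SG is given below, assuming the reward and transition model are known and solving each stage game with the matrix-game LP of Eq. (51) (the row player maximises); Minimax-Q would replace this known model with sampled transitions and stochastic updates.

```python
import numpy as np
from scipy.optimize import linprog

n_states, nA1, nA2, gamma = 2, 2, 2, 0.9
rng = np.random.default_rng(0)
R = rng.uniform(-1, 1, size=(n_states, nA1, nA2))                # reward to the row (max) player
P = rng.dirichlet(np.ones(n_states), size=(n_states, nA1, nA2))  # P(s' | s, a1, a2)

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M (row player maximises), via the LP of Eq. (51)."""
    n1, n2 = M.shape
    c = np.concatenate([np.zeros(n2), [1.0]])                    # variables: [column strategy, U]
    res = linprog(c,
                  A_ub=np.hstack([M, -np.ones((n1, 1))]), b_ub=np.zeros(n1),
                  A_eq=np.concatenate([np.ones(n2), [0.0]])[None, :], b_eq=[1.0],
                  bounds=[(0, None)] * n2 + [(None, None)])
    return res.x[-1]

V = np.zeros(n_states)
for _ in range(100):                      # value iteration: solve one stage game per state
    Q = R + gamma * (P @ V)               # Q[s, a1, a2] = R + gamma * E[V(s')]
    V = np.array([matrix_game_value(Q[s]) for s in range(n_states)])
print("minimax state values:", V.round(3))
```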

7.2 Continuous State-Action Games

Recently, the challenge of training generative adversarial networks (GANs) (Goodfellow et al., 2014a) has ignited tremendous research interest in understanding policy-gradient methods in two-player continuous games, specifically, games with a continuous state-action space and a nonconvex-nonconcave loss landscape. In GANs, two models parameterised by neural networks, the generator G and the discriminator D, play a zero-sum game. In this game, the generator attempts to generate data that "look" authentic such that the discriminator cannot tell the difference from the true data; on the other hand, the discriminator tries not to be deceived by the generator. The loss function in this scenario is written as

$$
\min_{\theta_G \in \mathbb{R}^d} \max_{\theta_D \in \mathbb{R}^d} f\big(\theta_G, \theta_D\big) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_D}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(z))\big)\big], \tag{55}
$$

where $\theta_G$ and $\theta_D$ represent the neural network parameters and $z$ is a random signal serving as the input to the generator. In searching for the NE, one naive approach is to update both

θG and θD by simultaneously implementing the gradient-descent-ascent (GDA) updates with the same step size in Eq. (55). This approach is equivalent to a MARL algorithm in which both agents are applying policy-gradient methods. With trivial adjustments to the step size (Bowling, 2005; Bowling and Veloso, 2002; Zhang and Lesser, 2010), GDA meth- ods can work effectively in two-player two-action (thus convex-concave) games. However, in the nonconvex-nonconcave case, where the minimax theorem no longer holds, GDA methods are notoriously flawed from three aspects. First, GDA algorithms may not con- verge at all (Balduzzi et al., 2018a; Daskalakis and Panageas, 2018; Mertikopoulos et al.,

2018), resulting in limit cycles32 in which even the time average33 does not coincide with the NE (Mazumdar et al., 2019a). Second, there exist undesired stable stationary points for GDA algorithms that are not local optima of the game (Adolphs et al., 2019; Mazumdar et al., 2019a). Third, there exist games whose equilibria are not attractors of GDA methods at all (Mazumdar et al., 2019a). These problems are partly caused by the intransitive dynamics (e.g., a typical intransitive game is the rock-paper-scissors game) that are inherent in zero-sum games (Balduzzi et al., 2018a; Omidshafiei et al., 2020) and by the fact that each agent may have a non-smooth objective function. In fact, even in simple linear-quadratic games, the reward function cannot satisfy the smoothness condition34 globally, and the games are surprisingly not convex either (Fazel et al., 2018; Mazumdar et al., 2019a; Zhang et al., 2019c). Three mainstream approaches have been followed to develop algorithms that have at least a local convergence guarantee. One natural idea is to make the inner loop solvable to a reasonably high accuracy and then focus on a simpler type of game. In other words, the algorithm tries to find a stationary point of the function $\Phi(\cdot) := \max_{\theta_D \in \mathbb{R}^d} f(\cdot, \theta_D)$ instead of solving Eq. (55) directly. For example, by considering games with a nonconvex-(strongly-)concave loss landscape, Kong and Monteiro (2019); Lin et al. (2019); Lu et al. (2020a); Nouiehed et al. (2019); Rafique et al. (2018); Thekumparampil et al. (2019) presented an affirmative answer that GDA methods can converge to a stationary point in the

outer loop of optimising $\Phi(\cdot) := \max_{\theta_D \in \mathbb{R}^d} f(\cdot, \theta_D)$. Based on this understanding, they developed various GDA variants that apply the "best response" in the inner loop while maintaining an inexact gradient descent in the outer loop. We refer to Lin et al. (2019) [Table 1] for a detailed summary of the time complexity of the above methods. The second mainstream idea is to shift the equilibrium of interest from the NE, which is induced by simultaneous gradient updates, to the Stackelberg equilibrium, which is a solution concept in leader-follower (i.e., alternating-update) games. Jin et al. (2019) introduced the concept of the local Stackelberg equilibrium, named local minimax, based on which they established a connection to GDA methods by showing that all stable limit points of GDA are exactly local minimax points. Fiez et al. (2019) also built connections between the NE and the Stackelberg equilibrium by formulating the conditions under which

32A limit cycle is a term from the study of dynamical systems that describes oscillatory behaviour. In game theory, an example of limit cycles in the strategy space can be found in the rock-paper-scissors game. 33In two-player two-action games, Singh et al. (2000) showed that the time-average payoffs converge to the NE value even if the policies themselves do not. 34A differentiable function is said to be smooth if its gradients are continuous.

attracting points of GDA dynamics are Stackelberg equilibria in zero-sum games. When the loss function is bilinear, theoretical evidence was found that alternating updates converge faster than simultaneous GDA methods (Zhang and Yu, 2019). The third mainstream idea is to analyse the loss landscape from a game-theoretic perspective and to design corresponding algorithms that mitigate oscillatory behaviour. Compared to the previous two mainstream ideas, which have generated more theoretical insights than applicable algorithms, works within this stream demonstrate strong empirical improvements in training GANs. Mescheder et al. (2017) investigated the game Hessian and identified that issues with its eigenvalues trigger the limit cycles. As a result, they proposed a new type of update rule based on consensus optimisation, together with a convergence guarantee to a local NE in smooth two-player zero-sum games. Adolphs et al. (2019) leveraged the curvature information of the loss landscape to propose algorithms in which all stable limit points are guaranteed to be local NEs. Similarly, Mazumdar et al. (2019b) took advantage of the differential structure of the game and constructed an algorithm for which the local NEs are the only attracting fixed points. In addition, Daskalakis et al. (2017); Mertikopoulos et al. (2018) addressed the issue of limit-cycling behaviour in training GANs by proposing the technique of optimistic mirror descent (OMD). OMD achieves a last-iterate convergence guarantee in bilinear convex-concave games. Specifically, at each time step, OMD adjusts the gradient of that time

step by considering the opponent's policy at the next time step. Let $M_{t+1}$ be the predictor of the next-iteration gradient35; we can write OMD as follows.

$$
\begin{aligned}
\theta_{G, t+1} &= \theta_{G, t} + \alpha \cdot \big( \nabla_{\theta_G, t} f(\theta_G, \theta_D) + M_{\theta_G, t+1} - M_{\theta_G, t} \big) \\
\theta_{D, t+1} &= \theta_{D, t} - \alpha \cdot \big( \nabla_{\theta_D, t} f(\theta_G, \theta_D) + M_{\theta_D, t+1} - M_{\theta_D, t} \big)
\end{aligned} \tag{56}
$$

In fact, the pivotal idea of opponent prediction in OMD, developed in the optimisation domain, resembles the idea of approximate policy prediction in the MARL domain (Foerster et al., 2018a; Zhang and Lesser, 2010). Thus far, the most promising results are probably those of Bu et al. (2019) and Zhang et al. (2019c), which reported the first results on solving zero-sum LQ games with a global convergence guarantee. Specifically, Zhang et al. (2019c) developed the solution through projected nested-gradient methods, while Bu et al. (2019) solved the problem through

35In practice, it is usually set as the last iteration gradient.

a projection-free Stackelberg leadership model. Both of the models achieve a sublinear rate for convergence.
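The contrast between plain simultaneous GDA and the optimistic correction of Eq. (56) can be seen on the simplest bilinear game f(x, y) = x·y, whose unique NE is (0, 0). The sketch below takes the predictor to be the last-iteration gradient (footnote 35); step sizes and iteration counts are arbitrary, and the sign convention follows the usual min-max formulation rather than Eq. (56) verbatim.

```python
import numpy as np

# Plain simultaneous GDA on f(x, y) = x * y: the iterates spiral away from (0, 0).
x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    gx, gy = y, x                              # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy            # min player descends, max player ascends
print("GDA distance from NE:", round(float(np.hypot(x, y)), 3))

# Optimistic update: gradient plus (new predictor minus old predictor), i.e. 2*g_t - g_{t-1}.
x, y = 1.0, 1.0
prev_gx, prev_gy = y, x
for _ in range(200):
    gx, gy = y, x
    x = x - lr * (gx + (gx - prev_gx))
    y = y + lr * (gy + (gy - prev_gy))
    prev_gx, prev_gy = gx, gy
print("optimistic distance from NE:", round(float(np.hypot(x, y)), 3))
```

Running the two loops shows the first distance growing and the second shrinking towards zero, which is the last-iterate convergence behaviour attributed to OMD above.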

7.3 Extensive-Form Games

As briefly introduced in Section 3.4, zero-sum EFGs with imperfect information can be efficiently solved via LP in sequence-form representations (Koller and Megiddo, 1992, 1996). However, these approaches are limited to solving only small-scale problems (e.g., games with $\mathcal{O}(10^7)$ information states). In fact, considerable additional effort is needed to address real-world games (e.g., limit Texas hold'em, which has $\mathcal{O}(10^{18})$ game states); to name a few, Monte Carlo tree search (MCTS) techniques36 (Browne et al., 2012; Cowling et al., 2012; Silver et al., 2016), isomorphic abstraction techniques (Billings et al., 2003; Gilpin and Sandholm, 2006), and iterative (policy) gradient-based approaches (Gilpin et al., 2007; Gordon, 2007; Zinkevich, 2003). A central idea of iterative policy gradient-based methods is minimising regret37. A learning rule achieves no-regret, also called Hannan consistency in game-theoretical terms (Hannan, 1957), if, intuitively speaking, against any set of opponents it yields a payoff that is no less than the payoff the learning agent could have obtained by playing any one of its pure strategies in hindsight. Recall the reward function under a given policy π = (πi, π−i) in Eq. (27); the (average) regret of player i is defined by:

$$
\text{Reg}^i_T = \frac{1}{T} \max_{\pi^i} \sum_{t=1}^{T} \Big[ R^i\big(\pi^i, \pi^{-i}_t\big) - R^i\big(\pi^i_t, \pi^{-i}_t\big) \Big]. \tag{57}
$$

36Notably, though MCTS methods such as UCT (Kocsis and Szepesvári, 2006) work remarkably well in turn-based EFGs, such as Go and chess, they do not trivially converge to an NE in (even perfect-information) simultaneous-move games (Schaeffer et al., 2009). See Lisy et al. (2013) for a rigorous remedy. 37One can regard minimising regret as one solution concept for multi-agent learning problems, similar to reward maximisation in single-agent learning.

contrast, the Minimax-Q algorithm in Eq. (54) is not Hannan consistent because, if the opponent plays a sub-optimal strategy, Minimax-Q is unable to exploit the opponent due to its over-conservativeness in over-estimating the opponent. An important result about regret is that, in a zero-sum game at time $T$, if both players' average regret is less than $\epsilon$, then their average strategies constitute a $2\epsilon$-NE of the game (Zinkevich et al., 2008, Theorem 2). In general-sum games, the average strategies of $\epsilon$-regret algorithms reach an $\epsilon$-coarse correlated equilibrium of the game (Michael, 2020, Theorem 6.3.1). This result essentially implies that regret-minimising algorithms (that is, algorithms with Hannan consistency) applied in a self-play manner can be used as a general technique to approximate the NE of zero-sum games. Building upon this finding, two families of methods have been developed, namely, fictitious-play-type methods (Berger, 2007) and counterfactual regret minimisation (Zinkevich et al., 2008), which lay the theoretical foundations for modern techniques to solve real-world games.
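As a concrete instance of a no-regret procedure in self-play, the following sketch runs regret matching on rock-paper-scissors and reports the average strategies, which approach the uniform NE. Regret matching is the building block that CFR applies at every information set; the full-information matrix-game setting here is a deliberate simplification.

```python
import numpy as np

R = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)   # row player's payoff
cum_regret = [np.zeros(3), np.zeros(3)]
avg_strategy = [np.zeros(3), np.zeros(3)]
T = 10000

def strategy(regret):
    # regret matching: play in proportion to positive cumulative regret
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3

for _ in range(T):
    pi = [strategy(cum_regret[0]), strategy(cum_regret[1])]
    u0 = R @ pi[1]            # expected payoff of each pure action for the row player
    u1 = -R.T @ pi[0]         # and for the column player (zero-sum)
    cum_regret[0] += u0 - pi[0] @ u0
    cum_regret[1] += u1 - pi[1] @ u1
    avg_strategy[0] += pi[0]
    avg_strategy[1] += pi[1]

print("average strategies:", [np.round(s / T, 3) for s in avg_strategy])   # roughly (1/3, 1/3, 1/3)
```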

7.3.1 Variations of Fictitious Play

Fictitious play (FP) (Berger, 2007) is one of the oldest learning procedures in game theory that is provably convergent for zero-sum games, potential games, and two-player n-action games with generic payoffs. In FP, each player maintains a belief about the empirical mean of the opponents’ average policy, based on which the player selects the best response. With the best response defined in Eq. (17), we can write the FP updates as

$$
a^{i,*}_t \in \text{Br}\Big(\pi^{-i}_t = \frac{1}{t}\sum_{\tau=0}^{t-1} \mathbb{1}\big[a^{-i}_\tau = a\big],\ a \in A^{-i}\Big), \qquad \pi^i_{t+1} = \Big(1 - \frac{1}{t}\Big)\pi^i_t + \frac{1}{t}\, a^{i,*}_t, \quad \forall i. \tag{58}
$$

In the FP scheme, each agent is oblivious to the other agents' rewards; however, the agents need full access to their own payoff matrix in the stage game. In the continuous-time case with an infinitesimal learning rate $1/t \to 0$, Eq. (58) is equivalent to $\mathrm{d}\pi_t/\mathrm{d}t \in \text{Br}(\pi_t) - \pi_t$, in which $\text{Br}(\pi_t) = \big(\text{Br}(\pi^{-1}_t), \dots, \text{Br}(\pi^{-N}_t)\big)$. Viossat and Zapechelnyuk (2013) proved that continuous FP leads to no regret and is thus Hannan consistent. If the empirical distribution of each $\pi^i_t$ converges in FP, then it converges to an NE38.

38Note that convergence to the Nash strategy does not necessarily mean the agents will receive the expected payoff value at the NE. In the example of the rock-paper-scissors game, agents' actions are still miscorrelated after convergence, cycling among the three pure strategies, though their average policies do converge to (1/3, 1/3, 1/3).
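A minimal sketch of the discrete-time FP update in Eq. (58) on matching pennies is shown below, assuming both players know their own payoff matrix and observe the opponent's realised actions (the uniform prior on the counts is an arbitrary initialisation); the empirical frequencies converge to the mixed NE (1/2, 1/2), while the period-by-period actions keep cycling (cf. footnote 38).

```python
import numpy as np

R = np.array([[1, -1], [-1, 1]], dtype=float)   # row player's payoff; the column player receives -R
counts = [np.ones(2), np.ones(2)]               # empirical action counts (row, column), uniform prior

for _ in range(10000):
    belief_about_col = counts[1] / counts[1].sum()
    belief_about_row = counts[0] / counts[0].sum()
    a_row = int(np.argmax(R @ belief_about_col))      # best response to the empirical belief
    a_col = int(np.argmax(-R.T @ belief_about_row))
    counts[0][a_row] += 1
    counts[1][a_col] += 1

print("empirical strategies:",
      np.round(counts[0] / counts[0].sum(), 3),
      np.round(counts[1] / counts[1].sum(), 3))
```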

Although standard discrete-time FP is not Hannan consistent (Cesa-Bianchi and Lugosi, 2006, Exercise 3.8), various extensions have been proposed that guarantee such a property; see the full list summarised in Hart (2013) [Section 10.9]. Smooth FP (Fudenberg and Kreps, 1993; Fudenberg and Levine, 1995) is a stochastic variant of FP (thus also called stochastic FP) that considers a smooth $\epsilon$-best response in which the probability of each action is a softmax function of that action's utility/reward against the historical frequency of the opponents' play. In smooth FP, each player's strategy is a genuine mixed strategy. Let $R^i(a^i_1, \pi^{-i}_t)$ be the expected reward of player $i$'s action $a^i_1 \in A^i$ under the opponents' strategy $\pi^{-i}_t$; the probability of playing $a^i_1$ in the best response is written as

$$\mathrm{Br}^i_\lambda\!\left(\pi^{-i}_t\right)\Big|_{a^i_1} := \frac{\exp\!\big(\lambda^{-1} R^i(a^i_1, \pi^{-i}_t)\big)}{\sum_{k=1}^{|A^i|} \exp\!\big(\lambda^{-1} R^i(a^i_k, \pi^{-i}_t)\big)}. \qquad (59)$$

Benaïm and Faure (2013) verified the Hannan consistency of the smooth best response with the smoothing parameter λ being time dependent and vanishing asymptotically. In potential games, smooth FP is known to converge to a neighbourhood of the set of NE (Hofbauer and Sandholm, 2002). Recently, Swenson and Poor (2019) showed a generic result that in almost all N × 2 potential games, smooth FP converges to the neighbourhood of a pure-strategy NE with probability one. In fact, "smoothing" the cumulative payoffs before computing the best response is crucial to designing learning procedures that achieve Hannan consistency (Kaniovski and Young, 1995). One way to achieve such smoothness is through stochastic smoothing or adding perturbations39. For example, the smooth best response in Eq. (59) is the closed-form solution if one perturbs the cumulative reward by an additional entropy term, that is,

$$\pi^{i,*} \in \mathrm{Br}\!\left(\pi^{-i}\right) = \operatorname*{arg\,max}_{\hat{\pi}^i \in \Delta(A^i)} \; \mathbb{E}_{\hat{\pi}^i, \pi^{-i}}\Big[ R^i - \lambda\cdot\log\big(\hat{\pi}^i\big) \Big]. \qquad (60)$$

Apart from smooth FP, another way to add perturbation is sampled FP, in which during each round the player samples historical time points using a randomised sampling scheme and plays the best response to the other players' moves restricted to the set of sampled time points. Sampled FP is shown to be Hannan consistent when used with Bernoulli sampling (Li and Tewari, 2018). Among the many extensions of FP, the most important is probably generalised weak-

39The physical meaning of perturbing the cumulative payoff is to consider the incomplete information about what the opponent has been playing, variability in their payoffs, and unexplained trembles.

ened FP (GWFP) (Leslie and Collins, 2006), which relaxes standard FP by allowing both approximate best responses and perturbed average strategy updates. Specifically, if we write the ε-best response of player i as

$$R^i\!\left(\mathrm{Br}_\epsilon(\pi^{-i}), \pi^{-i}\right) \;\geq\; \sup_{\pi\in\Delta(A^i)} R^i\!\left(\pi, \pi^{-i}\right) - \epsilon, \qquad (61)$$

then the GWFP updating steps change from Eq. (58) to

$$\pi^i_{t+1} = \left(1-\alpha_{t+1}\right)\pi^i_t + \alpha_{t+1}\Big(\mathrm{Br}^i_{\epsilon_t}\!\left(\pi^{-i}_t\right) + M^i_{t+1}\Big), \quad \forall i. \qquad (62)$$

GWFP is Hannan consistent if $\alpha_t \to 0$, $\epsilon_t \to 0$, $\sum_t \alpha_t = \infty$ as $t \to \infty$, and $\{M_t\}$ satisfies, for any $T > 0$,
$$\lim_{t\to\infty}\;\sup_{k}\left\{\left\|\sum_{i=t}^{k-1}\alpha_{i+1}M_{i+1}\right\| \;\text{ s.t. }\; \sum_{i=t}^{k-1}\alpha_{i+1}\leq T\right\} = 0.$$
It is trivial to see that GWFP recovers FP when $\alpha_t = 1/t$, $\epsilon_t = 0$, and $M_t = 0$. GWFP is an important extension of FP in that it provides two key components for bridging game-theoretic ideas with RL techniques. The approximate best response (the "weakened" term) allows one to adopt a model-free RL algorithm, such as deep Q-learning, to compute the best response. Moreover, the perturbation term (the "generalised" term) enables one to incorporate policy exploration; if one applies an entropy term as the perturbation in addition to the best response (in which case the smooth FP in Eq. (60) is also recovered), the scheme of maximum-entropy RL methods (Haarnoja et al., 2018) is recovered. In fact, the generalised term also accounts for the perturbation that comes from the fact that the beliefs are not updated towards the exact mixed strategy $\pi^{-i}$ but instead towards the observed actions (Benaïm and Hirsch, 1999). As a direct application, Perolat et al. (2018) implemented the GWFP process through an actor-critic framework (Konda and Tsitsiklis, 2000) in the MARL setting.
Brown's original version of FP (Berger, 2007) describes alternating updates by players; yet, the modern usage of FP involves players updating their beliefs simultaneously (Berger, 2007). In fact, it was only recently that Heinrich et al. (2015) proposed the first FP algorithm for EFGs using the sequence-form representation. Extensive-form FP is essentially an adaptation of GWFP from NFGs to EFGs, based on the insight that a mixture of normal-form strategies can be implemented by a weighted combination of behavioural strategies that have the same realisation plan (recall Section 3.3.2). Specifically, let π

and β be two behavioural strategies, Π and B be the two realisation-equivalent mixed strategies40, and α ∈ R+; then, for each information state S, we have
$$\tilde{\pi}(S) = \pi(S) + \frac{\alpha\,\mu^{\beta}(\sigma_S)}{(1-\alpha)\,\mu^{\pi}(\sigma_S) + \alpha\,\mu^{\beta}(\sigma_S)}\,\big(\beta(S) - \pi(S)\big), \quad \forall S \in \mathcal{S}, \qquad (63)$$
where $\sigma_S$ is the sequence leading to S, $\mu^{\pi/\beta}(\sigma_S)$ is the realisation probability of $\sigma_S$ under the given policy, and $\tilde{\pi}(S)$ defines a new behavioural strategy that is realisation equivalent to the mixed strategy $(1-\alpha)\Pi + \alpha B$. Extensive-form FP essentially iterates between Eq. (61), which computes the ε-best response, and Eq. (63), which updates the old behavioural strategy with a step size of α. Note that these two steps must iterate over all information states of the game in each iteration. Similar to the normal-form FP in

Eq. (58), extensive-form FP generates a sequence $\{\pi_t\}_{t\geq 1}$ that provably converges to the NE of a zero-sum game under self-play if the step size α goes to zero asymptotically. As a further enhancement, Heinrich and Silver (2016) implemented neural fictitious self-play (NFSP), in which the best-response step is computed by deep Q-learning (Mnih et al., 2015) and the policy-mixture step is computed through supervised learning. NFSP requires the storage of large replay buffers of past experiences; Lockhart et al. (2019) remove this requirement by obtaining the policy mixture for each player through an independent policy-gradient step against the respective best-responding opponent. All these amendments help make extensive-form FP applicable to real-world games with large-scale information states.
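As a small illustration of the mixing rule in Eq. (63), the sketch below updates the behavioural strategy at a single information state; the function and variable names are ours, and the realisation probabilities μ^π(σ_S) and μ^β(σ_S) are assumed to be supplied by the surrounding tree traversal.

```python
import numpy as np

def mix_behavioural(pi_S, beta_S, mu_pi, mu_beta, alpha):
    """
    Sketch of the extensive-form FP mixing step (Eq. (63)) at one
    information state S.

    pi_S, beta_S : action distributions of the old strategy pi and the
                   (approximate) best response beta at state S
    mu_pi, mu_beta : realisation probabilities of the sequence sigma_S
                     leading to S under pi and beta, respectively
    alpha : mixing step size
    """
    weight = alpha * mu_beta / ((1 - alpha) * mu_pi + alpha * mu_beta)
    return pi_S + weight * (beta_S - pi_S)

# toy usage: mix a uniform policy towards a deterministic best response
pi_S   = np.array([0.5, 0.5])
beta_S = np.array([1.0, 0.0])
print(mix_behavioural(pi_S, beta_S, mu_pi=0.2, mu_beta=0.3, alpha=0.1))
```

Because the update is a convex combination, the result remains a valid action distribution at every information state.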

7.3.2 Counterfactual Regret Minimisation

Another family of methods achieves Hannan consistency by directly minimising the regret, in particular, a special kind of regret named counterfactual regret (CFR) (Zinkevich et al., 2008). Unlike FP methods, which are developed from the stochastic-approximation perspective and generally have asymptotic convergence guarantees, CFR methods are established on the framework of online learning and online convex optimisation (Shalev-Shwartz et al., 2011), which makes it possible to analyse the speed of convergence to the NE, i.e., the regret bound. The key insight from CFR methods is that in order to minimise the total regret in Eq.

40Recall that in games with perfect recall, Kuhn’s theorem (Kuhn, 1950a) suggests that the behavioural strategy and mixed strategies are equivalent in terms of the realisation probability of different outcomes.

(57) to approximate the NE, it suffices to minimise the immediate counterfactual regret at the level of each information state. Mathematically, Zinkevich et al. (2008) [Theorem 3] show that the sum of the immediate counterfactual regret over all encountered information states provides an upper bound for the total regret in Eq. (57), i.e.,

$$\mathrm{Reg}^i_T \;\leq\; \sum_{S\in\mathcal{S}^i} \max\Big\{\mathrm{Reg}^i_{T,\mathrm{imm}}(S),\, 0\Big\}, \quad \forall i. \qquad (64)$$

To fully describe $\mathrm{Reg}^i_{T,\mathrm{imm}}(S)$, we need two additional notations. Let $\mu^{\pi}(\sigma_S \to \sigma_T)$ denote, given the agents' behavioural policies π, the realisation probability of going from the sequence $\sigma_S$41, which leads to the information state $S \in \mathcal{S}^i$, to its extended sequence $\sigma_T$, which continues from S and reaches the terminal state T. Let $\hat{v}^i(\pi, S)$ be the counterfactual value function, i.e., the expected reward of agent i in the non-terminal information state S, which is written as

$$\hat{v}^i(\pi, S) = \sum_{s\in S,\; T\in\mathcal{T}} \mu^{\pi^{-i}}(\sigma_s)\, \mu^{\pi}(\sigma_s \to \sigma_T)\, R^i(T). \qquad (65)$$

Note that in Eq. (65), the contribution from player i in realising $\sigma_s$ is excluded; we treat whatever action the current player i needs to reach state s as having a probability of one, that is, $\mu^{\pi^i}(\sigma_s) = 1$. The motivation is that one can now make the value function $\hat{v}^i(\pi, S)$ "counterfactual" simply by writing the consequence of player i not playing action a in the information state S as $\big[\hat{v}^i(\pi|_{S\to a}, S) - \hat{v}^i(\pi, S)\big]$, in which $\pi|_{S\to a}$ is a joint strategy profile identical to π, except that player i always chooses action a when information state S is encountered. Finally, based on Eq. (65), the immediate counterfactual regret can be expressed as

$$\mathrm{Reg}^i_{T,\mathrm{imm}}(S) = \max_{a\in\chi(S)} \mathrm{Reg}^i_T(S,a), \qquad \mathrm{Reg}^i_T(S,a) = \frac{1}{T}\sum_{t=1}^{T}\Big(\hat{v}^i(\pi_t|_{S\to a}, S) - \hat{v}^i(\pi_t, S)\Big). \qquad (66)$$
Note that the T in Eq. (65) is different from that in Eq. (66). Since minimising the immediate counterfactual regret minimises the overall regret, we can find an approximate NE by choosing a specific behavioural policy $\pi^i(S)$ that minimises Eq. (66). To this end, one can apply Blackwell's approachability theorem

41Recall that for games of perfect recall, the sequence that leads to the information state, including all the choice nodes within that information state, is unique.

(Blackwell et al., 1956) to minimise the regret independently on each information set, a procedure also known as regret matching (Hart and Mas-Colell, 2001). As we are most concerned with positive regret, denoted by $\lfloor\cdot\rfloor_+$, we obtain, $\forall S \in \mathcal{S}^i, \forall a \in \chi(S)$, the strategy of player i at time T + 1 as Eq. (67).

$$\pi^i_{T+1}(S, a) = \begin{cases} \dfrac{\big\lfloor \mathrm{Reg}^i_T(S,a) \big\rfloor_+}{\sum_{a\in\chi(S)} \big\lfloor \mathrm{Reg}^i_T(S,a) \big\rfloor_+} & \text{if } \sum_{a\in\chi(S)} \big\lfloor \mathrm{Reg}^i_T(S,a) \big\rfloor_+ > 0, \\[2ex] \dfrac{1}{|\chi(S)|} & \text{otherwise.} \end{cases} \qquad (67)$$
In the standard CFR algorithm, for each information set, Eq. (67) is used to compute action probabilities in proportion to the positive cumulative regrets. In addition to regret matching, another online learning tool that minimises regret is Hedge (Freund and Schapire, 1997; Littlestone and Warmuth, 1994), in which an exponentially weighted function is used to derive a new strategy, which is

$$\pi_{t+1}(a_k) = \frac{\pi_t(a_k)\, e^{-\eta R_t(a_k)}}{\sum_{j=1}^{K} \pi_t(a_j)\, e^{-\eta R_t(a_j)}}, \qquad \pi_1(\cdot) = \frac{1}{K}. \qquad (68)$$
In computing Eq. (68), Hedge needs access to the full information of the reward values for all actions, including those that are not selected. EXP3 (Auer et al., 1995) extended the Hedge algorithm to the partial-information setting in which the player knows only the reward of the chosen action (i.e., a bandit version) and has to estimate the loss of the actions that it does not select. Brown et al. (2017) augmented the Hedge algorithm with a tree-pruning technique based on dynamic thresholding. Gordon (2007) developed Lagrangian hedging, which unifies no-regret algorithms, including both regret matching and Hedge, through a class of potential functions. We recommend Cesa-Bianchi and Lugosi (2006) for a comprehensive overview of no-regret algorithms.
No-regret algorithms, under the framework of online learning, offer a natural way to study the regret bound (i.e., how fast the regret decays with time). For example, CFR and its variants ensure a counterfactual regret bound of $\mathcal{O}(\sqrt{T})$42; as a result of Eq. (64), the total regret is upper bounded by $\mathcal{O}(\sqrt{T}\cdot|\mathcal{S}|)$, which is linear in the number of information states. In other words, the average policy of applying

42 According to Zinkevich(2003), any online convex optimisation problem can be made to incur RegT = Ω(√T ).

CFR-type methods in a two-player zero-sum EFG generates an $\mathcal{O}(|\mathcal{S}|/\sqrt{T})$-approximate NE after T steps of self-play43.
Compared with the LP approach (recall Eq. (33)), which is applicable only to small-scale EFGs, the standard CFR method can be applied to limit Texas hold'em with as many as $10^{12}$ states. CFR+, the fastest implementation of CFR, can solve games with up to $10^{14}$ states (Tammelin et al., 2015). However, CFR methods still have a bottleneck in that computing Eq. (65) requires a traversal of the entire game tree down to the terminal nodes in each iteration. Pruning the sub-optimal paths in the game tree is a natural solution (Brown et al., 2017; Brown and Sandholm, 2015, 2017). Many CFR variants have been developed to further improve computational efficiency. Lanctot et al. (2009) integrated Monte Carlo sampling with CFR (MCCFR) to significantly reduce the per-iteration time cost of CFR by traversing a smaller, sampled portion of the tree. Burch et al. (2012) improved MCCFR by sampling only a subset of a player's actions, which provides an even faster convergence rate in games that contain many player actions. Gibson et al. (2012); Schmid et al. (2019) investigated the sampling variance and proposed MCCFR variants with a variance-reduction module. Johanson et al. (2012b) introduced a more accurate MCCFR sampler by considering the set of outcomes from the chance node, rather than sampling only one outcome, as in all previous methods. Apart from Monte Carlo methods, function-approximation methods have also been introduced (Jin et al., 2018; Waugh et al., 2014). The idea of these methods is to predict the regret directly, and the no-regret algorithm then uses these predictions in place of the true regret to define a sequence of policies. To this end, the application of deep neural networks has led to great success (Brown et al., 2019).
Interestingly, there exists a hidden equivalence between model-free policy-based/actor-critic MARL methods and the CFR algorithm (Jin et al., 2018; Srinivasan et al., 2018). In particular, if we consider the counterfactual value function in Eq. (65) to be explicitly dependent on the action a that player i chooses at state S, in which case we have $\hat{v}^i(\pi, S) = \sum_{a\in\chi(S)}\pi^i(S,a)\,\hat{q}^i(\pi, S, a)$, then it is shown in Srinivasan et al. (2018) [Section 3.2] that the Q-function in standard MARL, $Q^{i,\pi}(s,a) = \mathbb{E}_{s'\sim P,\, a\sim\pi}\big[\sum_t \gamma^t R^i(s,a,s') \,\big|\, s,a\big]$, differs

43The self-play assumption can in fact be relaxed. Johanson et al. (2012a) show that in two-player zero-sum games, as long as both agents minimise their regret, not necessarily through the same algorithm, their time-average policies will converge to the NE with the same regret bound $\mathcal{O}(\sqrt{T})$. An example is to let a CFR player play against a best-response opponent.

from $\hat{q}^i(\pi, S, a)$ in CFR only by a constant factor, the probability of reaching S, that is,

$$Q^{i,\pi}(s, a) = \frac{\hat{q}^i(\pi, S, a)}{\sum_{s\in S}\mu^{\pi^{-i}}(\sigma_s)}. \qquad (69)$$
Subtracting a value function from both sides of Eq. (69) shows that the counterfactual regret $\mathrm{Reg}^i_T(S,a)$ in Eq. (66) differs from the advantage function in MARL, i.e., $Q^{i,\pi}(s, a^i, a^{-i}) - V^{i,\pi}(s, a^{-i})$, only by a constant factor, the realisation probability. As a result, the multi-agent actor-critic algorithm (Foerster et al., 2018b) can be formulated as a special type of CFR method, thus sharing a similar convergence guarantee and regret bound in two-player zero-sum games. This equivalence has also been found by Hennes et al. (2019), where the CFR method with Hedge can be written as a particular actor-critic method that computes the policy gradient through replicator dynamics.
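For intuition on the regret-matching rule in Eq. (67), the following sketch runs it in self-play on a single information state (a one-shot rock-paper-scissors game); the payoff matrix and iteration count are assumed for illustration, and a full CFR implementation would additionally weight the regrets by counterfactual reach probabilities and loop over every information state of the tree.

```python
import numpy as np

# Regret matching (Eq. (67)) in self-play on rock-paper-scissors; the payoff
# matrix and the number of iterations below are assumed for illustration.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])   # row player's payoff; the game is zero-sum

def rm_strategy(cum_regret):
    """Eq. (67): play in proportion to positive cumulative regret."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.ones_like(positive) / len(positive)

cum_regret = [np.zeros(3), np.zeros(3)]     # one regret vector per player
avg_strategy = [np.zeros(3), np.zeros(3)]
T = 100000

for t in range(T):
    s1 = rm_strategy(cum_regret[0])
    s2 = rm_strategy(cum_regret[1])
    # expected payoff of each pure action against the opponent's current mix
    u1 = A @ s2          # player 1
    u2 = -A.T @ s1       # player 2
    # instantaneous regret of each action, accumulated over time
    cum_regret[0] += u1 - s1 @ u1
    cum_regret[1] += u2 - s2 @ u2
    avg_strategy[0] += s1
    avg_strategy[1] += s2

print(avg_strategy[0] / T)   # approaches the NE (1/3, 1/3, 1/3)
print(avg_strategy[1] / T)
```

As stated above for regret-minimising self-play, it is the time-averaged strategies, not the per-round strategies, that approach the equilibrium.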

7.4 Online Markov Decision Processes

A common situation in which online learning techniques are applied is in stateless games, where the learning agent faces an identical decision problem in each trial (e.g., playing a multi-armed bandit in a casino). However, real-world decision problems often occur in a dynamic and changing environment. Such an environment is commonly captured by a state variable which, when incorporated into online learning, leads to an online MDP. Online MDPs (Auer et al., 2009; Even-Dar et al., 2009; Yu et al., 2009), also called adversarial MDPs44, focus on the problem in which the reward and transition dynamics can change over time, i.e., they are non-stationary and time-dependent. In contrast to an ordinary stochastic game, the opponent/adversary in an online MDP is not necessarily rational or even self-optimising. The aim of studying online MDPs is to provide the agent with policies that perform well against every possible opponent (including but not limited to adversarial opponents), and the objective of the learning agent is to minimise its average loss during the learning process. Quantitatively, the loss is measured by how much worse off the agent is compared to the best stationary policy in hindsight. The expected regret is thus different from Eq. (57) (except in repeated games)

44The word "adversarial" is inherited from the online learning literature, i.e., stochastic bandits vs adversarial bandits (Auer et al., 2002). "Adversarial" means there exists a virtual adversary (or nature) who has complete control over the reward function and transition dynamics, and the adversary does not necessarily maintain a fully competitive relationship with the learning agent.

and is written as
$$\mathrm{Reg}_T = \sup_{\pi\in\Pi}\frac{1}{T}\,\mathbb{E}_\pi\Big[\sum_{t=1}^{T} R_t(s^*_t, a^*_t) - R_t(s_t, a_t)\Big], \qquad (70)$$
where $\mathbb{E}_\pi$ denotes the expectation over the sequence of $(s^*_t, a^*_t)$ induced by the stationary policy π. Note that the reward-function sequence and the transition-kernel sequence are given by the adversary, and they are not influenced by the retrospective sequence $(s^*_t, a^*_t)$. The goal is to find a no-regret algorithm that satisfies $\mathrm{Reg}_T \to 0$ as $T \to \infty$ with probability 1.
A sufficient condition that ensures the existence of no-regret algorithms for online MDPs is the oblivious assumption: both the reward functions and transition kernels are fixed in advance, although they are unknown to the learning agent. This scenario is in contrast to the stateless setting, in which no-regret is achievable even if the opponent is allowed to be adaptive/non-oblivious, i.e., to choose the reward function and transition kernels in accordance with the learning agent's history $(s_0, a_0, ..., s_t)$. In short, Mannor and Shimkin (2003); Yu et al. (2009) demonstrated that in order to achieve sub-linear regret, it is essential that the changing rewards are chosen obliviously. Furthermore, Yadkori et al. (2013) showed, with the example of an online shortest-path problem, that there does not exist a polynomial-time solution (in terms of the size of the state-action space) when both the reward functions and transition dynamics are adversarially chosen, even if the adversary is oblivious (i.e., it cannot adapt to the other agent's historical actions). Most recently, Cheung et al. (2020); Ortner et al. (2020) investigated online MDPs where the transition dynamics are allowed to change slowly (i.e., the total variation does not exceed a specific budget). Therefore, the majority of existing no-regret algorithms for online MDPs focus on an oblivious adversary for the reward function only. The nuances of different algorithms lie in whether the transition kernel is assumed to be known to the learning agent and whether the feedback reward that the agent receives is in the full-information setting or in the bandit setting (i.e., one can only observe the reward of a taken action).
Two design principles can lead to no-regret algorithms that solve online MDPs with an oblivious adversary controlling the reward function. One is to leverage the local-global regret decomposition result (Even-Dar et al., 2005, 2009) [Lemma 5.4], which demonstrates that one can in fact achieve no regret globally by running a local regret-minimisation algorithm at each state; a similar result is observed for the CFR algorithm described in Eq. (66). Let $\mu^*(\cdot)$ denote the state occupancy induced by policy $\pi^*$; we

then obtain the decomposition result by

$$\mathrm{Reg}_T = \sum_{s\in S}\mu^*(s)\underbrace{\sum_{t=1}^{T}\sum_{a\in A}\big(\pi^*(a\,|\,s) - \pi_t(a\,|\,s)\big)\,Q_t(s, a)}_{\text{local regret in state } s \text{ with reward function } Q_t(s,\,\cdot)}. \qquad (71)$$
Under full knowledge of the transition function and full-information feedback about the reward, Even-Dar et al. (2009) proposed the famous MDP-Expert (MDP-E) algorithm, which adopts Hedge (Freund and Schapire, 1997) as the regret minimiser and achieves $\mathcal{O}\big(\sqrt{\tau^3 T\ln|A|}\big)$ regret, where τ is a bound on the mixing time of the MDP45. For comparison, the theoretical lower bound for regret in a fixed MDP (i.e., no adversary perturbs the reward function) is $\Omega\big(\sqrt{|S||A|T}\big)$46 (Auer et al., 2009). Interestingly, Neu et al. (2017) showed that there in fact exists an equivalence between TRPO methods (Schulman et al., 2015) and MDP-E methods. Under bandit feedback, Neu et al. (2010) analysed MDP-EXP3, which achieves a regret bound of $\mathcal{O}\big(\sqrt{\tau^3 T|A|\log|A|}/\beta\big)$, where β is a lower bound on the probability of reaching a certain state under a given policy. Later, Neu et al. (2014) removed the dependency on β and achieved $\mathcal{O}(\sqrt{T}\log T)$ regret. One major advantage of the local-global design principle is that it can work seamlessly with function-approximation methods (Bertsekas and Tsitsiklis, 1996). For example, Yu et al. (2009) eliminated the requirement of knowing the transition kernel by incorporating Q-learning methods; their proposed Q-follow-the-perturbed-leader (Q-FPL) method achieves $\mathcal{O}(T^{2/3})$ regret. Abbasi-Yadkori et al. (2019) proposed POLITEX, which adopts least-squares policy evaluation (LSPE) with linear function approximation and achieves $\mathcal{O}(T^{3/4} + \epsilon_0 T)$ regret, in which $\epsilon_0$ is the worst-case approximation error; Cai et al. (2019a) used the same LSPE method, yet their proposed OPPO algorithm achieves $\mathcal{O}(\sqrt{T})$ regret.
Apart from the local-global decomposition principle, another design principle is to formulate the regret-minimisation problem as an online linear optimisation (OLO) problem and then apply gradient-descent-type methods. Specifically, since the regret in Eq. (71) can be further written as the inner product $\mathrm{Reg}_T = \sum_{t=1}^{T}\langle\mu^* - \mu_t, R_t\rangle$, one can run the gradient-descent method by

$$\mu_{t+1} = \operatorname*{arg\,max}_{\mu\in\mathcal{U}}\;\Big\{\langle\mu, R_t\rangle - \frac{1}{\eta}\,\mathcal{D}\big(\mu\,\|\,\mu_t\big)\Big\}, \qquad (72)$$

45Roughly, it can be considered as the time that a policy needs to reach its stationary status in the MDP. See a precise definition in Even-Dar et al. (2009) [Assumption 3.1]. 46This lower bound has recently been achieved by Azar et al. (2017) up to a logarithmic factor.

where $\mathcal{U} = \big\{\mu\in\Delta_{S\times A}: \sum_a \mu(s, a) = \sum_{s', a'} P(s\,|\,s', a')\,\mu(s', a')\big\}$ is the set of all valid stationary distributions47, $\mathcal{D}$ denotes a certain form of divergence, and the policy can be extracted by $\pi_{t+1}(a\,|\,s) = \mu_{t+1}(s, a)/\mu_{t+1}(s)$. One significant advantage of this type of method is that it can flexibly handle different model constraints and extensions. If one uses the Bregman divergence as $\mathcal{D}$, then online mirror descent is recovered (Nemirovsky and Yudin, 1983), which is guaranteed to achieve a nearly optimal regret for OLO problems (Srebro et al., 2011). Zimin and Neu (2013) and Dick et al. (2014) adopted the relative entropy for $\mathcal{D}$; the resulting online relative entropy policy search (O-REPS) algorithm achieves an $\mathcal{O}\big(\sqrt{\tau T\log(|S||A|)}\big)$ regret in the full-information setting and an $\mathcal{O}\big(\sqrt{T|S||A|\log(|S||A|)}\big)$ regret in the bandit setting. For comparison, the aforementioned MDP-E algorithm achieves $\mathcal{O}\big(\sqrt{\tau^3 T\ln|A|}\big)$ and $\mathcal{O}\big(\sqrt{\tau^3 T|A|\log|A|}/\beta\big)$, respectively. When the transition dynamics are unknown to the agent, Rosenberg and Mansour (2019) extended O-REPS by incorporating the classic idea of optimism in the face of uncertainty from Auer et al. (2009), and the induced UC-O-REPS algorithm achieves $\mathcal{O}\big(|S|\sqrt{|A|T}\big)$ regret.
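The local-global principle of Eq. (71) can be summarised by the skeleton below: one Hedge learner (Eq. (68)) is maintained per state and fed the state-local values Q_t(s, ·). The policy-evaluation step that produces Q_t (which in MDP-E uses the known transition kernel and the adversary's reward R_t) is abstracted behind a stub, so this is a structural sketch under our own naming rather than a faithful MDP-E implementation.

```python
import numpy as np

# Structural sketch of the local-global decomposition (Eq. (71)): one Hedge
# learner per state, updated with the state-local values Q_t(s, .). The
# function `evaluate_Q` is a stub standing in for the policy-evaluation step
# (which needs the known transition kernel and the adversarially chosen
# reward R_t); all sizes and parameters here are illustrative.
num_states, num_actions, eta = 4, 3, 0.1

# one weight vector (Hedge learner) per state, initialised uniformly
policy = np.ones((num_states, num_actions)) / num_actions

def evaluate_Q(policy, reward_t):
    """Stub: return Q_t(s, a) for the current policy and reward function."""
    # a real implementation would solve the (known) MDP for reward_t here
    return reward_t

for t in range(1000):
    reward_t = np.random.rand(num_states, num_actions)  # adversary's choice
    Q_t = evaluate_Q(policy, reward_t)
    # Hedge update (Eq. (68)) applied independently at every state; gains
    # rather than losses are used, hence the positive sign in the exponent.
    policy *= np.exp(eta * Q_t)
    policy /= policy.sum(axis=1, keepdims=True)
```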

7.5 Turn-Based Stochastic Games

An important class of games that lies between SGs and EFGs is the two-player zero-sum turn-based SG (2-TBSG). In a TBSG, the state space is split between the two agents, $S = S^1\cup S^2$ with $S^1\cap S^2 = \emptyset$, and at every time step the game is in exactly one of the states, in either $S^1$ or $S^2$. The two players alternate taking turns to make decisions, and each state is controlled48 by only one of the players, $\pi^i: S^i \to A^i$, $i = 1, 2$. The state then transitions into the next state according to $P: S^i\times A^i \to S^j$, $i, j = 1, 2$. Given a joint policy $\pi = (\pi^1, \pi^2)$, the first player seeks to maximise the value function $V^{\pi}(s) = \mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^t R(s_t, \pi(s_t))\,\big|\, s_0 = s\big]$, while the second player seeks to minimise it; the saddle point is the NE of the game.
Research on 2-TBSGs leads to many important finite-sample bounds, i.e., how many samples one would need to reach the NE at a given precision, for understanding multi-agent learning algorithms. Hansen et al. (2013) extended Ye (2005, 2010)'s result from single-agent MDPs to 2-TBSGs and proved that the strongly polynomial time complexity of policy-iteration algorithms also holds in the context of 2-TBSGs if the payoff

47In the online MDP literature, it is generally assumed that every policy reaches its stationary distribution immediately; see the policy mixing time assumption in Yu et al. (2009) [Assumption 2.1]. 48Note that since the game is turn-based, the Nash policies are deterministic.

Algorithm 1 A General Solver for Open-Ended Meta-Games
1: Initialise: the "high-level" policy set $S = \prod_{i\in\mathcal{N}} S^i$, the meta-game payoff $\mathbf{M}$ on all $S \in S$, and the meta-policies $\pi^i = \mathrm{UNIFORM}(S^i)$.
2: for iteration $t \in \{1, 2, ...\}$ do:
3:   for each player $i \in \mathcal{N}$ do:
4:     Compute the meta-policy $\pi_t$ by the meta-game solver $\mathcal{S}(\mathbf{M}_t)$.
5:     Find a new policy against the others by the Oracle: $S^i_t = \mathcal{O}^i(\pi^{-i}_t)$.
6:     Expand $S^i_{t+1} \leftarrow S^i_t \cup \{S^i_t\}$ and update the meta-payoff $\mathbf{M}_{t+1}$.
7:   terminate if: $S^i_{t+1} = S^i_t$, $\forall i \in \mathcal{N}$.
8: Return: $\pi$ and $S$.

matrix is fully accessible. In the RL setting, in which the transition model is unknown, Sidford et al. (2018, 2020) provided a near-optimal Q-learning algorithm that computes an ε-optimal strategy with high probability given $\mathcal{O}\big((1-\gamma)^{-3}\epsilon^{-2}\big)$ samples from the transition function for each state-action pair. This result of polynomial sample complexity is remarkable since such results were previously believed to hold only for single-agent MDPs. Recently, Jia et al. (2019) showed that if the transition model can be embedded in some state-action feature space, i.e., $\exists\,\psi_k(s')$ such that $P(s'\,|\,s,a) = \sum_{k=1}^{K}\phi_k(s,a)\,\psi_k(s')$, $\forall s'\in S, (s,a)\in S\times A$, then the sample complexity of the two-player Q-learning algorithm for finding an ε-NE is only linear in the number of features, $\mathcal{O}\big(K/(\epsilon^2(1-\gamma)^4)\big)$.
All of the above works focus on the offline setting, where it is assumed that there exists an oracle that can unconditionally provide state-action transition samples. Wei et al. (2017) studied an online setting in an average-reward two-player SG. They achieved a polynomial sample-complexity bound if the opponent plays an optimistic best response, and a sublinear regret bound against an arbitrary opponent.
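For the planning (known-model) case of a 2-TBSG, the value-iteration sketch below alternates max and min backups according to which player owns each state; the randomly generated instance and the tolerance are illustrative assumptions, and the sample-based algorithms cited above replace these exact backups with estimates from transition samples.

```python
import numpy as np

# Value-iteration sketch for a two-player zero-sum turn-based SG (2-TBSG)
# with a known model: player 1 maximises, player 2 minimises, and each state
# is controlled by exactly one player. The random instance is illustrative.
rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
owner = np.array([0, 1, 0, 1, 0, 1])          # which player controls each state
R = rng.random((S, A))                         # reward of the controller's action
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel

V = np.zeros(S)
for _ in range(500):
    Q = R + gamma * P @ V                      # Q[s, a] backup for the controller
    V_new = np.where(owner == 0, Q.max(axis=1), Q.min(axis=1))
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# deterministic policies extracted from the converged values (cf. footnote 48)
pi1 = Q.argmax(axis=1)                         # used at player-1 states
pi2 = Q.argmin(axis=1)                         # used at player-2 states
```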

7.6 Open-Ended Meta-Games

In solving real-world zero-sum games, such as GO or StarCraft, since the number of atomic pure strategies can be prohibitively large, one feasible approach is instead to focus on meta-games. A meta-game is constructed by simulating games that cover combinations of "high-level" policies in the policy space (e.g., "bluff" in Poker or "rushing" in StarCraft), with entries corresponding to the players' empirical payoffs under a certain joint "high-level" policy profile; therefore, meta-game analysis is often called empirical game-theoretic analysis (EGTA) (Tuyls et al., 2018; Wellman, 2006). Analysing meta-games is a practical approach to tackling games that have a huge pure-strategy space, since

Table 3: Variations of Different Meta-Game Solvers

Method                                     Game type                         Meta-solver S     Oracle O
Self-play (Fudenberg et al., 1998)         multi-player potential            [0, ..., 0, 1]    Br(·)
GWFP (Leslie and Collins, 2006)            two-player zero-sum / potential   UNIFORM           Br_ε(·)
Double Oracle (McMahan et al., 2003)       two-player zero-sum               NE                Br(·)
PSRO_N (Lanctot et al., 2017)              two-player zero-sum               NE                Br_ε(·)
PSRO_rN (Balduzzi et al., 2019)            symmetric zero-sum                NE                Rectified Br_ε(·)
α-PSRO (Muller et al., 2019)               multi-player general-sum          α-Rank            PBr(·)

the number of "high-level" policies is usually far smaller than the number of pure strategies. For example, the number of tactics in StarCraft is in the hundreds, compared to the vast raw action space of approximately $10^8$ possibilities (Vinyals et al., 2017). Traditional game-theoretic solution concepts such as the NE can still be computed on meta-games, but in a much more scalable manner; this is because the number of "high-level" strategies in the meta-game is usually far smaller than the number of atomic actions of the underlying game. Furthermore, it has been shown that an ε-NE of the meta-game is in fact a 2ε-NE of the underlying game (Tuyls et al., 2018). Meta-games are often open-ended because in general there exists an infinite number of policies to play a real-world game and, as new strategies are discovered and added to the agents' strategy sets during training, the dimension of the meta-game payoff table keeps expanding. If one writes the game evaluation engine as $\phi: \mathcal{S}^1\times\mathcal{S}^2 \to \mathbb{R}$ such that if $S^1\in\mathcal{S}^1$ beats $S^2\in\mathcal{S}^2$ we have $\phi(S^1, S^2) > 0$, with $\phi < 0$ and $\phi = 0$ referring to losses and ties, then the meta-game payoff can be represented by $\mathbf{M} = \big\{\phi(S^1, S^2): (S^1, S^2)\in\mathcal{S}^1\times\mathcal{S}^2\big\}$. The sets $\mathcal{S}^1$ and $\mathcal{S}^2$ can be regarded as, for example, two populations of deep neural networks (DNNs), with each $S^1, S^2$ being a DNN with independent weights. In such a context, the goal of learning in meta-games is to find $\mathcal{S}^i$ and policy $\pi^i\in\Delta(\mathcal{S}^i)$ such that the exploitability can be

minimised, which is,

$$\mathrm{Exploitability}(\pi) = \sum_{i\in\{1,2\}}\Big[\mathbf{M}^i\big(\mathrm{Br}^i(\pi^{-i}), \pi^{-i}\big) - \mathbf{M}^i\big(\pi\big)\Big]. \qquad (73)$$
It is easy to see that Eq. (73) reaches zero when π is a NE. A general solver for open-ended meta-games is the policy space response oracle (PSRO) (Lanctot et al., 2017). Inspired by the double oracle algorithm (McMahan et al., 2003), which leverages Benders' decomposition (Benders, 1962) to solve large-scale linear programmes for two-player zero-sum games, PSRO is a direct extension of double oracle (McMahan et al., 2003) that incorporates an RL subroutine as an approximate best response. Specifically, one can write PSRO and its variations as Algorithm 1, which essentially involves an iterative two-step process: first solve for the meta-policy (e.g., the Nash of the meta-game), and then, based on the meta-policy, find a new better-performing policy against the opponent's current meta-policy to augment the existing population. The meta-policy solver, denoted by $\mathcal{S}(\cdot)$, computes a joint meta-policy profile π based on the current payoff $\mathbf{M}$, where different solution concepts can be adopted (e.g., NE). Finding a new policy is equivalent to solving a single-player optimisation problem given the opponents' policy sets $\mathcal{S}^{-i}$ and meta-policies $\pi^{-i}$, which are fixed and known. One can regard a new policy as given by an Oracle, denoted by $\mathcal{O}$. In two-player zero-sum cases, an oracle represents $\mathcal{O}^1(\pi^2) = \big\{S^1: \sum_{S^2\in\mathcal{S}^2}\pi^2(S^2)\cdot\phi(S^1, S^2) > 0\big\}$. Generally, Oracles can be implemented through optimisation subroutines such as RL algorithms. Finally, after a new policy is found, the payoff table $\mathbf{M}$ is expanded, and the missing entries are filled by running new game simulations. The above two-step process loops over each player at each iteration, and it terminates if no new policies can be found for any player.
Algorithm 1 is a general framework; with appropriate choices of the meta-game solver $\mathcal{S}$ and Oracle $\mathcal{O}$, it can represent solvers for different types of meta-games. We summarise these variations in Table 3. For example, it is trivial to see that FP/GWFP is recovered when $\mathcal{S} = \mathrm{UNIFORM}(\cdot)$ and $\mathcal{O}^i = \mathrm{Br}^i(\cdot)/\mathrm{Br}^i_\epsilon(\cdot)$. The double oracle (McMahan et al., 2003) and PSRO methods (Lanctot et al., 2017) refer to the cases in which the meta-solver computes the NE. For solving symmetric zero-sum games (i.e., $\mathcal{S}^1 = \mathcal{S}^2$ and $\phi(S^1, S^2) = -\phi(S^2, S^1), \forall S^1, S^2\in\mathcal{S}^1$), Balduzzi et al. (2019) proposed the

rectified best response to promote behavioural diversity, written as

$$\mathrm{Rectified\;Br}\big(\pi^{2,*}\big) \subseteq \operatorname*{arg\,max}_{S^1}\;\sum_{S^2\in\mathcal{S}^2}\pi^{2,*}(S^2)\cdot\big\lfloor\phi(S^1, S^2)\big\rfloor_+. \qquad (74)$$
By rectifying only the positive values of $\phi(S^1, S^2)$ in Eq. (74), player 1 is encouraged to amplify its strengths and ignore its weaknesses when finding a new policy against the NE of player 2 during training; this turns out to be a critical component for tackling zero-sum games with strong non-transitive dynamics49.
Double oracle and PSRO methods can only solve zero-sum games. When it comes to multi-player general-sum games, a new solution concept named α-Rank (Omidshafiei et al., 2019) can be used to replace the intractable NE. The idea of α-Rank is built on the response graph of a game. On the response graph, each joint pure-strategy profile is a node, and a directed edge points from node $S\in\mathcal{S}$ to node $\sigma\in\mathcal{S}$ if 1) S and σ differ in only one single player's strategy, and 2) that deviating player, denoted by i, benefits from deviating from S to σ, i.e., $\mathbf{M}^i(\sigma) > \mathbf{M}^i(S)$. The sink strongly-connected component (SSCC) nodes of the response graph, which have only incoming edges but no outgoing edges, are of great interest. To find those SSCC nodes, α-Rank constructs a random walk along the directed response graph, which can be equivalently described by a Markov chain with transition probability matrix C given by

$$C_{S,\sigma} = \begin{cases}\eta\,\dfrac{1-\exp\!\big(-\alpha\big(\mathbf{M}^k(\sigma)-\mathbf{M}^k(S)\big)\big)}{1-\exp\!\big(-\alpha m\big(\mathbf{M}^k(\sigma)-\mathbf{M}^k(S)\big)\big)} & \text{if } \mathbf{M}^k(\sigma)\neq\mathbf{M}^k(S),\\[2ex] \dfrac{\eta}{m} & \text{otherwise,}\end{cases} \qquad C_{S,S} = 1 - \sum_{\sigma\neq S} C_{S,\sigma}, \qquad (75)$$

where $\eta = \big(\sum_{i\in\mathcal{N}}(|\mathcal{S}^i|-1)\big)^{-1}$, $m\in\mathbb{N}$, and $\alpha > 0$ are three constants, and $\mathbf{M}^k$ refers to the payoff of the deviating player k. A large α ensures that the Markov chain is irreducible and thus guarantees the existence and uniqueness of the α-Rank solution, which is the unique stationary distribution π of the Markov chain, $C^{\top}\pi = \pi$. The probability mass of each joint strategy in π can be interpreted as the

49Any symmetric zero-sum game consists of both a transitive and a non-transitive component (Balduzzi et al., 2019). A game is transitive if φ can be represented by a monotonic rating function f such that performance on the game is the difference in ratings, $\phi(S^1, S^2) = f(S^1) - f(S^2)$; it is non-transitive if φ satisfies $\int_{\mathcal{S}^2}\phi(S^1, S^2)\,\mathrm{d}S^2 = 0$, meaning that winning against some strategies will be counterbalanced by losses against other strategies in the population.

longevity of that strategy during an evolution process (Omidshafiei et al., 2019). The main advantage of α-Rank is that it is unique and its solution is P-complete even on multi-player general-sum games. α^α-Rank, developed by Yang et al. (2019a), computes α-Rank through stochastic gradient methods so that there is no need to store the whole transition matrix in Eq. (75) before obtaining the final output π; this is particularly important when the meta-games are prohibitively large, as in real-world domains. When PSRO adopts α-Rank as the meta-solver, a simple best response is found to fail to converge to the SSCC of the response graph before termination (Muller et al., 2019). To suit α-Rank, Muller et al. (2019) therefore proposed a preference-based best-response oracle, written as

$$\mathrm{PBr}\big(\pi^{-i}\big) \subseteq \operatorname*{arg\,max}_{\sigma\in\mathcal{S}^i}\;\mathbb{E}_{S^{-i}\sim\pi^{-i}}\Big[\mathbb{1}\big(\mathbf{M}^i(\sigma, S^{-i}) > \mathbf{M}^i(S^i, S^{-i})\big)\Big], \qquad (76)$$
and the combination of α-Rank with PBr(·) in Eq. (76) is called α-PSRO. Due to the tractability of α-Rank on general-sum games, α-PSRO is credited as a generalised training approach for multi-agent learning.
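To make Algorithm 1 concrete, the sketch below instantiates it as the double oracle for a two-player zero-sum meta-game: the meta-solver computes the NE of the restricted payoff matrix by linear programming, and the Oracle is an exact best response over a finite pool of underlying strategies. The random evaluation matrix stands in for the engine φ, and all names are ours; a PSRO variant would replace the exact oracle with an RL-trained approximate best response.

```python
import numpy as np
from scipy.optimize import linprog

# Double-oracle instantiation of Algorithm 1 on a zero-sum meta-game.
# phi[i, j] > 0 means strategy i of player 1 beats strategy j of player 2.
# The random matrix and the exact best-response oracle over a finite pool
# are illustrative assumptions.
rng = np.random.default_rng(0)
n = 30
phi = rng.standard_normal((n, n))           # player 1's payoff; player 2 gets -phi

def maxmin_strategy(M):
    """LP for the max-min (row) player of the zero-sum payoff matrix M."""
    m, k = M.shape
    # variables z = (x_1..x_m, v); minimise -v  s.t.  v <= (M^T x)_j for all j
    c = np.r_[np.zeros(m), -1.0]
    A_ub = np.c_[-M.T, np.ones(k)]
    b_ub = np.zeros(k)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res.x[:m]

pool1, pool2 = [0], [0]                     # restricted "high-level" policy sets
for _ in range(50):
    M = phi[np.ix_(pool1, pool2)]           # restricted meta-game payoff
    p = maxmin_strategy(M)                  # player 1's meta-policy (Nash)
    q = maxmin_strategy(-M.T)               # player 2's meta-policy (Nash)
    # Oracle step: exact best response over the full pool against the meta-policy
    br1 = int(np.argmax(phi[:, pool2] @ q))
    br2 = int(np.argmin(p @ phi[pool1, :]))
    if br1 in pool1 and br2 in pool2:       # no new policy found: terminate
        break
    if br1 not in pool1: pool1.append(br1)
    if br2 not in pool2: pool2.append(br2)
```

The restricted population typically stops growing well before all n strategies are added, which is the practical appeal of the double-oracle/PSRO family.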

8 Learning in General-Sum Games

Solving general-sum SGs entails an entirely different level of difficulty than solving team games or zero-sum games. Even in a static two-player normal-form game, finding the NE is known to be PPAD-complete (Chen and Deng, 2006).

8.1 Solutions by Mathematical Programming

To solve a two-player general-sum discounted stochastic game with discrete states and discrete actions, Filar and Vrieze (2012) [Chapter 3.8] formulated the problem as a nonlinear programme; in matrix form, it is written as follows:

$$\begin{aligned}
\min_{V,\pi}\;\; f(V,\pi) =\; & \sum_{i=1}^{2}\mathbf{1}^{\top}_{|S|}\Big[V^i - \big(R^i(\pi) + \gamma\,\mathbf{P}(\pi)V^i\big)\Big]\\
\text{s.t.}\;\;\; \text{(a)}\;\; & \pi^2(s)^{\top}\Big[R^1(s) + \gamma\sum_{s'}\mathbf{P}(s'\,|\,s)V^1(s')\Big] \leq V^1(s)\,\mathbf{1}^{\top}_{|A^1|}, \quad \forall s\in S\\
\text{(b)}\;\; & \Big[R^2(s) + \gamma\sum_{s'}\mathbf{P}(s'\,|\,s)V^2(s')\Big]\pi^1(s) \leq V^2(s)\,\mathbf{1}_{|A^2|}, \quad \forall s\in S\\
\text{(c)}\;\; & \pi^1(s)\geq 0, \quad \pi^1(s)^{\top}\mathbf{1}_{|A^1|} = 1, \quad \forall s\in S\\
\text{(d)}\;\; & \pi^2(s)\geq 0, \quad \pi^2(s)^{\top}\mathbf{1}_{|A^2|} = 1, \quad \forall s\in S
\end{aligned} \qquad (77)$$
where

• $V = \langle V^i : i = 1, 2\rangle$ is the vector of the agents' values over all states, and $V^i = \langle V^i(s): s\in S\rangle$ is the value vector of the i-th agent.

• $\pi = \langle\pi^i : i = 1, 2\rangle$ and $\pi^i = \langle\pi^i(s): s\in S\rangle$, where $\pi^i(s) = \langle\pi^i(a\,|\,s): a\in A^i\rangle$ is the vector representing the stochastic policy of the i-th agent in state $s\in S$.
• $R^i(s) = [R^i(s, a^1, a^2): a^1\in A^1, a^2\in A^2]$ is the reward matrix of the i-th agent in state $s\in S$. The rows correspond to the actions of the second agent, and the columns correspond to those of the first agent. With a slight abuse of notation, we use $R^i(\pi) = R^i(\langle\pi^1, \pi^2\rangle) = \langle\pi^2(s)^{\top}R^i(s)\pi^1(s): s\in S\rangle$ to represent the expected reward vector over all states under the joint policy π.

• $\mathbf{P}(s'\,|\,s) = [P(s'\,|\,s, a): a = \langle a^1, a^2\rangle, a^1\in A^1, a^2\in A^2]$ is a matrix representing the probability of transitioning from the current state $s\in S$ to the next state $s'\in S$. The rows represent the actions of the second agent, and the columns represent those of the first agent. With a slight abuse of notation, we use $\mathbf{P}(\pi) = \mathbf{P}(\langle\pi^1, \pi^2\rangle) = \langle\pi^2(s)^{\top}\mathbf{P}(s'\,|\,s)\pi^1(s): s\in S, s'\in S\rangle$ to represent the expected transition probabilities over all state pairs under the joint policy π.

This is a nonlinear programme because the inequality constraints in the optimisation problem are quadratic in V and π. The objective function in Eq. (77) aims to minimise the TD error for a given policy π over all states, similar to the policy-evaluation step in the traditional policy-iteration method, while constraints (a) and (b) in Eq. (77) act as the policy-improvement step, which holds with equality when the optimal value function is achieved. Finally, constraints (c) and (d) ensure the policy is properly defined. Although the NE is proved to exist in general-sum SGs in the form of stationary strategies, solving Eq. (77) even in the two-player case is notoriously challenging. First, Eq. (77) has a non-convex feasible region; second, only the global optimum50 of Eq. (77) corresponds to the NE of the SG, while common gradient-descent-type methods can only guarantee convergence to a local minimum. Apart from the efforts by Filar and Vrieze (2012), Breton et al. (1986) [Chapter 4] developed a formulation that has nonlinear objectives but linear constraints. Furthermore, Dermed and Isbell (2009) formulated the NE solution as a multi-objective linear programme. Herings and Peeters (2010); Herings et al.

50Note that in the zero-sum case, every local optimum is global.

(2004) proposed an algorithm in which a homotopic path between the equilibrium points of N independent MDPs and the N-player SG is traced numerically. This approach yields a Nash equilibrium point of the stochastic game of interest. However, all these methods are tractable only for small SGs with at most tens of states and only two players.
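To make the matrix notation of Eq. (77) concrete, the sketch below assembles R^i(π), P(π), and the objective f(V, π) for a small randomly generated two-player SG; the instance, the uniform policies, and the zero value vectors are assumptions chosen for illustration, and in practice f together with the quadratic constraints (a)-(d) would be handed to a nonlinear programming solver.

```python
import numpy as np

# Illustrative construction of the objective f(V, pi) in Eq. (77) for a tiny
# randomly generated two-player general-sum SG; the instance, the uniform
# policies, and the zero value vectors are assumptions made for this sketch.
rng = np.random.default_rng(0)
S, A1, A2, gamma = 3, 2, 2, 0.9
R = rng.random((2, S, A2, A1))                  # R[i][s]: rows = a^2, cols = a^1
P = rng.random((S, A2, A1, S)); P /= P.sum(axis=3, keepdims=True)

pi1 = np.ones((S, A1)) / A1                      # player 1's stochastic policy
pi2 = np.ones((S, A2)) / A2                      # player 2's stochastic policy
V = np.zeros((2, S))                             # candidate value vectors

def expected_under_pi(M_s, s):
    """pi^2(s)^T M_s pi^1(s): expectation of a per-state matrix under pi."""
    return pi2[s] @ M_s @ pi1[s]

# R^i(pi) and P(pi): expected reward vector and transition matrix under pi
R_pi = np.array([[expected_under_pi(R[i, s], s) for s in range(S)] for i in range(2)])
P_pi = np.array([[expected_under_pi(P[s, :, :, s2], s) for s2 in range(S)]
                 for s in range(S)])

# objective of Eq. (77): summed per-state Bellman residuals of both players
f = sum(np.ones(S) @ (V[i] - (R_pi[i] + gamma * P_pi @ V[i])) for i in range(2))
print(f)
```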

8.2 Solutions by Value-Based Methods

A series of value-based methods have been proposed to address general-sum SGs. The majority of these methods adopt classic Q-learning (Watkins and Dayan, 1992) as a centralised controller, with the differences lying in which solution concept the central Q-learner applies to guide the agents to converge in each iteration. For example, the Nash-Q learner in Eqs. (19 & 20) applies the NE as the solution concept, the correlated-Q learner adopts the correlated equilibrium (Greenwald et al., 2003), and the friend-or-foe learner considers both the cooperative (see Eq. (35)) and the competitive equilibrium (see Eq. (54)) (Littman, 2001a). Although many algorithms come with convergence guarantees, the corresponding assumptions are often too restrictive to be applicable in general. When Nash-Q learning was first proposed (Hu et al., 1998), it required the NE of the SG to be unique for the convergence property to hold. Though strong, this assumption was still noted by Bowling (2000) to be insufficient to justify the convergence of the Nash-Q algorithm. Later, Hu and Wellman (2003) corrected the convergence proof by tightening the assumption even further: the uniqueness of the NE must hold for every single stage game encountered during state transitions. Years later, a strikingly negative result by Zinkevich et al. (2006) concluded that the entire class of value-iteration methods can be excluded from consideration for computing stationary equilibria, including both the NE and the correlated equilibrium, in general-sum SGs: unlike in single-agent RL, the Q values in the multi-agent case are inherently insufficient for reconstructing the equilibrium policy.
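The common structure behind these value-based learners can be sketched as a centralised Q-update with a pluggable equilibrium-value operator applied to the next-state stage game; the snippet below uses the fully cooperative ("friend") max operator for concreteness, while a Nash, correlated-equilibrium, or minimax solver would be substituted analogously. All names and the toy sizes are our assumptions.

```python
import numpy as np

# Skeleton of centralised value-based MARL: the only moving part across
# Nash-Q, correlated-Q, and friend-or-foe Q-learning is the equilibrium-value
# operator applied to the next-state stage game Q[s', :, :]. The "friend"
# (fully cooperative, max) operator is used here for concreteness; all names
# and the toy sizes are illustrative assumptions.
S, A1, A2, gamma, lr = 4, 2, 2, 0.9, 0.1
Q = np.zeros((S, A1, A2))                       # centralised joint-action values

def friend_value(stage_game):
    """Cooperative solution concept: value of the best joint action (cf. Eq. (35))."""
    return stage_game.max()

def update(Q, s, a1, a2, r, s_next, stage_game_value=friend_value):
    target = r + gamma * stage_game_value(Q[s_next])
    Q[s, a1, a2] += lr * (target - Q[s, a1, a2])

# example transition (s, a1, a2, r, s') fed to the update rule
update(Q, s=0, a1=1, a2=0, r=1.0, s_next=2)
```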

8.3 Solutions by Two-Timescale Analysis

In addition to the centralised Q-learning approach, decentralised Q-learning algorithms have recently received considerable attention because of their potential for scalability. Although independent learners have been accused of having convergence issues (Tan, 1993), decentralised methods have made substantial progress with the help of two-timescale

stochastic analysis (Borkar, 1997) and its application in RL (Borkar, 2002).
Two-timescale stochastic analysis is a set of tools certifying that, in a system with two coupled stochastic processes that evolve at different speeds, if the fast process converges to a unique limit point for any particular fixed value of the slow process, then one can, quantitatively, analyse the asymptotic behaviour of the algorithm as if the fast process were always fully calibrated to the current value of the slow process (Borkar, 1997). As a direct application, Leslie et al. (2003); Leslie and Collins (2005) noted that independent Q-learners with agent-dependent learning rates can break the symmetry that leads to non-convergent limit cycles; as a result, they converge almost surely to the NE in two-player collaboration games, two-player zero-sum games, and multi-player matching pennies. Similarly, Prasad et al. (2015) introduced a two-timescale update rule that ensures the training dynamics reach a stationary local NE in general-sum SGs if the critic learns faster than the actor. Later, Perkins et al. (2015) proposed a distributed actor-critic algorithm that enjoys provable convergence in solving static potential games with continuous actions. Similarly, Arslan and Yüksel (2016) developed a two-timescale variant of Q-learning that is guaranteed to converge to an equilibrium in SGs with the weakly acyclic property, which generalises potential games. Other applications include developing two-timescale update rules for training GANs (Heusel et al., 2017) and developing a two-timescale algorithm with guaranteed asymptotic convergence to the Stackelberg equilibrium in general-sum Stackelberg games.
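A schematic illustration of the two-timescale idea on a toy problem is given below: the critic uses a larger, faster step size than the actor, so that on the actor's timescale the critic behaves as if fully calibrated. The bandit problem, reward means, and step-size schedules are illustrative assumptions, not taken from any of the cited algorithms.

```python
import numpy as np

# Schematic two-timescale actor-critic on a toy two-action bandit: the critic
# (value estimates) moves on the fast timescale, the actor (softmax policy)
# on the slow one, so the actor effectively sees a converged critic. The
# problem, reward means, and step-size schedules are illustrative assumptions.
rng = np.random.default_rng(0)
reward_mean = np.array([1.0, 2.0])     # unknown to the learner
theta = np.zeros(2)                    # actor parameters (softmax logits)
q = np.zeros(2)                        # critic estimates of action values

for t in range(1, 100001):
    alpha = 1.0 / t                    # slow (actor) step size
    beta = 1.0 / t ** 0.6              # fast (critic) step size: beta >> alpha
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = reward_mean[a] + rng.standard_normal()
    # critic: fast stochastic approximation of the action values
    q[a] += beta * (r - q[a])
    # actor: slow policy-gradient step, using the critic as if it had converged
    grad_log_pi = -pi; grad_log_pi[a] += 1.0
    theta += alpha * q[a] * grad_log_pi

print(pi)    # shifts towards the higher-reward action
```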

8.4 Solutions by Policy-Based Methods

Convergence to the NE via direct policy search has been extensively studied; however, early results were limited mainly to stateless two-player two-action games (Abdallah and Lesser, 2008; Bowling, 2005; Bowling and Veloso, 2002; Conitzer and Sandholm, 2007; Singh et al., 2000; Zhang and Lesser, 2010). Recently, GAN training has posed a new challenge, thereby rekindling interest in understanding the policy-gradient dynamics of continuous games (Heusel et al., 2017; Mescheder et al., 2018, 2017; Nagarajan and Kolter, 2017).
Analysing gradient-based algorithms through dynamical systems (Shub, 2013) is a natural approach to gaining more significant insights into convergence behaviour. However, a fundamental difference is observed when one attempts to apply the same analysis from the single-agent case to the multi-agent case, because the combined dynamics of gradient-

based learning schemes in multi-agent games do not necessarily correspond to a proper gradient flow, a critical premise for almost-sure convergence to a local minimum. In fact, the difficulty of solving general-sum continuous games is exacerbated by the usage of deep networks with stochastic gradient descent. In this context, a key equilibrium concept of interest is the local NE (Ratliff et al., 2013) or differential NE (Ratliff et al., 2014), defined as follows.

Definition 10 (Local Nash Equilibrium) Consider an N-player continuous game denoted by $\{\ell_i: \mathbb{R}^d\to\mathbb{R}\}_{i\in\{1,...,N\}}$, with each agent's loss $\ell_i$ being twice continuously differentiable; the parameters are $w = (w_1, ..., w_N)\in\mathbb{R}^d$, and each player controls $w_i\in\mathbb{R}^{d_i}$, $\sum_i d_i = d$. Let $\xi(w) = \big(\nabla_{w_1}\ell_1, \ldots, \nabla_{w_N}\ell_N\big)\in\mathbb{R}^d$ be the simultaneous gradient of the losses w.r.t. the parameters of the respective players, and let $H(w) := \nabla_w\cdot\xi(w)^{\top}$ be the $(d\times d)$ Hessian matrix of the gradient, written as

$$H(w) = \begin{pmatrix}\nabla^2_{w_1}\ell_1 & \nabla^2_{w_1, w_2}\ell_1 & \cdots & \nabla^2_{w_1, w_N}\ell_1\\ \nabla^2_{w_2, w_1}\ell_2 & \nabla^2_{w_2}\ell_2 & \cdots & \nabla^2_{w_2, w_N}\ell_2\\ \vdots & & \ddots & \vdots\\ \nabla^2_{w_N, w_1}\ell_N & \nabla^2_{w_N, w_2}\ell_N & \cdots & \nabla^2_{w_N}\ell_N\end{pmatrix},$$
where $\nabla^2_{w_i, w_j}\ell_k$ is the $(d_i\times d_j)$ block of second-order derivatives. A differential NE of the game is a point $w^*$ such that $\xi(w^*) = 0$ and $\nabla^2_{w_i}\ell_i \succ 0$, $\forall i\in\{1, ..., N\}$; furthermore, such a point is a local NE if $\det H(w^*)\neq 0$.
A recent result by Mazumdar and Ratliff (2018) suggests that gradient-based algorithms almost surely avoid a subset of local NE in general-sum games; even worse, there exist non-Nash stationary points to which they can converge. As a tentative treatment, Balduzzi et al. (2018a) applied the Helmholtz decomposition51 to split the game Hessian H(w) into a potential part plus a Hamiltonian part. Based on this decomposition, they designed a gradient-based method to address each part and combined them into the symplectic gradient adjustment (SGA), which is able to find all local NE in zero-sum games and a subset of local NE in general-sum games. More recently, Chasnov et al. (2019) separately considered the cases of 1) agents with oracle access to the exact gradient ξ(w) and 2) agents with only an unbiased estimator of ξ(w). In the first case, they provided asymptotic and finite-time

51This approach is similar in ideology to the work by Candogan et al.(2011), where they leverage the combinatorial Hodge decomposition to decompose any multi-player normal-form game into a potential game plus a harmonic game. However, their equivalence is an open question.

convergence rates for the gradient-based learning process to reach the differential NE. In the second case, they derived concentration bounds guaranteeing, with high probability, that agents will converge to a neighbourhood of a stable local NE in finite time. In the same framework, Fiez et al. (2019) studied Stackelberg games in which agents take turns to conduct the gradient update rather than acting simultaneously, and established conditions under which the equilibrium points of simultaneous gradient descent are Stackelberg equilibria in zero-sum games. Mertikopoulos and Zhou (2019) investigated the local convergence of no-regret learning and found that a local NE is attracting under gradient play if and only if the NE satisfies a property known as variational stability. This idea is inspired by the seminal notion of evolutionary stability observed in animal populations (Smith and Price, 1973). Finally, it is worth highlighting that the above theoretical analyses of gradient-based methods on stateless continuous games cannot be taken for granted in SGs. The main reason is that the assumption on the differentiability of the loss function required in continuous games may not hold in general-sum SGs. As clearly noted by Fazel et al. (2018); Mazumdar et al. (2019a); Zhang et al. (2019c), even in the extreme setting of linear-quadratic games, the value functions are not guaranteed to be globally smooth (w.r.t. each agent's policy parameters).
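A minimal numerical illustration of why gradient play in games is not a gradient flow: on the bilinear zero-sum game ℓ1 = xy, ℓ2 = -xy, the simultaneous gradient ξ(w) = (y, -x) is a pure rotation around the equilibrium at the origin, and Euler-discretised simultaneous gradient descent spirals outwards. The game and the step size are assumptions made only for this sketch.

```python
import numpy as np

# The combined dynamics of gradient play need not be a gradient flow. On the
# bilinear zero-sum game l1(x, y) = x*y, l2(x, y) = -x*y, the simultaneous
# gradient is xi(w) = (y, -x), a pure rotation around the Nash point (0, 0);
# Euler-discretised simultaneous gradient descent therefore spirals outwards.
# The game and the step size are illustrative assumptions.
x, y, lr = 1.0, 1.0, 0.05
radii = []
for t in range(200):
    gx = y          # d l1 / d x
    gy = -x         # d l2 / d y
    x, y = x - lr * gx, y - lr * gy
    radii.append(np.hypot(x, y))

print(radii[0], radii[-1])   # the distance to the equilibrium grows over time
```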

9 Learning in Games when N → +∞

As detailed in Section 4, designing learning algorithms in a multi-agent system with N ≫ 2 is a challenging task. One major reason is that the solution concept, such as the Nash equilibrium, is in general difficult to compute due to the curse of dimensionality of the multi-agent problem itself. However, if one considers a continuum of agents with N → +∞, then the learning problem becomes surprisingly tractable. The intuition is that one can effectively transform a many-body interaction problem into a two-body interaction problem (i.e., a single agent vs the population mean) via mean-field approximation.
The idea of mean-field approximation, which considers the behaviour of large numbers of particles where each individual particle has a negligible impact on the system, originated from physics. Important applications include solving Ising models52 (Kadanoff,

52An Ising model is a model used to study magnetic phase transitions under different system temperatures. In a 2D Ising model, one can imagine the magnetic spins being laid out on a lattice, and each spin can have one of two directions, either up or down. When the system temperature is high, the direction


Figure 11: Relations of mean-field learning algorithms in games with large N.

2009; Weiss, 1907) and, more recently, understanding the learning dynamics of over-parameterised deep neural networks (Hu et al., 2019; Lu et al., 2020b; Sirignano and Spiliopoulos, 2020; Song et al., 2018). In the game theory and MARL context, the mean-field approximation essentially enables one to think of the interactions between every possible permutation of agents as an interaction between each single agent and the aggregated mean effect of the population of the other agents, such that the N-player game (N → +∞) turns into a "two"-player game. Moreover, under the law of large numbers and the theory of propagation of chaos (Gärtner, 1988; McKean, 1967; Sznitman, 1991), the aggregated version of the optimisation problem in Eq. (80) asymptotically approximates the original N-player game.
The assumption in the mean-field regime that each agent responds only to the mean effect of the population may appear rather limited at first; however, in many real-world applications, agents often cannot access the information of all other agents but can instead observe global information about the population. For example, in high-frequency trading in finance (Cardaliaguet and Lehalle, 2018; Lehalle and Mouzouni, 2019), each trader cannot know every other trader's position in the market, although they have access to the aggregated order book from the exchange. Another example is real-time bidding for online advertisements (Guo et al., 2019; Iyer et al., 2014), in which

of the spins is chaotic, and when the temperature is low, the directions of the spins tend to be aligned. Without the mean-field approximation, computing the probability of the spin directions is a combinatorially hard problem; for example, in a 5 × 5 2D lattice, there are $2^{25}$ possible spin configurations. A successful approach to solving the Ising model is to observe the phase change under different temperatures and compare it against the ground truth.

participants can only observe, for example, the second-best price that wins the auction but not the individual bids from the other participants.
There is a subtlety associated with the type of game to which one applies mean-field theory. If one applies mean-field theory to non-cooperative53 games, in which agents act independently to maximise their own individual rewards and the solution concept is the NE, then the scenario is usually referred to as a mean-field game (MFG) (Guéant et al., 2011; Huang et al., 2006; Jovanovic and Rosenthal, 1988; Lasry and Lions, 2007). If one applies mean-field theory to cooperative games in which there exists a central controller that controls all agents cooperatively to reach some Pareto optimum, then the setting is usually referred to as mean-field control (MFC) (Andersson and Djehiche, 2011; Bensoussan et al., 2013), or McKean-Vlasov (MKV) control. If one applies the mean-field approximation to solve a standard SG through MARL, specifically, to factorise each agent's reward function or the joint Q-function such that they depend only on the agent's local state and the mean action of the others, then the approach is called mean-field MARL (MF-MARL) (Subramanian et al., 2020; Yang et al., 2018b; Zhou et al., 2019).
Despite the difference in the applicable game types, technically, the differences among MFG/MFC/MF-MARL can be elaborated from the perspective of the order in which the equilibrium is learned (optimised) and the limit N → +∞ is taken (Carmona et al., 2013). MFG learns the equilibrium of the game first and then takes the limit N → +∞, while MFC takes the limit first and optimises the equilibrium later. MF-MARL lies somewhere in between: the mean field in MF-MARL refers to the empirical average of the states and/or actions of a finite population; N does not have to reach infinity, though the approximation converges asymptotically to the original game when N is large. This is in contrast to the mean field in MFG and MFC, which is essentially a probability distribution over the states and/or actions of an infinite population (i.e., the McKean-Vlasov dynamics). Before providing more details, we summarise the relationships among MFG, MFC, and MF-MARL in Figure 11. Readers are recommended to revisit their differences after finishing the subsections below.

53Note that the word "non-cooperative" does not mean agents cannot collaborate to complete a task; it means agents cannot collude to form a coalition: they have to behave independently.

9.1 Non-cooperative Mean-Field Game

MFGs have been widely studied in different domains, including physics, economics, and stochastic control (Carmona et al., 2018; Guéant et al., 2011). An intuitive example that quickly illustrates the idea of an MFG is the problem of when does the meeting start (Guéant et al., 2011). For a meeting in the real world, people often schedule a calendar time t in advance, and the actual start time T depends on when the majority of participants (e.g., 90%) arrive. Each participant plans to arrive at $\tau^i$, and the actual arrival time, $\tilde{\tau}^i = \tau^i + \sigma^i\epsilon^i$, is influenced by some uncontrolled factors $\sigma^i\epsilon^i$, $\epsilon^i\sim\mathcal{N}(0, 1)$, such as weather or traffic. Assuming all players are rational, they do not want to be later than either t or T; moreover, they do not want to arrive too early and have to wait. The cost function of each individual can be written as $c^i(t, T, \tilde{\tau}^i) = \mathbb{E}\big[\alpha\lfloor\tilde{\tau}^i - t\rfloor_+ + \beta\lfloor\tilde{\tau}^i - T\rfloor_+ + \gamma\lfloor T - \tilde{\tau}^i\rfloor_+\big]$, where α, β, γ are constants. The key questions to ask are when is the best time for an agent to arrive and, as a result, when will the meeting actually start, i.e., what is T?
The challenge of the above problem lies in the coupled relationship between T and $\tau^i$; that is, in order to compute T, we need to know $\tau^i$, which is itself based on T. Therefore, solving for the time T is essentially equivalent to finding the fixed point, if it exists, of the stochastic process that generates T. In fact, T can be effectively computed through a two-step iterative process, whose steps we denote by $\Gamma^1$ and $\Gamma^2$. In $\Gamma^1$, given the current54 value of T, each agent solves for its optimal arrival time $\tau^i$ by minimising its cost $c^i(t, T, \tilde{\tau}^i)$. In $\Gamma^2$, the agents calibrate the new estimate of T based on all the $\tau^i$ values computed in $\Gamma^1$. $\Gamma^1$ and $\Gamma^2$ continue iterating until T converges to a fixed point, i.e., $\Gamma^2\circ\Gamma^1(T^*) = T^*$. The key insight is that the interaction with the other agents is captured simply by a mean-field quantity. Since the meeting starts only when 90% of the people have arrived, if one considers a continuum of players with N → +∞, T becomes the 90% quantile of a distribution, and each agent can easily find its best response. This contrasts with the case of a finite number of players, in which the order statistic is intractable, especially when N is large (but still finite).
Approximating an N-player SG by letting N → +∞ and letting each player choose an optimal strategy in response to the population's macroscopic information (i.e., the mean field), though analytically friendly, is not cost-free. In fact, MFG makes two major assumptions: 1) the impact of each player's action on the outcome is infinitesimal,

54At time step 0, it can be a random guess. Since the fixed point exists, the final convergence result is irrelevant to the initial guess.

resulting in all agents being identical, interchangeable, and indistinguishable; 2) each player maintains only weak interactions with the others, through a mean field denoted by $L^i\in\Delta_{|S||A|}$, which is essentially a population state-action joint distribution,
$$L^i = \big(\mu^{-i}(\cdot),\, \alpha^{-i}(\cdot)\big) = \lim_{N\to+\infty}\left(\frac{\sum_{j\neq i}\mathbb{1}(s^j = \cdot)}{N-1},\; \frac{\sum_{j\neq i}\mathbb{1}(a^j = \cdot)}{N-1}\right), \qquad (78)$$
where $s^j$ and $a^j$ are player j's local state55 and local action. Therefore, for SGs that do not satisfy the homogeneity assumption56 and the weak-interaction assumption, MFG is not an effective approximation. Furthermore, since agents have no identity in an MFG, one can choose a representative agent (the agent index is thus omitted) and write the formulation57 of the MFG as

$$V\big(s, \pi, \{L_t\}_{t=0}^{\infty}\big) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R\big(s_t, a_t, L_t\big)\,\Big|\, s_0 = s\right]$$
$$\text{subject to}\quad s_{t+1}\sim P\big(\cdot\,\big|\,s_t, a_t, L_t\big), \qquad a_t\sim\pi_t\big(\cdot\,\big|\,s_t\big). \qquad (79)$$

Each agent applies a local policy58 $\pi_t: S\to\Delta(A)$, which assumes the population state is not observable. Note that both the reward function and the transition dynamics depend on the sequence of mean-field terms $\{L_t\}_{t=0}^{\infty}$. From each agent's perspective, the MDP is therefore time-varying and is determined by all other agents. The solution concept in MFG is a variant of the (Markov perfect) NE named the mean-field equilibrium, which is a pair $\{\pi_t^*, L_t^*\}_{t\geq 0}$ that satisfies two conditions: 1) for fixed $L^* = \{L_t^*\}$, $\pi^* = \{\pi_t^*\}$ is the optimal policy, that is, $V(s, \pi^*, L^*)\geq V(s, \pi, L^*)$, $\forall\pi, \forall s$; 2) $L^*$ matches the mean field generated when all agents follow $\pi^*$. The two-step iteration process of the meeting start-time example, applied to the MFG, is then expressed as $\Gamma^1(L_t) = \pi_t^*$ and $\Gamma^2(L_t, \pi_t^*) = L_{t+1}$, and it terminates when $\Gamma^2\circ\Gamma^1(L^*) = L^*$. The mean-field

55Note that in mean-field learning in games, the state is not assumed to be global. This is different from Dec-POMDPs, in which there exists an observation function that maps the global state to a local observation for each agent. 56In fact, the homogeneity in MFG can be relaxed to allow agents to have (finitely many) different types (Lacker and Zariphopoulou, 2019), though within each type, agents must be homogeneous. 57MFG is more commonly formulated in a continuous-time setting in the domain of optimal control, where it is typically composed of a backward Hamilton-Jacobi-Bellman equation (the Bellman equation in RL is its discrete-time counterpart) that describes the optimal control problem of an individual agent, and a forward Fokker-Planck equation that describes the dynamics of the aggregate distribution (i.e., the mean field) of the population. 58A general non-local policy $\pi(s, L): S\times\Delta_{|S||A|}\to\Delta(A)$ is also valid for MFG, and it makes the learning easier by assuming L is fully observable.

Mean-field equilibrium is essentially a fixed point of the MFG; its existence for discrete-time (see footnote 59) discounted MFGs has been verified by Saldi et al. (2018) in the infinite-population limit $N \to +\infty$ and also in the partially observable setting (Saldi et al., 2019). However, these works consider the case where the mean field includes only the population state. Recently, Guo et al. (2019) demonstrated the existence of NE in MFG, taking into account both the population state and action distributions. In addition, they proved that if $\Gamma^1$ and $\Gamma^2$ meet small-parameter conditions (Huang et al., 2006), then the NE is unique in the sense of $L^*$. In terms of uniqueness, a common result is based on assuming monotonic cost functions (Lasry and Lions, 2007). In general, MFGs admit multiple equilibria (Nutz et al., 2020); the reachability of multiple equilibria has been studied when the cost functions are anti-monotonic (Cecchin et al., 2019) or quadratic (Delarue and Tchuendom, 2020).

Based on the two-step fixed-point iteration in MFGs, various model-free RL algorithms have been proposed for learning the NE. The idea is that in step $\Gamma^1$, one can approximate the optimal $\pi_t$ given $L_t$ through single-agent RL algorithms (see footnote 60) such as (deep) Q-learning (Anahtarcı et al., 2019; Anahtarci et al., 2020; Guo et al., 2019), (deep) policy-gradient methods (Elie et al., 2020; Guo et al., 2020; Subramanian and Mahajan, 2019; uz Zaman et al., 2020), and actor-critic methods (Fu et al., 2019; Yang et al., 2019b). Then, in step $\Gamma^2$, one can compute the forward $L_{t+1}$ by sampling from the new $\pi_t$ directly or via fictitious play (Cardaliaguet and Hadikhanloo, 2017; Elie et al., 2019; Hadikhanloo and Silva, 2019). A surprisingly good result is that the sample complexity of both value-based and policy-based learning methods for MFG in fact shares the same order of magnitude as that of single-agent RL algorithms (Guo et al., 2020). However, one major subtlety of these learning algorithms for MFGs is how to obtain stable samples of $L_{t+1}$. For example, Guo et al. (2020) discovered that applying a softmax policy for each agent and projecting the mean-field quantity onto an $\epsilon$-net with a finite cover help to significantly stabilise the forward propagation of $L_{t+1}$.

59 The existence of equilibrium in continuous-time MFGs is widely studied in the area of stochastic control (Cardaliaguet et al., 2015; Carmona and Delarue, 2013; Carmona et al., 2016, 2015b; Fischer et al., 2017; Huang et al., 2006; Lacker, 2015, 2018; Lasry and Lions, 2007), though it may be of less interest to RL researchers.

60 Since agents in MFG are homogeneous, if the representative agent reaches convergence, then the joint policy is the NE. Additionally, given $L_t$, the MDP faced by the representative agent is stationary.
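As a minimal illustration of this $\Gamma^1/\Gamma^2$ learning loop (with assumed toy dynamics, not an algorithm from the cited papers), the sketch below solves a small congestion-style MFG: $\Gamma^1$ computes a best response to a fixed mean field by value iteration, a softmax (Boltzmann) policy smooths that best response, and $\Gamma^2$ updates the mean field, here simplified to the population state distribution only, with a fictitious-play-style damped average.

```python
import numpy as np

# Toy congestion MFG (illustrative assumptions): nS locations, an action moves
# the agent to the chosen location, and the reward penalises crowding: -mu[a].
nS, nA, gamma, beta = 5, 5, 0.9, 5.0

def best_response_Q(mu, n_iter=200):
    """Step Gamma^1: value iteration for the MDP induced by a fixed mean field mu."""
    Q = np.zeros((nS, nA))
    for _ in range(n_iter):
        V = Q.max(axis=1)                       # V(s') for every next state s' = a
        Q = np.tile(-mu + gamma * V, (nS, 1))   # Q(s, a) = -mu[a] + gamma * V(a)
    return Q

def softmax_policy(Q):
    """Boltzmann smoothing of the best response (stabilises the iteration)."""
    z = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return z / z.sum(axis=1, keepdims=True)     # pi[s, a]

def induced_mean_field(pi, n_iter=200):
    """Step Gamma^2: stationary state distribution when every agent follows pi."""
    mu = np.full(nS, 1.0 / nS)
    for _ in range(n_iter):
        mu = mu @ pi                            # s' = a, so pi is the transition matrix
    return mu

# Damped fixed-point iteration Gamma^2 o Gamma^1 on the mean field.
mu = np.array([0.6, 0.1, 0.1, 0.1, 0.1])        # arbitrary non-uniform start
for _ in range(200):
    pi = softmax_policy(best_response_Q(mu))
    mu_new = 0.9 * mu + 0.1 * induced_mean_field(pi)   # fictitious-play-style averaging
    if np.abs(mu_new - mu).max() < 1e-8:
        break
    mu = mu_new
print("approximate mean-field equilibrium state distribution:", np.round(mu, 3))
```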

9.2 Cooperative Mean-Field Control

MFC maintains the same homogeneity assumption and weak interaction assumption as MFG. However, unlike MFG, in which each agent behaves independently, in MFC there is a central controller that coordinates all agents' behaviours. In cooperative multi-agent learning, assuming each agent observes only a local state, the central controller maximises the aggregated cumulative reward:

$$\sup_{\pi}\ \mathbb{E}_{s_{t+1}\sim P,\ a_t\sim\pi}\left[\frac{1}{N}\sum_{i=1}^{N}\sum_{t}\gamma^t R^i(s_t, a_t)\ \Big|\ s_0 = s\right]. \tag{80}$$

Solving Eq. (80) is a combinatorial problem. Clearly, the sample complexity of applying the Q-learning algorithm grows exponentially in $N$ (Even-Dar and Mansour, 2003). To avoid the curse of dimensionality in $N$, MFC (Carmona et al., 2018; Gu et al., 2019) pushes $N \to +\infty$, and under the law of large numbers and the theory of propagation of chaos (Gärtner, 1988; McKean, 1967; Sznitman, 1991), the optimisation problem in Eq. (80), from the view of a representative agent, can be equivalently written as

$$\sup_{\pi}\ \mathbb{E}\left[\sum_{t}\gamma^t \tilde{R}(s_t, a_t, \mu_t, \alpha_t)\ \Big|\ s_0 \sim \mu\right]$$
$$\text{subject to}\quad s_{t+1} \sim P\big(\cdot \mid s_t, a_t, \mu_t, \alpha_t\big), \qquad a_t \sim \pi_t\big(s_t, \mu_t\big), \tag{81}$$
in which $(\mu_t, \alpha_t)$ are the respective state and action marginal distributions of the mean-field quantity, $\mu_t(\cdot) = \lim_{N\to+\infty}\sum_{i=1}^{N}\mathbf{1}(s^i_t = \cdot)/N$, $\alpha_t(\cdot) = \sum_{s\in S}\mu_t(s)\cdot\pi_t(s, \mu_t)(\cdot)$, and $\tilde{R} = \lim_{N\to+\infty}\sum_i R^i/N$. The MFC approach is attractive not only because the dimension of MFC is independent of $N$, but also because MFC has been shown to approximate the original cooperative game in terms of both game values and optimal strategies (Lacker, 2017; Motte and Pham, 2019).

Although the MFC formulation in Eq. (81) appears similar to the MFG formulation in Eq. (79), their underlying physical meanings are fundamentally different. As illustrated in Figure 11, the difference is which operation is performed first: learning the equilibrium of the $N$-player game or taking the limit $N \to +\infty$. In the fixed-point iteration of MFG, one first assumes $L_t$ is given and then lets the (infinite) number of agents find the best response to $L_t$, while in MFC, one assumes an infinite number of agents to avoid the curse of dimensionality in cooperative MARL and then finds the optimal policy for each

agent from a central controller perspective. In addition, in contrast to the mean-field NE in MFG, the solution concept for the central controller in MFC is the Pareto optimum (a subset of NE), an equilibrium point at which no individual can be made better off without making others worse off. Finally, other differences between MFG and MFC can be found in Carmona et al. (2013).

In MFC, since the marginal distribution of states serves as an input to the agent's policy and is no longer assumed to be known in each iteration (in contrast to MFG), the dynamic programming principle no longer holds due to the non-Markovian nature of the problem (Andersson and Djehiche, 2011; Buckdahn et al., 2011; Carmona et al., 2015a). That is, MFC problems are inherently time-inconsistent. A counter-example showing the failure of standard Q-learning in MFC can be found in Gu et al. (2019). One solution is to learn the MFC by adding common noise to the underlying dynamics such that the existing theory on learning MDPs with stochastic dynamics, such as Q-learning, can be applied (Carmona et al., 2019b). In the special class of linear-quadratic MFCs, Carmona et al. (2019a) studied the policy-gradient method and its convergence, and Luo et al. (2019) explored an actor-critic algorithm. However, this approach of adding common noise still suffers from high sample complexity and weak empirical performance (Gu et al., 2019). Importantly, applying dynamic programming in this setting lacks rigorous verification, leaving aside the measurability issues and the existence of a stationary optimal policy.

Another way to address the time inconsistency in MFCs is to consider an enlarged state-action space (Djete et al., 2019; Gu et al., 2019; Laurière and Pironneau, 2014; Pham and Wei, 2016, 2017, 2018). This technique is also called "lift up", which essentially means lifting the state space and the action space into their corresponding probability measure spaces, in which dynamic programming principles hold. For example, Gu et al. (2019) and Motte and Pham (2019) proposed to lift the finite state-action spaces $S$ and $A$ to a compact state-action space embedded in Euclidean space, denoted by $\mathcal{C} := \Delta(S) \times H$ and $H := \{h : S \to \Delta(A)\}$, and the optimal Q-function associated with the MFC problem in Eq. (81) is

$$Q_C(\mu, h) = \sup_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t \tilde{R}\big(s_t, a_t, \mu_t, \alpha_t\big)\ \Big|\ s_0 \sim \mu,\ a_0 \sim \alpha,\ a_t \sim \pi_t\right], \qquad \forall (\mu, h) \in \mathcal{C}. \tag{82}$$

The physical meaning of $H$ is that it is the set of all possible local policies $h : S \to \Delta(A)$ over all different states. Note that after the lift up, the mean-field term $\mu_t$ in $\pi_t$ of Eq. (81) no longer exists as an input to $h$. Although the support of each $h$ is $|\Delta(A)|^{|S|}$, it proves to be the minimum space under which the Bellman equation can hold. The Bellman equation for $Q_C : \mathcal{C} \to \mathbb{R}$ is
$$Q_C(\mu, h) = R(\mu, h) + \gamma \sup_{\tilde{h}\in H} Q_C\big(\Phi(\mu, h), \tilde{h}\big), \tag{83}$$
where $R$ and $\Phi$ are the lifted reward function and transition dynamics, written as
$$R(\mu, h) = \sum_{s\in S}\sum_{a\in A}\tilde{R}\big(s, a, \mu, \alpha(\mu, h)\big)\cdot\mu(s)\cdot h(s)(a), \tag{84}$$
$$\Phi(\mu, h) = \sum_{s\in S}\sum_{a\in A}P\big(\cdot \mid s, a, \mu, \alpha(\mu, h)\big)\cdot\mu(s)\cdot h(s)(a), \tag{85}$$
with $\alpha(\mu, h)(\cdot) := \sum_{s\in S}\mu(s)\cdot h(s)(\cdot)$ representing the action marginal distribution of the mean-field quantity. The optimal value function is $V^*(\mu) = \max_{h\in H} Q_C(\mu, h)$. Since both $\mu$ and $h$ are probability distributions, the difficulty of learning MFC then becomes how to deal with continuous state and continuous action inputs to $Q_C(\mu, h)$, which is still an open research question. Gu et al. (2020) tried to discretise the lifted space $\mathcal{C}$ through an $\epsilon$-net and then adopted kernel regression on top of the discretisation; impressively, the sample complexity of the induced Q-learning algorithm is independent of the number of agents $N$.
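To make Eqs. (83)-(85) concrete, the sketch below (toy dynamics, reward, and grid sizes are assumptions for illustration; the $\epsilon$-net discretisation only loosely mimics the idea attributed to Gu et al. (2020), not their algorithm) computes the lifted reward $R(\mu, h)$ and transition $\Phi(\mu, h)$ for a 2-state, 2-action problem and runs the Bellman backup of Eq. (83) over a coarse grid of $(\mu, h)$ pairs.

```python
import numpy as np
from itertools import product

nS, nA, gamma = 2, 2, 0.9
# Assumed environment: P_env[s, a] is the next-state distribution.
P_env = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.6, 0.4], [0.1, 0.9]]])            # shape (nS, nA, nS)

def R_tilde(s, a, mu, alpha):
    """Illustrative per-agent reward: base term minus state/action crowding."""
    base = np.array([[1.0, 0.0], [0.0, 1.0]])
    return base[s, a] - mu[s] - 0.1 * alpha[a]

def aggregate(mu, h):
    """Eqs. (84)-(85): lifted reward R(mu, h) and lifted transition Phi(mu, h)."""
    alpha = mu @ h                                       # action marginal alpha(mu, h)
    R, Phi = 0.0, np.zeros(nS)
    for s, a in product(range(nS), range(nA)):
        w = mu[s] * h[s, a]
        R += R_tilde(s, a, mu, alpha) * w
        Phi += P_env[s, a] * w
    return R, Phi

# Coarse epsilon-net over the lifted space C = Delta(S) x H.
mu_grid = [np.array([p, 1 - p]) for p in np.linspace(0, 1, 11)]
h_grid = [np.array([[p, 1 - p], [q, 1 - q]])
          for p in np.linspace(0, 1, 6) for q in np.linspace(0, 1, 6)]

def project(mu):
    """Project a mean field onto the nearest grid point of the epsilon-net."""
    return int(np.argmin([np.abs(mu - m).sum() for m in mu_grid]))

# Value iteration on Q_C(mu, h) following Eq. (83).
Q = np.zeros((len(mu_grid), len(h_grid)))
for _ in range(200):
    Q_new = np.empty_like(Q)
    for i, mu in enumerate(mu_grid):
        for j, h in enumerate(h_grid):
            R, Phi = aggregate(mu, h)
            Q_new[i, j] = R + gamma * Q[project(Phi)].max()   # sup over h~ in H
    Q = Q_new
print("optimal lifted values V*(mu) on the grid:", np.round(Q.max(axis=1), 2))
```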

9.3 Mean-Field MARL

The scalability issue of multi-agent learning in non-cooperative general-sum games can also be alleviated by applying the mean-field approximation directly to each agent's Q-function (Subramanian et al., 2020; Yang et al., 2018b; Zhou et al., 2019). In fact, Yang et al. (2018b) were the first to combine mean-field theory with a MARL algorithm. The idea is to first factorise the Q-function using only the local pairwise interactions between agents (see Eq. (86)) and then apply the mean-field approximation; specifically, one can write a neighbouring agent's action $a^k$ as the sum of the mean action $\bar{a}^j$ and a fluctuation term $\delta a^{j,k}$, i.e., $a^k = \bar{a}^j + \delta a^{j,k}$ with $\bar{a}^j = \frac{1}{N^j}\sum_{k} a^k$, in which $\mathcal{N}(j)$ is the set of neighbouring agents of the learning agent $j$ and $N^j = |\mathcal{N}(j)|$ is its size. With the above two steps, we arrive at the mean-field Q-function $Q^j(s, a^j, \bar{a}^j)$ that approximates

$Q^j(s, \boldsymbol{a})$ as follows

$$Q^j(s, \boldsymbol{a}) = \frac{1}{N^j}\sum_{k} Q^j\big(s, a^j, a^k\big) \tag{86}$$
$$= \frac{1}{N^j}\sum_{k}\Big[Q^j\big(s, a^j, \bar{a}^j\big) + \nabla_{\bar{a}^j}Q^j\big(s, a^j, \bar{a}^j\big)\cdot\delta a^{j,k} + \tfrac{1}{2}\,\delta a^{j,k}\cdot\nabla^2_{\tilde{a}^{j,k}}Q^j\big(s, a^j, \tilde{a}^{j,k}\big)\cdot\delta a^{j,k}\Big] \tag{87}$$
$$= Q^j\big(s, a^j, \bar{a}^j\big) + \nabla_{\bar{a}^j}Q^j\big(s, a^j, \bar{a}^j\big)\cdot\Big[\frac{1}{N^j}\sum_{k}\delta a^{j,k}\Big] + \frac{1}{2N^j}\sum_{k}\delta a^{j,k}\cdot\nabla^2_{\tilde{a}^{j,k}}Q^j\big(s, a^j, \tilde{a}^{j,k}\big)\cdot\delta a^{j,k} \tag{88}$$
$$= Q^j\big(s, a^j, \bar{a}^j\big) + \frac{1}{2N^j}\sum_{k} R^j_{s,a^j}\big(a^k\big)\ \approx\ Q^j\big(s, a^j, \bar{a}^j\big). \tag{89}$$
The second term in Eq. (88) is zero by the definition of $\bar{a}^j$, and the third term, written as the Taylor remainder $R^j_{s,a^j}(a^k)$ in Eq. (89), can be bounded if the Q-function is smooth and is neglected on purpose. The mean-field action $\bar{a}^j$ can be interpreted as the empirical distribution of the actions taken by agent $j$'s neighbours. However, unlike the mean-field quantity in MFG or MFC, this quantity does not require an infinite population of agents, which is more friendly to many real-world tasks, although a large $N$ reduces the approximation error between $a^k$ and $\bar{a}^j$ due to the law of large numbers. In addition, the mean-field term in MF-MARL does not include the state distribution, unlike in MFG or MFC. Based on the mean-field Q-function, one can write the Q-learning update as

$$Q^j_{t+1}\big(s, a^j, \bar{a}^j\big) = (1-\alpha)\,Q^j_t\big(s, a^j, \bar{a}^j\big) + \alpha\Big[R^j + \gamma\, v^{j,\mathrm{MF}}_t\big(s'\big)\Big],$$
$$v^{j,\mathrm{MF}}_t\big(s'\big) = \sum_{a^j}\pi^j_t\big(a^j \mid s', \bar{a}^j\big)\ \mathbb{E}_{\bar{a}^j(\boldsymbol{a}^{-j})\sim\boldsymbol{\pi}^{-j}_t}\Big[Q^j_t\big(s', a^j, \bar{a}^j\big)\Big]. \tag{90}$$
The mean action $\bar{a}^j$ depends on the actions $a^k$ of the neighbours $k \in \mathcal{N}(j)$, each of which in turn depends on its own mean action. This chicken-and-egg problem is essentially the time inconsistency that also occurs in MFC. To avoid the coupling between $a^j$ and $\bar{a}^j$, Yang et al. (2018b) proposed a filtration $\{Q^j_t\}$ such that in each stage game the mean action $\bar{a}^j$ is computed first using each agent's current policy, i.e., $\bar{a}^j = \frac{1}{N^j}\sum_k a^k,\ a^k \sim \pi^k_t$, and then, given $\bar{a}^j$, each agent finds

the best response by

$$\pi^j_t\big(a^j \mid s, \bar{a}^j\big) = \frac{\exp\big(\beta\, Q^j_t(s, a^j, \bar{a}^j)\big)}{\sum_{a^{j\prime}\in A^j}\exp\big(\beta\, Q^j_t(s, a^{j\prime}, \bar{a}^j)\big)}. \tag{91}$$
For large $\beta$, the Boltzmann policy in Eq. (91) proves to be a contraction mapping, which means the optimal action $a^j$ is unique given $\bar{a}^j$; therefore, the chicken-and-egg problem is resolved (see footnote 62). MF-Q can be regarded as a modification of the Nash-Q learning algorithm (Hu and Wellman, 2003), with the solution concept changed from NE to mean-field NE (see the definition in the MFG section). As a result, under the same conditions, which include the strong assumption that there exists a unique NE in every stage game encountered, the operator $\mathcal{H}^{\mathrm{MF}} Q(s, \boldsymbol{a}) = \mathbb{E}_{s'\sim p}\big[R(s, \boldsymbol{a}) + \gamma\, v^{\mathrm{MF}}(s')\big]$ proves to be a contraction. Furthermore, the asymptotic convergence of the MF-Q learning update in Eq. (90) has also been established.

Considering only pairwise interactions in MF-Q may appear rather limited. However, it has been noted that the pairwise approximation between the agent and its neighbours, while significantly reducing the complexity of the interactions among agents, can still preserve global interactions between any pair of agents (Blume, 1993). In fact, such an approach is widely adopted in other machine learning domains, for example, factorisation machines (Rendle, 2010) and learning to rank (Cao et al., 2007). Based on MF-Q, Li et al. (2019a) solved the real-world taxi order dispatching task for Uber China and demonstrated strong empirical performance against humans. Subramanian and Mahajan (2019) extended MF-Q to include multiple types of agents and applied the method to a large-scale predator-prey simulation scenario. Ganapathi Subramanian et al. (2020) further relaxed the assumption that agents have access to exact cumulative metrics regarding the mean-field behaviour of the system, and proposed a partially observable MF-Q that maintains a distribution to model the uncertainty regarding the mean field of the system.

62 Coincidentally, the techniques of fixing the mean-field term first and adopting a Boltzmann policy for each agent were discovered by Guo et al. (2019) in learning MFGs at the same time.
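To ground the update rule, the following sketch (a simplified, stateless variant with an assumed toy reward and a discretised mean action; not the experimental setting of Yang et al. (2018b)) runs the MF-Q update of Eq. (90) together with the Boltzmann policy of Eq. (91) for agents on a ring who are rewarded for matching their neighbours' mean action.

```python
import numpy as np

N, nA, beta, alpha, gamma = 20, 2, 3.0, 0.1, 0.9
rng = np.random.default_rng(0)

# Q[j, a, k]: agent j's mean-field Q-value for its own action a when the
# (discretised) neighbour mean action falls into bin k.
Q = np.zeros((N, nA, nA))

def boltzmann(q_row):
    """Eq. (91): softmax over own actions, given the current mean-action bin."""
    z = np.exp(beta * (q_row - q_row.max()))
    return z / z.sum()

actions = rng.integers(0, nA, size=N)
for _ in range(3000):
    # mean action of each agent's two ring neighbours, discretised to a bin
    bins = np.rint((np.roll(actions, 1) + np.roll(actions, -1)) / 2.0).astype(int)

    # each agent samples its next action from the Boltzmann policy for its bin
    new_actions = np.array([rng.choice(nA, p=boltzmann(Q[j, :, bins[j]]))
                            for j in range(N)])

    # neighbour mean action after everyone has acted, and the matching reward
    next_bins = np.rint((np.roll(new_actions, 1)
                         + np.roll(new_actions, -1)) / 2.0).astype(int)
    rewards = (new_actions == next_bins).astype(float)

    # Eq. (90), stateless version: bootstrap on the Boltzmann-weighted value
    # v^{MF} evaluated at the new mean-action bin.
    for j in range(N):
        pi_next = boltzmann(Q[j, :, next_bins[j]])
        v_mf = pi_next @ Q[j, :, next_bins[j]]
        td_target = rewards[j] + gamma * v_mf
        Q[j, new_actions[j], bins[j]] += alpha * (td_target
                                                  - Q[j, new_actions[j], bins[j]])
    actions = new_actions

print("final joint action:", actions)
```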

10 Future Directions of Interest

MARL Theory. In contrast to the remarkable empirical success of MARL methods, the theoretical understanding of MARL techniques is very much under-explored in the literature. Although many early works studied the convergence properties and finite-sample bounds of single-agent RL algorithms (Bertsekas and Tsitsiklis, 1996), extending those results to multi-agent, even many-agent, settings is non-trivial. Furthermore, it has become common practice nowadays to use DNNs to represent value functions in RL and multi-agent RL. In fact, many recent remarkable successes of multi-agent RL benefit from the success of deep learning techniques (Baker et al., 2019b; Pachocki et al., 2018; Vinyals et al., 2019b). Therefore, there is a pressing need to develop theories that can explain and offer insights into the effectiveness of deep MARL methods. Overall, I believe there is an approximately ten-year gap between the theoretical developments of single-agent RL and multi-agent RL algorithms. Learning the lessons from single-agent RL theory and extending them to multi-agent settings, especially understanding the additional difficulty incurred by involving multiple agents, and then generalising the theoretical results to include DNNs, could serve as a practical road map for developing MARL theories. Along this thread, I recommend the work of Zhang et al. (2019b) for a comprehensive summary of existing MARL algorithms that come with theoretical convergence guarantees.

Safe and Robust MARL. Although RL provides a general framework for optimal decision making, it has to incorporate certain types of constraints before RL models can truly be deployed in real-world environments. I believe it is critical to first account for MARL with robustness and safety constraints; one direct example is autonomous driving. At a very high level, robustness refers to the property that an algorithm can generalise and maintain its performance in settings that differ from the training environment (Abdullah et al., 2019; Morimoto and Doya, 2005), and safety refers to the property that an algorithm acts only within a pre-defined safety zone, with a minimum number of violations, even during training (García and Fernández, 2015). In fact, the community is still at an early stage of developing theoretical frameworks that encompass either robustness or safety constraints in single-agent settings. In the multi-agent setting, the problem only becomes more challenging because the solution now needs to take into account the coupling effects between agents, especially those agents that have conflicting

interests (Li et al., 2019b). In addition to opponents, one should also consider robustness to the uncertainty of the environmental dynamics (Zhang et al., 2020), which in turn changes the behaviours of opponents and poses an even more significant challenge.

Model-Based MARL. Most of the algorithms I have introduced in this monograph are model-free, in the sense that the RL agent does not need to know how the environment works and can learn to behave optimally purely through interacting with the environment. In the classic control domain, model-based approaches have been extensively studied: the learning agent first builds an explicit state-space "model" of how the environment works, in terms of state-transition dynamics and the reward function, and then learns from that "model". The benefit of model-based algorithms lies in the fact that they often require far fewer data samples from the environment (Deisenroth and Rasmussen, 2011). The MARL community did propose model-based approaches, for example the famous R-MAX algorithm (Brafman and Tennenholtz, 2002), nearly two decades ago. Surprisingly, developments along the model-based thread have largely halted since then. Given the impressive results that model-based approaches have demonstrated on single-agent RL tasks (Hafner et al., 2019a,b; Schrittwieser et al., 2020), model-based MARL approaches deserve more attention from the community.

Multi-Agent Meta-RL. Throughout this monograph, I have introduced many MARL applications; each task needs a bespoke MARL model to solve it. A natural question to ask is whether we can use one model that generalises across multiple tasks. For example, Terry et al. (2020) have put together almost one hundred MARL tasks, including Atari, , and various kinds of board games and poker, into a Gym API. An ambitious goal is to develop algorithms that can solve all of these tasks in one or a few shots. This requires multi-agent meta-RL techniques. Meta-learning aims to train a generalised model on a variety of learning tasks such that it can solve new learning tasks with few or no additional training samples. Fortunately, Finn et al. (2017) proposed a general meta-learning framework, MAML, that is compatible with any model trained with gradient-descent-based methods. Although MAML works well on supervised learning tasks, developing meta-RL algorithms seems to be highly non-trivial (Rothfuss et al., 2018), and introducing the meta-learning framework on top of MARL is

still uncharted territory. I expect multi-agent meta-RL to be a challenging yet fruitful research topic, since making a group of agents master multiple games necessarily requires the agents to automatically discover their identities and roles when playing different games; this in itself is a hot research topic. Besides, the meta-learner in the outer loop would need to figure out how to compute gradients with respect to the entire inner-loop subroutine, which must be a MARL algorithm such as a multi-agent policy-gradient method or mean-field Q-learning, and this would probably lead to exciting enhancements to the existing meta-learning framework.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019). Politex: Regret bounds for policy iteration using expert prediction. In Inter- national Conference on Machine Learning, pages 3692–3702.

Abdallah, S. and Lesser, V. (2008). A multiagent reinforcement learning algorithm with non-linear dynamics. Journal of Artificial Intelligence Research, 33:521–549.

Abdullah, M. A., Ren, H., Ammar, H. B., Milenkovic, V., Luo, R., Zhang, M., and Wang, J. (2019). Wasserstein robust reinforcement learning. arXiv preprint arXiv:1907.13196.

Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.

Adler, J. L. and Blue, V. J. (2002). A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(5-6):433–454.

Adolphs, L., Daneshmand, H., Lucchi, A., and Hofmann, T. (2019). Local saddle point optimization: A curvature exploitation approach. In The 22nd International Confer- ence on Artificial Intelligence and Statistics, pages 486–495.

Al-Tamimi, A., Lewis, F. L., and Abu-Khalaf, M. (2007). Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control. Automatica, 43(3):473–481.

Amato, C., Bernstein, D. S., and Zilberstein, S. (2010). Optimizing fixed-size stochastic controllers for pomdps and decentralized pomdps. Autonomous Agents and Multi-Agent Systems, 21(3):293–320.

Anahtarcı, B., Karıksız, C. D., and Saldi, N. (2019). Fitted q-learning in mean-field games. arXiv preprint arXiv:1912.13309.

Anahtarci, B., Kariksiz, C. D., and Saldi, N. (2020). Q-learning in regularized mean-field games. arXiv preprint arXiv:2003.12151.

Andersson, D. and Djehiche, B. (2011). A maximum principle for sdes of mean-field type. Applied Mathematics & Optimization, 63(3):341–356.

Arslan, G. and Y¨uksel,S. (2016). Decentralized q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.

Auer, P., Jaksch, T., and Ortner, R. (2009). Near-optimal regret bounds for reinforcement learning. In Advances in neural information processing systems, pages 89–96.

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272.

Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., and Mordatch, I. (2019a). Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations.

Baker, B., Kanitscheider, I., Markov, T. M., Wu, Y., Powell, G., McGrew, B., and Mordatch, I. (2019b). Emergent tool use from multi-agent autocurricula. CoRR, abs/1909.07528.

Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Pérolat, J., Jaderberg, M., and Graepel, T. (2019). Open-ended learning in symmetric zero-sum games. In ICML, volume 97, pages 434–443. PMLR.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. (2018a). The mechanics of n-player differentiable games. In ICML, volume 80, pages 363–372. JMLR.org.

Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. (2018b). Re-evaluating evaluation. In Advances in Neural Information Processing Systems, pages 3268–3279.

Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716.

Benaïm, M. and Faure, M. (2013). Consistency of vanishingly smooth fictitious play. Mathematics of Operations Research, 38(3):437–450.

Benaïm, M. and Hirsch, M. W. (1999). Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72.

Benders, J. (1962). Partitioning procedures for solving mixed-variable programming problems. Numerische Mathematik, 4.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers Inc.

Bensoussan, A., Frehse, J., Yam, P., et al. (2013). Mean field games and mean field type control theory, volume 101. Springer.

Berger, U. (2007). Brown's original fictitious play. Journal of Economic Theory, 135(1):572–578.

Bernstein, D. S., Amato, C., Hansen, E. A., and Zilberstein, S. (2009). Policy iteration for decentralized control of markov decision processes. Journal of Artificial Intelligence Research, 34:89–132.

Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of decentralized control of markov decision processes. Mathematics of operations research, 27(4):819–840.

Bertsekas, D. P. (2005). The dynamic programming algorithm. Dynamic Programming and Optimal Control; Athena Scientific: Nashua, NH, USA, pages 2–51.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.

Billings, D., Burch, N., Davidson, A., Holte, R., Schaeffer, J., Schauenberg, T., and Szafron, D. (2003). Approximating game-theoretic optimal strategies for full-scale poker. In IJCAI, volume 3, page 661.

Blackwell, D. et al. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877.

Bloembergen, D., Tuyls, K., Hennes, D., and Kaisers, M. (2015). Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659– 697.

Blume, L. E. (1993). The statistical mechanics of strategic interaction. Games and economic behavior, 5(3):387–424.

Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294.

Borkar, V. S. (2002). Reinforcement learning in markovian evolutionary games. Advances in Complex Systems, 5(01):55–72.

Boutilier, C., Dean, T., and Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94.

Bowling, M. (2000). Convergence problems of general-sum multiagent reinforcement learning. In ICML, pages 89–94.

Bowling, M. (2005). Convergence and no-regret in multiagent learning. In Advances in neural information processing systems, pages 209–216.

Bowling, M. and Veloso, M. (2000). An analysis of stochastic game theory for multiagent reinforcement learning. Technical report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Science.

Bowling, M. and Veloso, M. (2001). Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence, volume 17, pages 1021–1026. Lawrence Erlbaum Associates Ltd.

Bowling, M. and Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250.

Brafman, R. I. and Tennenholtz, M. (2002). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231.

Breton, M., Filar, J. A., Haurle, A., and Schultz, T. A. (1986). On the computation of equilibria in discounted stochastic dynamic games. In Dynamic games and applications in economics, pages 64–87. Springer.

Brown, N., Kroer, C., and Sandholm, T. (2017). Dynamic thresholding and pruning for regret minimization. In AAAI, pages 421–429.

Brown, N., Lerer, A., Gross, S., and Sandholm, T. (2019). Deep counterfactual regret minimization. In International Conference on Machine Learning, pages 793–802.

Brown, N. and Sandholm, T. (2015). Regret-based pruning in extensive-form games. In Advances in Neural Information Processing Systems, pages 1972–1980.

Brown, N. and Sandholm, T. (2017). Reduced space and faster convergence in imperfect- information games via pruning. In International conference on machine learning, pages 596–604.

Brown, N. and Sandholm, T. (2018). Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424.

Brown, N. and Sandholm, T. (2019). Superhuman ai for multiplayer poker. Science, 365(6456):885–890.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43.

Bu, J., Ratliff, L. J., and Mesbahi, M. (2019). Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games. arXiv, pages arXiv–1911.

Buckdahn, R., Djehiche, B., and Li, J. (2011). A general stochastic maximum principle for sdes of mean-field type. Applied Mathematics & Optimization, 64(2):197–216.

Burch, N., Lanctot, M., Szafron, D., and Gibson, R. G. (2012). Efficient monte carlo counterfactual regret minimization in games with many player actions. In Advances in Neural Information Processing Systems, pages 1880–1888.

Bu¸soniu,L., Babuˇska, R., and De Schutter, B. (2010). Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1, pages 183–221. Springer.

Cai, Q., Yang, Z., Jin, C., and Wang, Z. (2019a). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.

Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019b). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11315–11326.

Camerer, C. F., Ho, T.-H., and Chong, J.-K. (2002). Sophisticated experience-weighted attraction learning and strategic teaching in repeated games. Journal of Economic theory, 104(1):137–188.

Camerer, C. F., Ho, T.-H., and Chong, J.-K. (2004). A cognitive hierarchy model of games. The Quarterly Journal of Economics.

Campos-Rodriguez, R., Gonzalez-Jimenez, L., Cervantes-Alvarez, F., Amezcua-Garcia, F., and Fernandez-Garcia, M. (2017). Multiagent systems in automotive applications. Multi-agent Systems, page 43.

Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A. (2011). Flows and decompo- sitions of games: Harmonic and potential games. Mathematics of Operations Research, 36(3):474–503.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pages 129–136. ACM.

Cardaliaguet, P., Delarue, F., Lasry, J.-M., and Lions, P.-L. (2015). The master equation and the convergence problem in mean field games. arXiv preprint arXiv:1509.02505.

Cardaliaguet, P. and Hadikhanloo, S. (2017). Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591.

Cardaliaguet, P. and Lehalle, C.-A. (2018). Mean field game of controls and an application to trade crowding. Mathematics and Financial Economics, 12(3):335–363.

Carmona, R. and Delarue, F. (2013). Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization, 51(4):2705–2734.

Carmona, R., Delarue, F., et al. (2015a). Forward–backward stochastic differential equa- tions and controlled mckean–vlasov dynamics. The Annals of Probability, 43(5):2647– 2700.

Carmona, R., Delarue, F., et al. (2018). Probabilistic Theory of Mean Field Games with Applications I-II. Springer.

Carmona, R., Delarue, F., and Lachapelle, A. (2013). Control of mckean–vlasov dynamics versus mean field games. Mathematics and Financial Economics, 7(2):131–166.

Carmona, R., Delarue, F., Lacker, D., et al. (2016). Mean field games with common noise. The Annals of Probability, 44(6):3740–3803.

Carmona, R., Lacker, D., et al. (2015b). A probabilistic weak formulation of mean field games and applications. The Annals of Applied Probability, 25(3):1189–1231.

Carmona, R., Laurière, M., and Tan, Z. (2019a). Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv preprint arXiv:1910.04295.

Carmona, R., Laurière, M., and Tan, Z. (2019b). Model-free mean-field reinforcement learning: mean-field mdp and mean-field q-learning. arXiv preprint arXiv:1910.12802.

Cecchin, A., Pra, P. D., Fischer, M., and Pelino, G. (2019). On the convergence problem in mean field games: a two state model without uniqueness. SIAM Journal on Control and Optimization, 57(4):2443–2466.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.

Chalmers, C. (2020). Is reinforcement learning worth the hype? 2020. URL https://www.capgemini.com/gb-en/2020/05/is-reinforcement-learning-worth-the- hype/.

Chasnov, B., Ratliff, L. J., Mazumdar, E., and Burden, S. A. (2019). Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings. arXiv preprint arXiv:1906.00731.

Chen, X. and Deng, X. (2006). Settling the complexity of two-player nash equilibrium. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 261–272. IEEE.

Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2020). Reinforcement learning for non- stationary markov decision processes: The blessing of (more) optimism. ICML.

Claus, C. and Boutilier, C. (1998a). The dynamics of reinforcement learning in coopera- tive multiagent systems. AAAI/IAAI, 1998:746–752.

Claus, C. and Boutilier, C. (1998b). The dynamics of reinforcement learning in coopera- tive multiagent systems. AAAI/IAAI, 1998:746–752.

Conitzer, V. and Sandholm, T. (2002). Complexity results about nash equilibria. arXiv preprint cs/0205074.

Conitzer, V. and Sandholm, T. (2007). Awesome: A general multiagent learning algo- rithm that converges in self-play and learns a best response against stationary oppo- nents. Machine Learning, 67(1-2):23–43.

Conitzer, V. and Sandholm, T. (2008). New complexity results about nash equilibria. Games and Economic Behavior, 63(2):621–641.

Coricelli, G. and Nagel, R. (2009). Neural correlates of depth of strategic reasoning in me- dial prefrontal cortex. Proceedings of the National Academy of Sciences, 106(23):9163– 9168.

Cowling, P. I., Powley, E. J., and Whitehouse, D. (2012). Information set monte carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2):120–143.

Da Silva, F. L. and Costa, A. H. R. (2019). A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research, 64:645–703.

Dall’Anese, E., Zhu, H., and Giannakis, G. B. (2013). Distributed optimal power flow for smart microgrids. IEEE Transactions on Smart Grid, 4(3):1464–1475.

Dantzig, G. (1951). A proof of the equivalence of the programming problem and the game problem, in “activity analysis of production and allocation”(ed. tc koopmans), cowles commission monograph, no. 13.

Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. (2009). The complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. (2017). Training gans with opti- mism. arXiv, pages arXiv–1711.

Daskalakis, C. and Panageas, I. (2018). The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246.

Daskalakis, C. and Papadimitriou, C. H. (2005). Three-player games are hard. In Elec- tronic colloquium on computational complexity, volume 139, pages 81–87.

Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472.

Delarue, F. and Tchuendom, R. F. (2020). Selection of equilibria in a linear quadratic mean-field game. Stochastic Processes and their Applications, 130(2):1000–1040.

Derakhshan, F. and Yousefi, S. (2019). A review on the applications of multiagent systems in wireless sensor networks. International Journal of Distributed Sensor Networks, 15(5):1550147719850767.

Dermed, L. M. and Isbell, C. L. (2009). Solving stochastic games. In Advances in Neural Information Processing Systems, pages 1186–1194.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dibangoye, J. and Buffet, O. (2018). Learning to act in decentralized partially observable mdps. In International Conference on Machine Learning, pages 1233–1242.

Dibangoye, J. S., Amato, C., Buffet, O., and Charpillet, F. (2016). Optimally solving dec- pomdps as continuous-state mdps. Journal of Artificial Intelligence Research, 55:443– 497.

Dick, T., Gyorgy, A., and Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520.

Djete, M. F., Possamaï, D., and Tan, X. (2019). Mckean-vlasov optimal control: the dynamic programming principle. arXiv preprint arXiv:1907.08860.

Elie, R., Pérolat, J., Laurière, M., Geist, M., and Pietquin, O. (2019). Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633.

Elie, R., Pérolat, J., Laurière, M., Geist, M., and Pietquin, O. (2020). On the convergence of model free learning in mean field games. In AAAI, pages 7143–7150.

Erev, I. and Roth, A. E. (1998). Predicting how people play games: Reinforcement learn- ing in experimental games with unique, mixed strategy equilibria. American economic review, pages 848–881.

Even-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a markov decision process. In Advances in neural information processing systems, pages 401–408.

Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3):726–736.

Even-Dar, E. and Mansour, Y. (2003). Learning rates for q-learning. Journal of machine learning Research, 5(Dec):1–25.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476.

Feinberg, E. A. (2010). Total expected discounted reward mdps: existence of optimal policies. Wiley Encyclopedia of Operations Research and Management Science.

Fiez, T., Chasnov, B., and Ratliff, L. J. (2019). Convergence of learning dynamics in stackelberg games. arXiv, pages arXiv–1906.

Filar, J. and Vrieze, K. (2012). Competitive Markov decision processes. Springer Science & Business Media.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.

Fischer, M. et al. (2017). On the connection between symmetric n-player games and mean field games. The Annals of Applied Probability, 27(2):757–810.

Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. (2018a). Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122– 130.

Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2017a). Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.

Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., and Whiteson, S. (2017b). Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1146–1155.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018b). Coun- terfactual multi-agent policy gradients. In McIlraith, S. A. and Weinberger, K. Q., editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139.

Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2019). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498.

Fudenberg, D., Drew, F., Levine, D. K., and Levine, D. K. (1998). The theory of learning in games, volume 2. MIT press.

Fudenberg, D. and Kreps, D. M. (1993). Learning mixed equilibria. Games and economic behavior, 5(3):320–367.

Fudenberg, D. and Levine, D. (1995). Consistency and cautious fictitious play. Journal of Economic Dynamics and Control.

Ganapathi Subramanian, S., Taylor, M. E., Crowley, M., and Poupart, P. (2020). Partially observable mean field reinforcement learning. arXiv e-prints, pages arXiv–2012.

García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480.

Gärtner, J. (1988). On the mckean-vlasov limit for interacting diffusions. Mathematische Nachrichten, 137(1):197–248.

Gasser, R. and Huhns, M. N. (2014). Distributed Artificial Intelligence: Volume II, volume 2. Morgan Kaufmann.

Gibson, R. G., Lanctot, M., Burch, N., Szafron, D., and Bowling, M. (2012). Generalized sampling and variance in counterfactual regret minimization. In AAAI.

Gigerenzer, G. and Selten, R. (2002). Bounded rationality: The adaptive toolbox. MIT press.

Gilpin, A., Hoda, S., Pena, J., and Sandholm, T. (2007). Gradient-based algorithms for finding nash equilibria in extensive form games. In International Workshop on Web and Internet Economics, pages 57–69. Springer.

Gilpin, A. and Sandholm, T. (2006). Finding equilibria in large sequential games of imperfect information. In Proceedings of the 7th ACM conference on Electronic commerce, pages 160–169.

González-Sánchez, D. and Hernández-Lerma, O. (2013). Discrete-time stochastic control and dynamic potential games: the Euler-Equation approach. Springer Science & Business Media.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014a). Generative adversarial nets. In NIPS, pages 2672–2680.

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Gordon, G. J. (2007). No-regret algorithms for online convex programs. In Advances in Neural Information Processing Systems, pages 489–496.

Grau-Moya, J., Leibfried, F., and Bou-Ammar, H. (2018). Balancing two-player stochastic games with soft q-learning. IJCAI.

Greenwald, A., Hall, K., and Serrano, R. (2003). Correlated q-learning. In ICML, volume 20, page 242.

Gu, H., Guo, X., Wei, X., and Xu, R. (2019). Dynamic programming principles for learning mfcs. arXiv preprint arXiv:1911.07314.

Gu, H., Guo, X., Wei, X., and Xu, R. (2020). Q-learning for mean-field controls. arXiv preprint arXiv:2002.04131.

Guéant, O., Lasry, J.-M., and Lions, P.-L. (2011). Mean field games and applications. In Paris-Princeton lectures on mathematical finance 2010, pages 205–266. Springer.

Guestrin, C., Koller, D., and Parr, R. (2002a). Multiagent planning with factored mdps. In Advances in neural information processing systems, pages 1523–1530.

Guestrin, C., Lagoudakis, M., and Parr, R. (2002b). Coordinated reinforcement learning. In ICML, volume 2, pages 227–234. Citeseer.

Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning mean-field games. In Advances in Neural Information Processing Systems, pages 4966–4976.

Guo, X., Hu, A., Xu, R., and Zhang, J. (2020). A general framework for learning mean-field games. arXiv preprint arXiv:2003.06069.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870.

Hadikhanloo, S. and Silva, F. J. (2019). Finite mean field games: fictitious play and convergence to a first order continuous mean field game. Journal de Mathématiques Pures et Appliquées, 132:369–397.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2019a). Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2019b). Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR.

Hannan, J. (1957). Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139.

Hansen, N., M¨uller, S. D., and Koumoutsakos, P. (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 11(1):1–18.

Hansen, T. D., Miltersen, P. B., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16.

Hart, S. (2013). Simple adaptive strategies: from regret-matching to uncoupled dynamics, volume 4. World Scientific.

Hart, S. and Mas-Colell, A. (2001). A reinforcement procedure leading to correlated equilibrium. In Economics Essays, pages 181–200. Springer.

Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious self-play in extensive-form games. In ICML, pages 805–813.

Heinrich, J. and Silver, D. (2016). Deep reinforcement learning from self-play in imperfect- information games. arXiv preprint arXiv:1603.01121.

Hennes, D., Morrill, D., Omidshafiei, S., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.-B., Parmas, P., Duenez-Guzman, E., et al. (2019). Neural replicator dynamics. arXiv preprint arXiv:1906.00190.

Herings, P. J.-J. and Peeters, R. (2010). Homotopy methods to compute equilibria in game theory. Economic Theory, 42(1):119–156.

Herings, P. J.-J., Peeters, R. J., et al. (2004). Stationary equilibria in stochastic games: Structure, selection, and computation. Journal of Economic Theory, 118(1):32–60.

Hernandez-Leal, P., Kaisers, M., Baarslag, T., and de Cote, E. M. (2017). A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183.

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2019). A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33(6):750–797.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637.

Hirsch, M. W. (2012). Differential topology, volume 33. Springer Science & Business Media.

Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294.

Hu, J. and Wellman, M. P. (2003). Nash q-learning for general-sum stochastic games. Journal of Machine learning research, 4(Nov):1039–1069.

Hu, J., Wellman, M. P., et al. (1998). Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250.

Hu, K., Ren, Z., Siska, D., and Szpruch, L. (2019). Mean-field langevin dynamics and energy landscape of neural networks. arXiv preprint arXiv:1905.07769.

Huang, M., Malhamé, R. P., Caines, P. E., et al. (2006). Large population stochastic dynamic games: closed-loop mckean-vlasov systems and the nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252.

Huhns, M. N. (2012). Distributed Artificial Intelligence: Volume I, volume 1. Elsevier.

Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.

Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865.

Jan’t Hoen, P., Tuyls, K., Panait, L., Luke, S., and La Poutre, J. A. (2005). An overview of cooperative and competitive multiagent learning. In International Workshop on Learning and Adaption in Multi-Agent Systems, pages 1–46. Springer.

Jia, Z., Yang, L. F., and Wang, M. (2019). Feature-based q-learning for two-player stochastic games. arXiv, pages arXiv–1906.

Jin, C., Netrapalli, P., and Jordan, M. I. (2019). What is local optimality in nonconvex- nonconcave minimax optimization? arXiv, pages arXiv–1902.

Jin, P., Keutzer, K., and Levine, S. (2018). Regret minimization for partially observable deep reinforcement learning. In International Conference on Machine Learning, pages 2342–2351.

Johanson, M., Bard, N., Burch, N., and Bowling, M. (2012a). Finding optimal abstract strategies in extensive-form games. In AAAI. Citeseer.

Johanson, M., Bard, N., Lanctot, M., Gibson, R. G., and Bowling, M. (2012b). Efficient nash equilibrium approximation through monte carlo counterfactual regret minimization. In AAMAS, pages 837–846.

Jovanovic, B. and Rosenthal, R. W. (1988). Anonymous sequential games. Journal of Mathematical Economics, 17(1):77–87.

Kadanoff, L. P. (2009). More is the same; phase transitions and mean field theories. Journal of Statistical Physics, 137(5-6):777.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285.

Kaniovski, Y. M. and Young, H. P. (1995). Learning dynamics in games with stochastic perturbations. Games and economic behavior, 11(2):330–363.

Kearns, M. (2007). Graphical games. Algorithmic game theory, 3:159–180.

Kearns, M., Littman, M. L., and Singh, S. (2013). Graphical models for game theory. arXiv preprint arXiv:1301.2281.

Kennedy, J. (2006). Swarm intelligence. In Handbook of nature-inspired and innovative computing, pages 187–219. Springer.

Keynes, J. M. (1936). The General Theory of Employment, Interest and Money. Macmil- lan. 14th edition, 1973.

Klopf, A. H. (1972). Brain function and adaptive systems: a heterostatic theory. Num- ber 133. Air Force Cambridge Research Laboratories, Air Force Systems Command, United . . . .

Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.

Kocsis, L. and Szepesv´ari,C. (2006). Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer.

Kok, J. R. and Vlassis, N. (2004). Sparse cooperative q-learning. In Proceedings of the twenty-first international conference on Machine learning, page 61.

Koller, D. and Megiddo, N. (1992). The complexity of two-person zero-sum games in extensive form. Games and economic behavior, 4(4):528–552.

Koller, D. and Megiddo, N. (1996). Finding mixed strategies with small supports in extensive form games. International Journal of Game Theory, 25(1):73–92.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014.

Kong, W. and Monteiro, R. D. (2019). An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. arXiv, pages arXiv–1905.

Kovařík, V., Schmid, M., Burch, N., Bowling, M., and Lisý, V. (2019). Rethinking formal models of partially observable multiagent decision making. arXiv preprint arXiv:1906.11110.

Kreps, D. M. and Wilson, R. (1982). Reputation and imperfect information. Journal of economic theory, 27(2):253–279.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

Kuhn, H. W. (1950a). Extensive games. Proceedings of the National Academy of Sciences of the United States of America, 36(10):570.

Kuhn, H. W. (1950b). A simplified two-person poker. Contributions to the Theory of Games, 1:97–103.

Kulesza, A., Taskar, B., et al. (2012). Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286.

Lã, Q. D., Chew, Y. H., and Soong, B.-H. (2016). Potential Game Theory. Springer.

Lacker, D. (2015). Mean field games via controlled martingale problems: existence of markovian equilibria. Stochastic Processes and their Applications, 125(7):2856–2894.

Lacker, D. (2017). Limit theory for controlled mckean–vlasov dynamics. SIAM Journal on Control and Optimization, 55(3):1641–1672.

Lacker, D. (2018). On the convergence of closed-loop nash equilibria to the mean field game limit. arXiv preprint arXiv:1808.02745.

Lacker, D. and Zariphopoulou, T. (2019). Mean field and n-agent games for optimal investment under relative performance criteria. Mathematical Finance, 29(4):1003– 1038.

Lagoudakis, M. G. and Parr, R. (2003). Learning in zero-sum team markov games using factored value functions. In Advances in Neural Information Processing Systems, pages 1659–1666.

Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. (2009). Monte carlo sampling for regret minimization in extensive games. In Advances in neural information processing systems, pages 1078–1086.

Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., P´erolat,J., Silver, D., and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Advances in neural information processing systems, pages 4190–4203.

Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer.

Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese journal of mathematics, 2(1):229–260.

Lauer, M. and Riedmiller, M. (2000). An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In ICML. Citeseer.

Laurière, M. and Pironneau, O. (2014). Dynamic programming for mean-field type control. Comptes Rendus Mathematique, 352(9):707–713.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–444.

Lehalle, C.-A. and Mouzouni, C. (2019). A mean field game of portfolio trading and its consequences on perceived correlations. arXiv preprint arXiv:1902.09606.

Lemke, C. E. and Howson, Jr, J. T. (1964). Equilibrium points of bimatrix games. Journal of the Society for Industrial and Applied Mathematics, 12(2):413–423.

Leslie, D. S., Collins, E., et al. (2003). Convergent multiple-timescales reinforce- ment learning algorithms in normal form games. The Annals of Applied Probability, 13(4):1231–1251.

Leslie, D. S. and Collins, E. J. (2005). Individual q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514.

Leslie, D. S. and Collins, E. J. (2006). Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298.

Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.

Leyton-Brown, K. and Tennenholtz, M. (2005). Local-effect games. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum f¨urInformatik.

Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G., and Ye, J. (2019a). Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pages 983–994.

Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., and Russell, S. (2019b). Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4213–4220.

Li, Y. (2017). Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274.

Li, Z. and Tewari, A. (2018). Sampled fictitious play is hannan consistent. Games and Economic Behavior, 109:401–412.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, T., Jin, C., and Jordan, M. I. (2019). On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331.

Lisy, V., Kovarik, V., Lanctot, M., and Bosansky, B. (2013). Convergence of monte carlo tree search in simultaneous move games. In Advances in Neural Information Processing Systems, pages 2112–2120.

Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Informa- tion and computation, 108(2):212–261.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the eleventh international conference on machine learning, volume 157, pages 157–163.

Littman, M. L. (2001a). Friend-or-foe q-learning in general-sum games. In ICML, vol- ume 1, pages 322–328.

Littman, M. L. (2001b). Value-function reinforcement learning in markov games. Cogni- tive Systems Research, 2(1):55–66.

Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.

Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.-B., Morrill, D., Timbers, F., and Tuyls, K. (2019). Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pages 6382–6393.

Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. (2020a). Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing.

Lu, Y., Ma, C., Lu, Y., Lu, J., and Ying, L. (2020b). A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth. arXiv preprint arXiv:2003.05508.

Luo, Y., Yang, Z., Wang, Z., and Kolar, M. (2019). Natural actor-critic converges globally for hierarchical linear quadratic regulator. arXiv preprint arXiv:1912.06875.

Macua, S. V., Zazo, J., and Zazo, S. (2018). Learning parametric closed-loop policies for markov potential games. In International Conference on Learning Representations.

Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine learning, 22(1-3):159–195.

Mannor, S. and Shimkin, N. (2003). The empirical bayes envelope and regret minimization in competitive markov decision processes. Mathematics of Operations Research, 28(2):327–345.

Maskin, E. and Tirole, J. (2001). Markov perfect equilibrium: I. observable actions. Journal of Economic Theory, 100(2):191–219.

Matignon, L., Laurent, G. J., and Le Fort-Piat, N. (2012). Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31.

Matousek, J. and Gärtner, B. (2007). Understanding and using linear programming. Springer Science & Business Media.

Maynard Smith, J. (1972). On evolution.

Mazumdar, E. and Ratliff, L. J. (2018). On the convergence of gradient-based learning in continuous games. arXiv, pages arXiv–1804.

Mazumdar, E., Ratliff, L. J., Sastry, S., and Jordan, M. I. (2019a). Policy gradient in linear quadratic dynamic games has no convergence guarantees. Smooth Games Optimization and Machine Learning Workshop: Bridging Game . . . .

Mazumdar, E. V., Jordan, M. I., and Sastry, S. S. (2019b). On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838.

McKean, H. P. (1967). Propagation of chaos for a class of non-linear parabolic equations. Stochastic Differential Equations (Lecture Series in Differential Equations, Session 7, Catholic Univ., 1967), pages 41–57.

McMahan, H. B., Gordon, G. J., and Blum, A. (2003). Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543.

Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. (2018). Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations.

Mertikopoulos, P. and Zhou, Z. (2019). Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507.

Mescheder, L., Geiger, A., and Nowozin, S. (2018). Which training methods for gans do actually converge? arXiv preprint arXiv:1801.04406.

Mescheder, L., Nowozin, S., and Geiger, A. (2017). The numerics of gans. In Advances in Neural Information Processing Systems, pages 1825–1835.

Mguni, D. (2020). Stochastic potential games. arXiv preprint arXiv:2005.13527.

Michael, D. (2020). Algorithmic game theory lecture notes. http://www.cs.jhu.edu/~mdinitz/classes/AGT/Spring2020/Lectures/lecture6.pdf.

Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30.

Minsky, M. L. (1954). Theory of neural-analog reinforcement systems and its application to the brain model problem. Princeton University.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Monderer, D. and Shapley, L. S. (1996). Potential games. Games and economic behavior, 14(1):124–143.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. (2017). Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.

Morimoto, J. and Doya, K. (2005). Robust reinforcement learning. Neural computation, 17(2):335–359.

Motte, M. and Pham, H. (2019). Mean-field markov decision processes with common noise and open-loop controls. arXiv preprint arXiv:1912.07883.

Müller, J. P. and Fischer, K. (2014). Application impact of multi-agent systems and technologies: A survey. In Agent-oriented software engineering, pages 27–53. Springer.

Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al. (2019). A generalized training approach for multiagent learning. In International Conference on Learning Representations.

Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857.

Nagarajan, V. and Kolter, J. Z. (2017). Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, pages 5585–5595.

Nash, J. (1951). Non-cooperative games. Annals of mathematics, pages 286–295.

Nayyar, A., Mahajan, A., and Teneketzis, D. (2013). Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658.

Nedic, A., Olshevsky, A., and Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633.

Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization.

Neu, G., Antos, A., György, A., and Szepesvári, C. (2010). Online markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812.

Neu, G., Gyorgy, A., Szepesvari, C., and Antos, A. (2014). Online markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 3(59):676–691.

Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized markov decision processes. NIPS.

Neumann, J. v. (1928). Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295–320.

Nguyen, T. T., Nguyen, N. D., and Nahavandi, S. (2020). Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE transactions on cybernetics.

Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. (2019). Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14934–14942.

Nowé, A., Vrancx, P., and De Hauwere, Y.-M. (2012). Game theory and multi-agent reinforcement learning. In Reinforcement Learning, pages 441–470. Springer.

Nutz, M., San Martin, J., Tan, X., et al. (2020). Convergence to the mean field game limit: a case study. The Annals of Applied Probability, 30(1):259–286.

Oliehoek, F. A., Amato, C., et al. (2016). A concise introduction to decentralized POMDPs, volume 1. Springer.

Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. (2019). α-rank: Multi-agent evaluation by evolution.

Omidshafiei, S., Tuyls, K., Czarnecki, W. M., Santos, F. C., Rowland, M., Connor, J., Hennes, D., Muller, P., Perolat, J., De Vylder, B., et al. (2020). Navigating the landscape of games. arXiv preprint arXiv:2005.01642.

OroojlooyJadid, A. and Hajinezhad, D. (2019). A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963.

Ortner, R., Gajane, P., and Auer, P. (2020). Variational regret bounds for reinforcement learning. In Uncertainty in Artificial Intelligence, pages 81–90. PMLR.

Osborne, M. J. and Rubinstein, A. (1994). A course in game theory. MIT press.

Pachocki, J., Brockman, G., Raiman, J., Zhang, S., Pondé, H., Tang, J., Wolski, F., Dennison, C., Jozefowicz, R., Debiak, P., et al. (2018). Openai five, 2018. URL https://blog.openai.com/openai-five.

Panait, L. and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387–434.

Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450.

Papoudakis, G., Christianos, F., Schäfer, L., and Albrecht, S. V. (2020). Comparative evaluation of multi-agent deep reinforcement learning algorithms. arXiv preprint arXiv:2006.07869.

Perkins, S., Mertikopoulos, P., and Leslie, D. S. (2015). Mixed-strategy learning with continuous action sets. IEEE Transactions on Automatic Control, 62(1):379–384.

Perolat, J., Piot, B., and Pietquin, O. (2018). Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pages 919–928. PMLR.

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7-9):1180–1190.

Pham, H. and Wei, X. (2016). Discrete time mckean–vlasov control problem: a dynamic programming approach. Applied Mathematics & Optimization, 74(3):487–506.

Pham, H. and Wei, X. (2017). Dynamic programming for optimal control of stochastic mckean–vlasov dynamics. SIAM Journal on Control and Optimization, 55(2):1069–1101.

Pham, H. and Wei, X. (2018). Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM: Control, Optimisation and Calculus of Variations, 24(1):437–461.

Powers, R. and Shoham, Y. (2005a). Learning against opponents with bounded memory. In IJCAI, volume 5, pages 817–822.

Powers, R. and Shoham, Y. (2005b). New criteria and a new algorithm for learning in multi-agent systems. In Advances in neural information processing systems, pages 1089–1096.

Prasad, H., LA, P., and Bhatnagar, S. (2015). Two-timescale algorithms for learning nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371–1379.

Rafique, H., Liu, M., Lin, Q., and Yang, T. (2018). Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv, pages arXiv–1810.

Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and Whiteson, S. (2018). Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304.

Ratliff, L. J., Burden, S. A., and Sastry, S. S. (2013). Characterization and computation of local nash equilibria in continuous games. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 917–924. IEEE.

Ratliff, L. J., Burden, S. A., and Sastry, S. S. (2014). Genericity and structural stability of non-degenerate differential nash equilibria. In 2014 American Control Conference, pages 3990–3995. IEEE.

Rendle, S. (2010). Factorization machines. In 2010 IEEE International Conference on Data Mining, pages 995–1000. IEEE.

Riedmiller, M. (2005). Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113.

Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pages 5478–5486.

Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. (2018). Promp: Proximal meta-policy search. In International Conference on Learning Representations.

Saldi, N., Basar, T., and Raginsky, M. (2018). Markov–nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287.

Saldi, N., Başar, T., and Raginsky, M. (2019). Approximate nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research, 44(3):1006–1033.

Shafiei, M., Sturtevant, N., and Schaeffer, J. (2009). Comparing uct versus cfr in simultaneous games.

Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., and Bowling, M. (2019). Variance reduction in monte carlo counterfactual regret minimization (vr-mccfr) for extensive form games using baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2157–2164.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61:85–117.

Schoemaker, P. J. (2013). Experiments on decisions under risk: The expected utility hypothesis. Springer Science & Business Media.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. (2020). Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Selten, R. (1965). Spieltheoretische behandlung eines oligopolmodells mit nachfrageträgheit: Teil i: Bestimmung des dynamischen preisgleichgewichts. Zeitschrift für die gesamte Staatswissenschaft/Journal of Institutional and Theoretical Economics, (H. 2):301–324.

Shakshuki, E. M. and Reid, M. (2015). Multi-agent system applications in healthcare: current technology and future roadmap. In ANT/SEIT, pages 252–261.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.

Shalev-Shwartz, S. et al. (2011). Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194.

Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.

Shapley, L. S. (1953). Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100.

Shapley, L. S. (1974). A note on the lemke-howson algorithm. In Pivoting and Extension, pages 175–189. Springer.

Shi, W., Song, S., and Wu, C. (2019). Soft policy gradient method for maximum entropy deep reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3425–3431. AAAI Press.

Shoham, Y. and Leyton-Brown, K. (2008). Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press.

Shoham, Y., Powers, R., and Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial intelligence, 171(7):365–377.

Shub, M. (2013). Global stability of dynamical systems. Springer Science & Business Media.

Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196.

Sidford, A., Wang, M., Yang, L., and Ye, Y. (2020). Solving discounted stochastic two-player games with near-optimal time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 2992–3002.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML, pages 387–395.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354–359.

Simon, H. A. (1972). Theories of bounded rationality. Decision and organization, 1(1):161–176.

Singh, S. P., Kearns, M. J., and Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In UAI, pages 541–548.

Sirignano, J. and Spiliopoulos, K. (2020). Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752.

Smith, J. M. and Price, G. R. (1973). The logic of animal conflict. Nature, 246(5427):15–18.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. (2019). Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5887–5896.

Song, M., Montanari, A., and Nguyen, P. (2018). A mean field view of the landscape of two-layers neural networks. Proceedings of the National Academy of Sciences, 115:E7665–E7671.

Srebro, N., Sridharan, K., and Tewari, A. (2011). On the universality of online mirror descent. In Advances in neural information processing systems, pages 2645–2653.

Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., and Bowling, M. (2018). Actor-critic policy optimization in partially observable multiagent environments. In Advances in neural information processing systems, pages 3422–3435.

Stone, P. (2007). Multiagent learning is not the answer. It is the question. Artificial Intelligence, 171(7):402–405.

Stone, P. and Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383.

Subramanian, J. and Mahajan, A. (2019). Reinforcement learning in stationary mean-field games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 251–259.

Subramanian, S. G., Poupart, P., Taylor, M. E., and Hegde, N. (2020). Multi type mean field reinforcement learning. arXiv preprint arXiv:2002.02513.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V. F., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. (2018). Value-decomposition networks for cooperative multi-agent learning based on team reward. In AAMAS, pages 2085–2087.

Suttle, W., Yang, Z., Zhang, K., Wang, Z., Basar, T., and Liu, J. (2019). A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. arXiv preprint arXiv:1903.06372.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction, volume 1. MIT press Cambridge.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.

Swenson, B. and Poor, H. V. (2019). Smooth fictitious play in n × 2 potential games. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 1739–1743. IEEE.

Syed, U., Bowling, M., and Schapire, R. E. (2008). Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pages 1032–1039.

Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103.

Szepesvári, C. and Littman, M. L. (1999). A unified analysis of value-function-based reinforcement-learning algorithms. Neural computation, 11(8):2017–2060.

Szer, D., Charpillet, F., and Zilberstein, S. (2005). Maa*: A heuristic search algorithm for solving decentralized pomdps.

Sznitman, A.-S. (1991). Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX—1989, pages 165–251. Springer.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. (2015). Solving heads-up limit texas hold’em. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337.

Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7).

Terry, J. K., Black, B., Jayakumar, M., Hari, A., Santos, L., Dieffendahl, C., Williams, N. L., Lokesh, Y., Sullivan, R., Horsch, C., and Ravi, P. (2020). Pettingzoo: Gym for multi-agent reinforcement learning. arXiv preprint arXiv:2009.14471.

Tesauro, G. (1995). Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. (2019). Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12680–12691.

Thorndike, E. L. (1898). Animal intelligence: an experimental study of the associative processes in animals. The Psychological Review: Monograph Supplements, 2(4):i.

Tian, Z., Wen, Y., Gong, Z., Punakkath, F., Zou, S., and Wang, J. (2019). A regularized opponent model with maximum entropy objective. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 602–608. AAAI Press.

Toussaint, M., Charlin, L., and Poupart, P. (2008). Hierarchical pomdp controller optimization by likelihood maximization. In UAI, volume 24, pages 562–570.

Tsaknakis, H. and Spirakis, P. G. (2007). An optimization approach for approximate nash equilibria. In International Workshop on Web and Internet Economics, pages 42–56. Springer.

Tuyls, K. and Nowé, A. (2005). Evolutionary game theory and multi-agent reinforcement learning.

Tuyls, K. and Parsons, S. (2007). What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):406–416.

Tuyls, K., Perolat, J., Lanctot, M., Leibo, J. Z., and Graepel, T. (2018). A generalised method for empirical game theoretic analysis. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 77–85.

Tuyls, K. and Weiss, G. (2012). Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3):41–41.

uz Zaman, M. A., Zhang, K., Miehling, E., and Başar, T. (2020). Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. In 2020 American Control Conference (ACC), pages 333–339. IEEE.

Van Otterlo, M. and Wiering, M. (2012). Reinforcement learning and markov decision processes. In Reinforcement Learning, pages 3–42. Springer.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019a). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019b). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.

Viossat, Y. and Zapechelnyuk, A. (2013). No-regret dynamics and fictitious play. Journal of Economic Theory, 148(2):825–842.

Von Neumann, J. and Morgenstern, O. (1945). Theory of games and economic behavior. Princeton University Press Princeton, NJ.

Von Neumann, J. and Morgenstern, O. (2007). Theory of games and economic behavior (commemorative edition). Princeton university press.

Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations.

Wang, X. and Sandholm, T. (2003). Reinforcement learning to play an optimal nash equilibrium in team markov games. In Advances in neural information processing systems, pages 1603–1610.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.

Waugh, K., Morrill, D., Bagnell, J. A., and Bowling, M. (2014). Solving games with functional regret estimation. arXiv preprint arXiv:1411.7974.

Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 538–545.

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997.

Weiss, G. (1999). Multiagent systems: a modern approach to distributed artificial intelligence. MIT press.

Weiss, P. (1907). L'hypothèse du champ moléculaire et la propriété ferromagnétique.

Wellman, M. P. (2006). Methods for empirical game-theoretic analysis. In AAAI, pages 1552–1556.

Wen, Y., Yang, Y., Luo, R., and Wang, J. (2019). Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. IJCAI, pages arXiv–1901.

Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. (2018). Probabilistic recursive reasoning for multi-agent reinforcement learning. In International Conference on Learning Representations.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.

Wooldridge, M. (2009). An introduction to multiagent systems. John Wiley & Sons.

Wu, F., Zilberstein, S., and Chen, X. (2010). Rollout sampling policy iteration for decentralized pomdps. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 666–673.

Wu, F., Zilberstein, S., and Jennings, N. R. (2013). Monte-carlo expectation maximization for decentralized pomdps. In Twenty-Third International Joint Conference on Artificial Intelligence.

Yabu, Y., Yokoo, M., and Iwasaki, A. (2007). Multiagent planning with trembling-hand perfect equilibrium in multiagent pomdps. In Pacific Rim International Conference on Multi-Agents, pages 13–24. Springer.

Yadkori, Y. A., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. (2013). Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in neural information processing systems, pages 2508–2516.

Yang, J., Ye, X., Trivedi, R., Xu, H., and Zha, H. (2018a). Learning deep mean field games for modeling large population behavior. In International Conference on Learning Representations.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. (2018b). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580.

Yang, Y., Tutunov, R., Sakulwongtana, P., Ammar, H. B., and Wang, J. (2019a). Alpha- alpha-rank: Scalable multi-agent evaluation through evolution.

Yang, Y., Wen, Y., Chen, L., Wang, J., Shao, K., Mguni, D., and Zhang, W. (2020). Multi-agent determinantal q-learning. ICML.

Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019b). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, pages 8353–8365.

Yang, Z., Xie, Y., and Wang, Z. (2019c). A theoretical analysis of deep q-learning. arXiv preprint arXiv:1901.00137.

Ye, Y. (2005). A new complexity result on solving the markov decision problem. Mathematics of Operations Research, 30(3):733–749.

Ye, Y. (2010). The simplex method is strongly polynomial for the markov decision problem with a fixed discount rate.

Yongacoglu, B., Arslan, G., and Yüksel, S. (2019). Learning team-optimality for decentralized stochastic control and dynamic games. arXiv preprint arXiv:1903.05812.

Young, H. P. (1993). The evolution of conventions. Econometrica: Journal of the Econometric Society, pages 57–84.

Yu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757.

Zazo, S., Valcarcel Macua, S., Sánchez-Fernández, M., and Zazo, J. (2015). Dynamic potential games in communications: Fundamentals and applications. arXiv, pages arXiv–1509.

Zermelo, E. and Borel, E. (1913). On an application of set theory to the theory of the game of chess. In Congress of Mathematicians, pages 501–504. CUP.

Zhang, C. and Lesser, V. (2010). Multi-agent learning with policy prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24.

Zhang, G. and Yu, Y. (2019). Convergence of gradient methods on bilinear zero-sum games. arXiv e-prints, pages arXiv–1908.

Zhang, H., Chen, W., Huang, Z., Li, M., Yang, Y., Zhang, W., and Wang, J. (2019a). Bi-level actor-critic for multi-agent coordination. arXiv preprint arXiv:1909.03510.

Zhang, K., Sun, T., Tao, Y., Genc, S., Mallya, S., and Basar, T. (2020). Robust multi-agent reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 33.

Zhang, K., Yang, Z., and Basar, T. (2018a). Networked multi-agent reinforcement learning in continuous spaces. In 2018 IEEE Conference on Decision and Control (CDC), pages 2771–2776. IEEE.

Zhang, K., Yang, Z., and Başar, T. (2019b). Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635.

Zhang, K., Yang, Z., and Basar, T. (2019c). Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games. In Advances in Neural Information Processing Systems, pages 11602–11614.

Zhang, K., Yang, Z., Liu, H., Zhang, T., and Başar, T. (2018b). Finite-sample analysis for decentralized batch multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1812.02783.

Zhang, K., Yang, Z., Liu, H., Zhang, T., and Basar, T. (2018c). Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881.

Zhang, Y. and Zavlanos, M. M. (2019). Distributed off-policy actor-critic reinforcement learning with policy consensus. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4674–4679. IEEE.

Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. (2011). Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pages 262–270.

Zhou, M., Chen, Y., Wen, Y., Yang, Y., Su, Y., Zhang, W., Zhang, D., and Wang, J. (2019). Factorized q-learning for large-scale multi-agent systems. In Proceedings of the First International Conference on Distributed Artificial Intelligence, pages 1–7.

Zimin, A. and Neu, G. (2013). Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in neural information processing systems, pages 1583–1591.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 928–936.

Zinkevich, M., Greenwald, A., and Littman, M. L. (2006). Cyclic equilibria in markov games. In Advances in Neural Information Processing Systems, pages 1641–1648.

Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. (2008). Regret minimization in games with incomplete information. In Advances in neural information processing systems, pages 1729–1736.
