Survey of Self-Play in Reinforcement Learning
Anthony DiGiovanni and Ethan Zell
Departments of Statistics and Mathematics, University of Michigan
{adigi,ezell}@umich.edu
arXiv:2107.02850v1 [cs.GT] 6 Jul 2021
July 8, 2021

ABSTRACT

In reinforcement learning (RL), the term self-play describes a kind of multi-agent learning (MAL) that deploys an algorithm against copies of itself to test compatibility in various stochastic environments. As is typical in MAL, the literature draws heavily from well-established concepts in classical game theory, and so this survey quickly reviews some fundamental concepts. In what follows, we present a brief survey of the self-play literature, its major themes, criteria, and techniques, and then conclude with an assessment of the current shortfalls of the literature as well as suggestions for future directions.

1 Introduction and Background

Multi-agent learning (MAL) research aims to study interacting systems of agents in a shared environment, the incentives that govern the agents' behavior, and the optimal way to behave in such environments. As we will see, the definitions of "optimality," "environment," and even "incentive" are crucial to developing useful problem statements and powerful solutions.

Several agendas with distinct objectives and evaluation criteria have been developed for MAL [19]. From the perspective of online and reinforcement learning (RL), it is natural to formulate MAL as the design of algorithms that can achieve a user's goals optimally (or minimize regret), conditional on an environment induced by the actions taken by other algorithms. Model classes are constructed for this induced environment, on a spectrum from fully stochastic and equivalent to a Markov decision process (MDP) [4], to fully adversarial and unpredictable [3]. Alternatively, game-theoretic approaches typically ask how to converge to an equilibrium, assuming play by other agents that is "rational" in some sense.

For the design of an algorithm that multiple users will want to adopt, a guiding principle is self-play compatibility (or simply "compatibility"): achieving some high performance benchmark when copies of the algorithm play each other [16]. Maintaining such performance while also adapting to non-self agents, in line with the RL agenda, presents a challenging research program. To appraise the state of compatibility research, particularly theoretical results, we present a survey of the literature on this problem and conjecture some implications of these works.

Our survey has revealed a number of limitations, open problems, and themes. First, even within the narrow MAL subproblem of compatibility, different subsets of the literature pursue disparate definitions of "good" performance (detailed in Section 3.2). These range from the relatively simple property of convergence to any Nash equilibrium of the stage game [4, 5], to the more demanding criteria of Pareto efficiency [16] in an equilibrium of the stage or repeated game, and fairness in the senses studied in cooperative bargaining theory [20]. We argue that research targeted at these latter criteria is practically essential, given that multiple users of a MAL algorithm will want it to achieve the highest rewards possible within the constraints of cooperation.

Second, finite-time guarantees (particularly regret bounds) of self-play performance are almost nonexistent, and the exceptions [13, 20] are not designed or proven to maintain high rewards against simple classes of non-self players. While several works prove polynomial-time convergence, the losses incurred during the time it takes for self-copies to coordinate on an equilibrium may be significant.

Third, most existing algorithms with both provable self-play compatibility and success against other classes of players use an approach that can be summarized as "propose, then adapt." The proposal is the solution that all self-copies eventually coordinate on, such as a Nash equilibrium. At least when the preferences of the other agents are known, satisfying any single property desired for MAL algorithms is not too demanding; the challenge lies in simultaneously maintaining multiple guarantees. Thus, many compatible algorithms identify a target class of non-self players to adapt to, in the event that an initial testing period gives evidence that the other player is not a copy.
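To make the "propose, then adapt" pattern more concrete, the sketch below gives one possible shape of such an algorithm's control flow. It is our own minimal illustration under simplifying assumptions (a symmetric game with a shared, fully observed state, a deterministic proposal, and a hypothetical environment interface), not the method of any particular paper cited above.

```python
def propose_then_adapt(env, proposed_policy, adapt_policy, test_steps=100, tol=0.05):
    """Schematic 'propose, then adapt' control flow (illustration only).

    Assumes a symmetric game with a shared observed state and a deterministic
    proposal, so a self-copy would choose the same action we do.
    proposed_policy(t, state) -> action  # e.g. a precomputed equilibrium strategy
    adapt_policy(t, state)    -> action  # fallback for a target class of non-self opponents
    env.step(action) -> (next_state, reward, opponent_action, done)  # hypothetical interface
    """
    state = env.reset()
    mismatches = 0

    # Testing period: play the proposal and count how often the opponent's
    # action differs from what a self-copy would have played.
    for t in range(test_steps):
        my_action = proposed_policy(t, state)
        state, _reward, opponent_action, done = env.step(my_action)
        if opponent_action != my_action:
            mismatches += 1
        if done:
            return proposed_policy

    # Too many deviations: conclude the opponent is not a copy and adapt to the
    # target class of non-self players; otherwise keep following the proposal.
    policy = proposed_policy if mismatches / test_steps <= tol else adapt_policy

    t, done = test_steps, False
    while not done:
        state, _reward, _opponent_action, done = env.step(policy(t, state))
        t += 1
    return policy
```

The interesting design questions, answered differently by the surveyed works, are what to propose (which equilibrium), how long to test, and which class of non-self opponents to adapt to.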
2 Research Aims

This survey explores and conjectures answers for the following questions:

• Which formal models of self-play are most theoretically tractable, and what proof techniques and tools are most commonly used? (For example, how do proofs of convergence in self-play [4, 5] overcome the difficulty of adaptive nonstationarity, which depends on the actions of each copy of the algorithm?)

• Because "optimal" self-play does not appear well-defined, what are the standard benchmarks for evaluating an RL algorithm's performance in self-play, especially in regret definitions? Which computational tradeoffs might exist for higher standards, e.g., achieving Pareto-efficient solutions [7, 20] versus any Nash equilibrium [5]?

• How do certain algorithms [4, 17, 20] maintain self-play compatibility while also guaranteeing safety against adversarial players and optimality against stochastic ones?

Our purpose is a focused treatment of self-play theory, rather than a broad list of summaries of the many papers in this field. Omission of a given work is therefore not an implicit judgment of its lack of importance.

3 Key Results

3.1 Background: Markov Games

Markov games extend classical game theory by incorporating an element of randomness or hidden values into a game-theoretic scenario. Since Markov games involve two or more deciding agents, they also extend the notion of an MDP. For our purposes, a Markov game will adhere to several simplifying assumptions. As usual in reinforcement learning, the game's players interact with the environment by choosing actions from a (finite) set $A_i$ and receiving a (possibly random) reward. Games are also repeated. Moreover, Markov games are often assumed to have an infinite or unknown horizon, with "enough" time for players to learn the game.

For simplicity, we discuss games that progress in discrete steps $t \in \mathbb{N}$, though it is possible to adapt many of these techniques to a continuous-time analogue. For player $i$ at time $t$, we denote the reward by $R_t^i$. Some papers view this reward as deterministic [7], while others view it as the distribution of a random variable. When the reward is random, players only receive a sample from the distribution and therefore must estimate what the distribution actually is. Often, $R_t^i$ will depend not only on the $i$th player's action, but also on the other players' actions. To emphasize this, we may write $R_t^i(a)$ where $a \in A := \prod_i A_i$. We will further restrict ourselves to the case of finitely many players, $\{1, \dots, n\}$, and finitely many states, $\{1, \dots, d\}$. For brevity, we will denote these sets by $[n]$ and $[d]$, respectively.
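As a purely illustrative reference point for this notation (not code from any surveyed work), the sketch below represents a finite Markov game with $n$ players, $d$ states, and tabulated mean rewards, from which players only ever observe noisy samples; the transition rule is a placeholder. All names and the interface are our own assumptions.

```python
import itertools
import random

class FiniteMarkovGame:
    """Minimal finite Markov game: n players, d states, finite action sets A_i.

    reward_means[(state, joint_action)] -> tuple of n mean rewards, one per player.
    transition(state, joint_action)     -> next state (placeholder dynamics here).
    """

    def __init__(self, n_players, n_states, action_sets, reward_means, noise=0.1):
        self.n = n_players
        self.d = n_states
        self.action_sets = action_sets    # list of n finite action sets A_1, ..., A_n
        self.reward_means = reward_means  # dict keyed by (state, joint_action)
        self.noise = noise

    def joint_actions(self):
        # A = A_1 x ... x A_n, the set of joint actions.
        return itertools.product(*self.action_sets)

    def sample_rewards(self, state, joint_action):
        # Players observe only a noisy sample of R_t^i(a), not its distribution.
        means = self.reward_means[(state, joint_action)]
        return [m + random.gauss(0.0, self.noise) for m in means]

    def transition(self, state, joint_action):
        # Placeholder: a real Markov game would specify P(s' | s, a).
        return random.randrange(self.d)

# Example: a two-player, one-state stage game where only the joint action (1, 1) pays off.
game = FiniteMarkovGame(
    n_players=2, n_states=1,
    action_sets=[[0, 1], [0, 1]],
    reward_means={(0, a): ((1.0, 1.0) if a == (1, 1) else (0.0, 0.0))
                  for a in itertools.product([0, 1], [0, 1])},
)
print(game.sample_rewards(0, (1, 1)))  # noisy samples around (1.0, 1.0)
```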
Definition 1. A strategy or policy $\pi_i$ for player $i$ is a map that takes as inputs the game stage and the player's possible positions and returns a probability distribution over actions:
$$\pi_i : \mathbb{N} \times [d] \to \mathcal{P}(A_i).$$
If the strategy $\pi_i$ is independent of time, we refer to it as a stationary strategy. We will denote the set of all strategies for player $i$ by $S_i$. One could also define a strategy that depends on the positions of all players.

In words, $\pi_i$ tells the $i$th player how to behave given its own state and the stage of the game. Player $i$ will then choose action $a \in A_i$ with probability $(\pi_i)_a$. Note that this definition includes deterministic strategies, since one could specify the probability of a particular action to be 1.

Definition 2. We define a strategy profile $\pi$ as the map which contains the information on the strategies of all players:
$$\pi := \pi_1 \times \cdots \times \pi_n : \mathbb{N} \times [d]^{[n]} \to \prod_{i \in [n]} \mathcal{P}(A_i).$$

In several of the papers we discuss, players are allowed to have access to different action sets [13]. Still, an important special case is called a repeated symmetric game (RSG).

Definition 3. A repeated Markov game is called symmetric if, for any strategy profile $\pi$ and any permutation of the players $\sigma \in S_n$,
$$R^{\sigma(i)}(\pi_{-i}, \pi_i) = R^{i}(\pi_{\sigma(-i)}, \pi_{\sigma(i)})$$
for all $i \in [n]$. By $\pi_{-i}$ we mean the profile $\pi$ with the $i$th entry omitted. Similarly, by $\sigma(-i)$ we mean the permutation $\sigma$ applied to all players except the $i$th.

In both symmetric and general games, the concept of Nash equilibrium (NE) arises naturally. More generally,

Definition 4. Let $\varepsilon \ge 0$. A strategy profile $\pi^*$ is called an $\varepsilon$-NE if, for all players $i \in [n]$ and all strategies $\pi_i \in S_i$,
$$\mathbb{E}[R_t^i(\pi_i, \pi_{-i}^*)] \le \mathbb{E}[R_t^i(\pi^*)] + \varepsilon.$$
When $\varepsilon = 0$, this is the usual notion of NE.

In classical game theory, NE do an excellent job of assessing when players have no incentive to deviate from established behavior. Unfortunately, NE may be sub-optimal. For instance, consider the Stag Hunt game (Table 1). When both players cooperate and hunt the stag, each receives a higher reward than from any other outcome in the game. However, if one player defects and hunts the rabbit instead, the other cannot successfully hunt the stag alone. Here, we can check that there are two pure-strategy NE: (Stag, Stag) and (Rabbit, Rabbit). Yet (Rabbit, Rabbit) yields only a fraction of the maximum reward. Hence, we introduce alternative notions to test for optimality.

3.2 Selected Evaluation Criteria in Multi-Agent Learning

The literature on self-play features various solution concepts. We present both the self-play criteria and complementary criteria for other-play.
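Before detailing these criteria, the Stag Hunt example makes Definition 4 easy to check mechanically. The sketch below uses illustrative payoffs of our own choosing (not the values in Table 1) to enumerate the pure strategy profiles and test whether any unilateral deviation gains more than $\varepsilon$; it recovers the two pure NE, one of which is Pareto-dominated, illustrating why criteria beyond equilibrium convergence are needed.

```python
import itertools

# Illustrative Stag Hunt payoffs (not the values from Table 1): hunting the stag
# together is best for both; hunting the rabbit is safe but strictly worse.
PAYOFFS = {
    ("Stag", "Stag"):     (4, 4),
    ("Stag", "Rabbit"):   (0, 3),
    ("Rabbit", "Stag"):   (3, 0),
    ("Rabbit", "Rabbit"): (3, 3),
}
ACTIONS = ["Stag", "Rabbit"]

def is_epsilon_ne(profile, eps=0.0):
    """Check Definition 4 for a pure strategy profile of this 2-player stage game:
    no unilateral deviation gains more than eps."""
    for i in range(2):
        current = PAYOFFS[profile][i]
        for deviation in ACTIONS:
            deviated = list(profile)
            deviated[i] = deviation
            if PAYOFFS[tuple(deviated)][i] > current + eps:
                return False
    return True

pure_ne = [p for p in itertools.product(ACTIONS, ACTIONS) if is_epsilon_ne(p)]
print(pure_ne)  # [('Stag', 'Stag'), ('Rabbit', 'Rabbit')]; the second is Pareto-dominated
```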