Actor-Critic Fictitious Play in Simultaneous Move Multistage Games
Julien Pérolat, Bilal Piot, Olivier Pietquin (Univ. Lille; now with DeepMind, London, UK)
AISTATS 2018 - 21st International Conference on Artificial Intelligence and Statistics, Playa Blanca, Lanzarote, Canary Islands, Spain. PMLR: Volume 84.
Abstract

Fictitious play is a game theoretic iterative procedure meant to learn an equilibrium in normal form games. However, this algorithm requires that each player has full knowledge of the other players' strategies. Using an architecture inspired by actor-critic algorithms, we build a stochastic approximation of the fictitious play process. This procedure is on-line, decentralized (an agent has no information on the others' strategies and rewards) and applies to multistage games (a generalization of normal form games). In addition, we prove convergence of our method towards a Nash equilibrium in both the case of zero-sum two-player multistage games and the case of cooperative multistage games. We also provide empirical evidence of the soundness of our approach on the game of Alesia, with and without function approximation.

1 Introduction

Go, Chess, Checkers and Oshi-Zumo [10] are just a few examples of Multistage games [7]. In these games, the interaction proceeds from stage to stage without looping back to a previously encountered situation. This model groups a broad class of multi-agent sequential decision processes where the interaction never goes back to the same state. This work focuses on Multi-Agent Reinforcement Learning (MARL) [11] in Multistage games. In this multi-agent environment, players evolve from state to state as a result of their mutual actions. During this interaction, all players receive a reward informing them on how good their action was when performed in the state they were in. The goal of MARL is to learn a strategy that maximally accumulates rewards over time. Whilst the problem is fairly well understood when studying single-agent RL, learning while independently interacting with other agents remains superficially explored. The range of open questions is so wide in that area [31] that it is worth giving a precise definition of our goal. In this paper, we follow a prescriptive agenda: we intend to find a learning algorithm that provably converges to a Nash equilibrium in cooperative and in non-cooperative games. The goal is to find a strategy that can be executed independently by each player and that corresponds to a Nash equilibrium. Many, if not most, approaches to this problem consider a centralized learning procedure that produces an independent strategy for each player [21]. Centralized learning procedures are quite common and often perform better than decentralized ones [13], but they require synchronization between agents during learning, which is their main limitation. The agenda we follow in this paper is therefore to propose a decentralized, on-line learning method that provably converges to a Nash equilibrium in self-play. Decentralized algorithms are appealing because they allow building identical, independent agents that rely on nothing but the observation of their own state and reward, with no central controller being required. On-line algorithms, on the other hand, allow learning while playing and do not require prior computation of possible strategies.

This agenda is a fertile ground of interaction between traditional RL and game theory. Indeed, RL aims at building autonomous agents that learn on-line in games against nature (where the environment is not interested in winning). For that reason, a wide variety of single-agent RL algorithms have been adapted to multi-agent problems. But several major issues prevent the direct use of standard RL with multi-agent systems. First, blindly applying single-agent RL in a decentralized fashion implies that, from each agent's point of view, the other agents are part of the environment. Such a hypothesis breaks the crucial RL assumption that the environment is (at least almost) stationary [22]. Second, it introduces partial observability, as each agent's knowledge is restricted to its own actions and rewards while its behavior should depend on the others' strategies.

Decentralized procedures (unlike counterfactual regret minimization algorithms [34]) have been the topic of many studies in game theory, and many approaches have been proposed, from policy hill climbing methods [8, 2] to evolutionary dynamics [33, 1] (related work is detailed in Sec. 2). But those dynamics do not converge in all general-sum normal form games, and there exists a three-player normal form game [15] for which no first-order uncoupled dynamics (i.e. most decentralized dynamics) can converge to a Nash equilibrium. Despite this counterexample, decentralized dynamics remain an important case to study, because building a central controller for a multi-agent system is not always possible, nor is observing the actions and rewards of every agent. Even if decentralized learning processes (as described in [15]) will never be guaranteed to converge in general, they should at least be guaranteed to converge in some interesting classes of games such as cooperative games and zero-sum two-player games.

Fictitious play is a model-based process that learns Nash equilibria in normal form games. It has been widely studied, and the required assumptions have been weakened over time [23, 17] since the original article of Robinson [28]. It has been extended to extensive form games (game trees) and, to a lesser extent, to function approximation [16]. However, it is neither on-line nor decentralized, except for the work of [23], which focuses on normal form games, and of [16], which has weak guarantees of convergence and focuses on turn-taking imperfect information games. Fictitious play enjoys several convergence guarantees [17], which makes it a good candidate for learning in simultaneous-move multistage games.
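To make the classical procedure concrete, the sketch below runs fictitious play on a small zero-sum normal form game (rock-paper-scissors; the payoff matrix and iteration count are chosen purely for illustration and are not taken from this paper). Each player best-responds to the empirical frequency of the opponent's past actions, which is precisely the full knowledge of the other players' strategies that a decentralized procedure cannot assume.

```python
import numpy as np

# Classical fictitious play on a zero-sum normal form game.
# Illustrative payoff matrix (rock-paper-scissors), not from the paper:
# A[i, j] is the row player's payoff; the column player receives -A[i, j].
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

counts = [np.ones(3), np.ones(3)]  # empirical counts of each player's past actions

for t in range(5000):
    empirical = [c / c.sum() for c in counts]   # opponents' empirical strategies
    br_row = np.argmax(A @ empirical[1])        # best response of the row player
    br_col = np.argmax(-A.T @ empirical[0])     # best response of the column player
    counts[0][br_row] += 1
    counts[1][br_col] += 1

# In zero-sum games the empirical strategies converge to a Nash equilibrium
# (here, approximately uniform over the three actions).
print(counts[0] / counts[0].sum(), counts[1] / counts[1].sum())
```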
This paper contributes to filling a gap in the MARL literature by providing two on-line, decentralized algorithms that converge to a Nash equilibrium in multistage games, both in the cooperative case and in the zero-sum two-player case. Those two cases used to be treated as different agendas since the seminal paper of Shoham et al. [31], and we expect our work to serve as a milestone towards reconciling them, going further than normal form games [23, 17]. Our first contribution is to propose two novel on-line and decentralized algorithms inspired by the actor-critic architecture: the first performs an off-policy control step whilst the second relies on a policy evaluation step. Although the actor-critic architecture is popular for its success in solving (continuous action) RL domains, we choose this architecture for a different reason: our framework requires handling non-stationarity (caused by the adaptation of the other players), which is another nice property of actor-critic architectures. Our algorithms are stochastic approximations of two dynamical systems that generalize the work of [23] and [17] on the fictitious play process from normal form games to multistage games [7].

In the following, we first outline related work (Sec. 2) and then describe the necessary background in both game theory and RL (Sec. 3) to introduce our first contribution, the two-timescale algorithms (Sec. 4). These algorithms are stochastic approximations of two continuous-time processes defined in Sec. 5. We then study (in Sec. 5) the asymptotic behavior of these continuous-time processes and show, as a second contribution, that they converge in self-play in cooperative games and in zero-sum two-player games. In Sec. 6, our third contribution proves that the algorithms are stochastic approximations of the two continuous-time processes. Finally, we perform an empirical evaluation (in Sec. 7).
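As a rough illustration of the two-timescale idea only (the precise update rules are given in Sec. 4; the step sizes, the softmax smoothing and the single-state setting below are simplifying assumptions, not the algorithms of this paper), a critic can be updated with a fast learning rate while the strategy moves on a slower timescale toward a best response to that critic, mirroring the fictitious play dynamics:

```python
import numpy as np

def softmax(x, temperature=0.1):
    z = np.exp((x - x.max()) / temperature)
    return z / z.sum()

def two_timescale_step(q, pi, action, reward, next_value,
                       alpha=0.1, beta=0.001, gamma=1.0):
    """One illustrative update for a single agent in a single state.

    q  : critic estimate of the action values (fast timescale).
    pi : the agent's mixed strategy, i.e. the actor (slow timescale).
    alpha >> beta, so the critic tracks the slowly changing strategies.
    """
    # Fast timescale: temporal-difference update of the critic for the
    # action that was actually played.
    q[action] += alpha * (reward + gamma * next_value - q[action])
    # Slow timescale: move the strategy a small step toward a smoothed
    # best response to the critic, mimicking fictitious play dynamics
    # of the form d(pi)/dt = BR(q) - pi.
    pi += beta * (softmax(q) - pi)
    pi /= pi.sum()  # keep pi a valid probability distribution
    return q, pi
```

The actual algorithms differ in the critic update (off-policy control versus policy evaluation) and operate over all states of the multistage game; this snippet only conveys the separation of timescales.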
2 Related Work

Decentralized reinforcement learning in games has been studied widely in the case of normal form games and includes regret minimization approaches [9, 12] and stochastic approximation algorithms [23]. However, to our knowledge, none of the previous methods have been extended to independent reinforcement learning in Markov Games, or in intermediate models such as MSGs, with guarantees of convergence in both the cooperative and the zero-sum case. Finding a single independent RL algorithm addressing both cases is still treated as two separate agendas since the seminal paper [31].

Q-Learning Like Algorithms: The adaptation of RL algorithms to the multi-agent setting was the first approach to address on-line learning in games. On-line algorithms like Q-learning [32] are often used in cooperative multi-agent learning environments but fail to learn a stationary strategy in simultaneous zero-sum two-player games. They fail in this setting because, in simultaneous zero-sum two-player games, it is not sufficient to use a greedy strategy to learn a Nash equilibrium. In [25], the Q-learning method is adapted to guarantee convergence in zero-sum two-player MGs.
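A minimal numerical illustration of this failure (not from the paper): in matching pennies, any deterministic (greedy) strategy is fully exploited by a best-responding opponent, whereas the mixed Nash equilibrium (1/2, 1/2) secures the value of the game.

```python
import numpy as np

# Matching pennies: the row player gets +1 when the actions match, -1 otherwise.
A = np.array([[ 1, -1],
              [-1,  1]])

def worst_case_payoff(pi_row):
    # Expected row payoff against a column player who best-responds to pi_row.
    return (pi_row @ A).min()

print(worst_case_payoff(np.array([1.0, 0.0])))  # -1.0: a deterministic (greedy) strategy is fully exploited
print(worst_case_payoff(np.array([0.5, 0.5])))  #  0.0: the mixed Nash equilibrium secures the game value
```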