Sound Algorithms in Imperfect Information Games
Total Page:16
File Type:pdf, Size:1020Kb
Extended Abstract AAMAS 2021, May 3-7, 2021, Online Sound Algorithms in Imperfect Information Games Extended Abstract Michal Šustr Martin Schmid Matej Moravčík Czech Technical University DeepMind DeepMind DeepMind [email protected] [email protected] [email protected] Neil Burch Marc Lanctot Michael Bowling DeepMind DeepMind DeepMind [email protected] [email protected] [email protected] ABSTRACT We introduce a framework for evaluating the performance of an Search has played a fundamental role in computer game research online algorithm. Within this framework, we introduce the defini- since the very beginning. And while online search has been com- tion of a sound and n-sound algorithm. Like the exploitability of a monly used in perfect information games such as Chess and Go, strategy in the offline setting, the soundness of an algorithm isa online search methods for imperfect information games have only measure of if its performance against a worst case adversary. Impor- been introduced relatively recently. This paper addresses the ques- tantly, this notion collapses to the previous notion of exploitability tion of what is a sound online algorithm in an imperfect information when the algorithm follows a fixed strategy profile. setting of two-player zero-sum games? We argue that the fixed- We then introduce a consistency framework, a hierarchy that strategy definitions of exploitability and epsilon-Nash equilibria allows us to formally state in what sense an online algorithm plays are ill suited to measure the worst-case performance of an online “consistently” with an n-equilibrium. This allows to state multi- algorithm. We thus formalize epsilon-soundness, a concept that ple bounds on the soundness of the algorithm, based on the n- connects the worst-case performance of an online algorithm to equilibrium and the type of consistency. The stronger the consis- the performance of an epsilon-Nash equilibrium. Our definition tency in our hierarchy, the stronger the bounds. of soundness and the consistency hierarchy finally provide appro- A complete version of this paper can be found on arXiv [9]. priate tools to analyze online algorithms in repeated imperfect information games. We thus inspect some of the previous online 2 MOTIVATIONAL EXAMPLE algorithms in a new light, bringing new insights into their worst Consider a simple online algorithm for (r)ock (p)aper (s)cissors case performance guarantees. game that simply produces the sequence ¹A,?,B,A,?,B,...º. While this algorithm is clearly highly exploitable, the worst-case perfor- KEYWORDS mance of this algorithm can only be properly analysed through the Game Theory; Imperfect Information; Nash Equilibrium; Online repeated game settings as no offline policy captures the dynamics Algorithm; Exploitability; Soundness of this algorithm. ACM Reference Format: Michal Šustr, Martin Schmid, Matej Moravčík, Neil Burch, Marc Lanctot, 3 BACKGROUND and Michael Bowling. 2021. Sound Algorithms in Imperfect Information We present our results using the recent formalism of factored- Games: Extended Abstract. In Proc. of the 20th International Conference on observations games [6], where factored-observations game is a Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, G hN W > A T R Oi 2021, IFAAMAS, 3 pages. tuple = , ,F , , , , 1 INTRODUCTION 3.1 Online Settings Online methods for approximating Nash equilibria in sequential The repeated game ? consists of a finite sequence of : individual imperfect information games appeared only in the last few years [2– matches < = ¹I1,I2,...,I: º, where each match I8 2 Z is a se- ¹ 0 0 1 1 ;8 −1 4, 7, 8]. We thus investigate what it takes for an online algorithm quence of world states and actions I8 = F8 , 08 ,F8 , 08 . , 08 , to be sound in imperfect information settings. While it has been ;8 º F8 . An online algorithm Ω then simply maps an information state known that search with imperfect information is more challenging observed during a match to a strategy, while possibly using its than with perfect information [5, 7], the problem is maybe more internal algorithm state (Def. 1). complex than previously thought. Online algorithms “live” in a : Given two players Ω1, Ω2, we use % to denote the distri- Ω1,Ω2 fundamentally different setting, and they need to be evaluated bution over all the possible repeated games < of length : when appropriately. these two players face each other. The average reward of < is Í: Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems R= ¹<º = 1/: 8=1 D= ¹I8 º and we denote E<∼%: »R= ¹<º¼ to be Ω1,Ω2 (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online. the expected average reward when the players play : matches. © 2021 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved. 1674 Extended Abstract AAMAS 2021, May 3-7, 2021, Online Definition 1. Online algorithm Ω is a function S=×Θ 7! Δ¹A= ¹B=ºº× Definition 6. Algorithm Ω is globally consistent with n-equilibria if Θ, that maps an information state B= 2 S= to the strategy f= ¹B=º 2 n 8: 8< = ¹I1,I2,...,I: º 9f 2 NE= 8ℎ 2 H ¹I8 º Δ¹A= ¹B=ºº, while possibly making use of algorithm’s state ) 2 Θ ¹I , ..., I º and updating it. holds that Ω 1 8−1 ¹B¹ℎºº = f ¹B¹ℎºº for 88 2 f1,...,:g. However: 4 SOUNDNESS OF ONLINE ALGORITHM Theorem 7. An algorithm that is globally consistent with n-equilibria Exploitability / n-equilibrium considers the expected utility of a might not be n-sound. fixed strategy against a worst-case adversary in a single match.We thus define a similar concept for the settings of an online algorithm But what if the algorithm keeps on playing the repeated game? in a repeated game: ¹:, nº-soundness. Intuitively, an online algo- While the global consistency with equilibria does not guarantee rithm is ¹:, nº-sound if and only if it is guaranteed the same reward soundness, it guarantees that the expected average reward con- as if it followed a fixed n-equilibrium after : matches. verges to the game value in the limit. Definition 2. For an ¹:, nº-sound online algorithm Ω, the expected Theorem 8. For an algorithm Ω that is globally consistent with average reward against any opponent is at least as good as if it fol- n-equilibria, lowed an n-Nash equilibrium fixed strategy f for any number of S Δ 0 ∗ 1 matches : : 8: 8Ω2 : E<∼%: »R¹<º¼ ≥ D − n − . (2) Ω,Ω2 : 0 8 ≥ 8 0 »R¹ º¼ ≥ 0 »R¹ º¼ : : Ω2 : E<∼%: < E<∼%: < . (1) Corollary 9. An algorithm Ω that is globally consistent with n- Ω,Ω2 f,Ω2 equilibria is ¹:, nº-sound as : ! 1. If algorithm Ω is ¹:, nº-sound for 8: ≥ 1, we say the algorithm is n-sound. 5.3 Strong Global Consistency 5 CONSISTENCY HIERARCHY Strong global consistency additionally guarantees that the game- play itself is generated consistently with an equilibrium; and as As the ¹:, nº-soundness can often be infeasible to compute, we in global consistency, the partial strategies for this game-play also introduce the concept of consistency. This allows to formalize that correspond to an n-equilibrium. In other words, the online algorithm an algorithm plays “consistently” with an n-equilibrium, directly simply exactly follows a predefined equilibrium. bounding the ¹:, nº-soundness. Definition 10. Online algorithm Ω is strongly globally consistent 5.1 Local Consistency with n-equilibrium if n Local consistency simply guarantees that every time we query the 9f 2 NE= 8: 8< = ¹I1,I2,...,I: º 8ℎ 2 H ¹I: º online algorithm, there is an n-equilibrium that has the same local holds that Ω¹I1, ..., I:−1º ¹B¹ℎºº = f ¹B¹ℎºº. behavioral strategy f ¹Bº for the queried state B. Definition 3. Algorithm Ω is locally consistent with n-equilibria if Theorem 11. Online algorithm Ω that is strongly globally consis- tent with n-equilibrium is n-sound. n 8: 8< = ¹I1,I2,...,I º 8ℎ 2 H ¹I º 9f 2 NE : : = Canonical examples of strongly globally consistent online algo- ¹ º holds that Ω I1, ..., I:−1 ¹B¹ℎºº = f ¹B¹ℎºº. rithms are DeepStack/Libratus. In general, an algorithm that uses a notion of safe (continual) resolving is strongly globally consistent We then show that this link has different implication for perfect as it essentially re-solves some n-equilibrium (albeit an unknown and imperfect information games. one) that it follows. Another, more recent example is ReBeL [1], Theorem 4. An algorithm that is locally consistent with n-equilibria as it essentially imitates CFR-D iterations in conjunction with a might not be ¹:, nº-sound. neural network at a test time. 6 REGRET AND ¹:, nº-SOUNDNESS Theorem 5. In perfect information games, an algorithm that is locally consistent with a subgame perfect equilibrium is sound. Finally, the introduced notion of soundness has tight connection to the popular concept of regret when a regret minimizer is used as A particularly interesting example of an algorithm that is only the online algorithm. locally consistent is Online Outcome Sampling [7] (OOS). In the full paper version [9] we show that although this algorithm was Corollary 12. Any regret minimizer with a regret bound of ': ¹ ': º previously considered to be sound, it can produce highly exploitable is :, 2: -sound. strategies in imperfect information games. ACKNOWLEDGMENTS 5.2 Global Consistency Computational resources were supplied by the project "e-Infrastruktura Local consistency guaranteed consistency only for individual states. CZ" (e-INFRA LM2018140). This work was supported by Czech sci- Global consistency is a stronger criterion that guarantees consis- ence foundation grant no. 18-27483Y. tency with some equilibria for all the states in combination.