<<

Extended Abstract AAMAS 2021, May 3-7, 2021, Online

Sound Algorithms in Imperfect Information Extended Abstract Michal Šustr Martin Schmid Matej Moravčík Czech Technical University DeepMind DeepMind DeepMind [email protected] [email protected] [email protected] Neil Burch Marc Lanctot Michael Bowling DeepMind DeepMind DeepMind [email protected] [email protected] [email protected]

ABSTRACT We introduce a framework for evaluating the performance of an Search has played a fundamental role in computer research online algorithm. Within this framework, we introduce the defini- since the very beginning. And while online search has been com- tion of a sound and 휖-sound algorithm. Like the exploitability of a monly used in perfect information games such as and Go, in the offline setting, the soundness of an algorithm isa online search methods for imperfect information games have only measure of if its performance against a worst case adversary. Impor- been introduced relatively recently. This paper addresses the ques- tantly, this notion collapses to the previous notion of exploitability tion of what is a sound online algorithm in an imperfect information when the algorithm follows a fixed strategy profile. setting of two-player zero-sum games? We argue that the fixed- We then introduce a consistency framework, a hierarchy that strategy definitions of exploitability and epsilon-Nash equilibria allows us to formally state in what sense an online algorithm plays are ill suited to measure the worst-case performance of an online “consistently” with an 휖-equilibrium. This allows to state multi- algorithm. We thus formalize epsilon-soundness, a concept that ple bounds on the soundness of the algorithm, based on the 휖- connects the worst-case performance of an online algorithm to equilibrium and the type of consistency. The stronger the consis- the performance of an epsilon-. Our definition tency in our hierarchy, the stronger the bounds. of soundness and the consistency hierarchy finally provide appro- A complete version of this paper can be found on arXiv [9]. priate tools to analyze online algorithms in repeated imperfect information games. We thus inspect some of the previous online 2 MOTIVATIONAL EXAMPLE algorithms in a new light, bringing new insights into their worst Consider a simple online algorithm for (r)ock (p)aper (s)cissors case performance guarantees. game that simply produces the sequence (푟,푝,푠,푟,푝,푠,...). While this algorithm is clearly highly exploitable, the worst-case perfor- KEYWORDS mance of this algorithm can only be properly analysed through the ; Imperfect Information; Nash Equilibrium; Online settings as no offline policy captures the dynamics Algorithm; Exploitability; Soundness of this algorithm. ACM Reference Format: Michal Šustr, Martin Schmid, Matej Moravčík, Neil Burch, Marc Lanctot, 3 BACKGROUND and Michael Bowling. 2021. Sound Algorithms in Imperfect Information We present our results using the recent formalism of factored- Games: Extended Abstract. In Proc. of the 20th International Conference on observations games [6], where factored-observations game is a Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, G ⟨N W 표 A T R O⟩ 2021, IFAAMAS, 3 pages. tuple = , ,푤 , , , ,

1 INTRODUCTION 3.1 Online Settings Online methods for approximating Nash equilibria in sequential The repeated game 푝 consists of a finite sequence of 푘 individual imperfect information games appeared only in the last few years [2– matches 푚 = (푧1, 푧2, . . . , 푧푘 ), where each match 푧푖 ∈ Z is a se- ( 0 0 1 1 푙푖 −1 4, 7, 8]. We thus investigate what it takes for an online algorithm quence of world states and actions 푧푖 = 푤푖 , 푎푖 , 푤푖 , 푎푖 . . . , 푎푖 , to be sound in imperfect information settings. While it has been 푙푖 ) 푤푖 . An online algorithm Ω then simply maps an information state known that search with imperfect information is more challenging observed during a match to a strategy, while possibly using its than with perfect information [5, 7], the problem is maybe more internal algorithm state (Def. 1). complex than previously thought. Online algorithms “live” in a 푘 Given two players Ω1, Ω2, we use 푃 to denote the distri- Ω1,Ω2 fundamentally different setting, and they need to be evaluated bution over all the possible repeated games 푚 of length 푘 when appropriately. these two players face each other. The average reward of 푚 is Í푘 Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems R푛 (푚) = 1/푘 푖=1 푢푛 (푧푖 ) and we denote E푚∼푃푘 [R푛 (푚)] to be Ω1,Ω2 (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online. the expected average reward when the players play 푘 matches. © 2021 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

1674 Extended Abstract AAMAS 2021, May 3-7, 2021, Online

Definition 1. Online algorithm Ω is a function S푛×Θ ↦→ Δ(A푛 (푠푛))× Definition 6. Algorithm Ω is globally consistent with 휖-equilibria if Θ, that maps an information state 푠푛 ∈ S푛 to the strategy 휎푛 (푠푛) ∈ 휖 ∀푘 ∀푚 = (푧1, 푧2, . . . , 푧푘 ) ∃휎 ∈ NE푛 ∀ℎ ∈ H (푧푖 ) Δ(A푛 (푠푛)), while possibly making use of algorithm’s state 휽 ∈ Θ (푧 , ..., 푧 ) and updating it. holds that Ω 1 푖−1 (푠(ℎ)) = 휎 (푠(ℎ)) for ∀푖 ∈ {1, . . . , 푘}. However: 4 SOUNDNESS OF ONLINE ALGORITHM Theorem 7. An algorithm that is globally consistent with 휖-equilibria Exploitability / 휖-equilibrium considers the expected utility of a might not be 휖-sound. fixed strategy against a worst-case adversary in a single match.We thus define a similar concept for the settings of an online algorithm But what if the algorithm keeps on playing the repeated game? in a repeated game: (푘, 휖)-soundness. Intuitively, an online algo- While the global consistency with equilibria does not guarantee rithm is (푘, 휖)-sound if and only if it is guaranteed the same reward soundness, it guarantees that the expected average reward con- as if it followed a fixed 휖-equilibrium after 푘 matches. verges to the game value in the limit. Definition 2. For an (푘, 휖)-sound online algorithm Ω, the expected Theorem 8. For an algorithm Ω that is globally consistent with average reward against any opponent is at least as good as if it fol- 휖-equilibria, lowed an 휖-Nash equilibrium fixed strategy 휎 for any number of S Δ ′ ∗ 1 matches 푘 : ∀푘 ∀Ω2 : E푚∼푃푘 [R(푚)] ≥ 푢 − 휖 − . (2) Ω,Ω2 푘 ′ ∀ ≥ ∀ ′ [R( )] ≥ ′ [R( )] 푘 푘 Ω2 : E푚∼푃푘 푚 E푚∼푃푘 푚 . (1) Corollary 9. An algorithm Ω that is globally consistent with 휖- Ω,Ω2 휎,Ω2 equilibria is (푘, 휖)-sound as 푘 → ∞. If algorithm Ω is (푘, 휖)-sound for ∀푘 ≥ 1, we say the algorithm is 휖-sound. 5.3 Strong Global Consistency 5 CONSISTENCY HIERARCHY Strong global consistency additionally guarantees that the game- play itself is generated consistently with an equilibrium; and as As the (푘, 휖)-soundness can often be infeasible to compute, we in global consistency, the partial strategies for this game-play also introduce the concept of consistency. This allows to formalize that correspond to an 휖-equilibrium. In other words, the online algorithm an algorithm plays “consistently” with an 휖-equilibrium, directly simply exactly follows a predefined equilibrium. bounding the (푘, 휖)-soundness. Definition 10. Online algorithm Ω is strongly globally consistent 5.1 Local Consistency with 휖-equilibrium if 휖 Local consistency simply guarantees that every time we query the ∃휎 ∈ NE푛 ∀푘 ∀푚 = (푧1, 푧2, . . . , 푧푘 ) ∀ℎ ∈ H (푧푘 ) online algorithm, there is an 휖-equilibrium that has the same local holds that Ω(푧1, ..., 푧푘−1) (푠(ℎ)) = 휎 (푠(ℎ)). behavioral strategy 휎 (푠) for the queried state 푠. Definition 3. Algorithm Ω is locally consistent with 휖-equilibria if Theorem 11. Online algorithm Ω that is strongly globally consis- tent with 휖-equilibrium is 휖-sound. 휖 ∀푘 ∀푚 = (푧1, 푧2, . . . , 푧 ) ∀ℎ ∈ H (푧 ) ∃휎 ∈ NE 푘 푘 푛 Canonical examples of strongly globally consistent online algo- ( ) holds that Ω 푧1, ..., 푧푘−1 (푠(ℎ)) = 휎 (푠(ℎ)). rithms are DeepStack/Libratus. In general, an algorithm that uses a notion of safe (continual) resolving is strongly globally consistent We then show that this link has different implication for perfect as it essentially re-solves some 휖-equilibrium (albeit an unknown and imperfect information games. one) that it follows. Another, more recent example is ReBeL [1], Theorem 4. An algorithm that is locally consistent with 휖-equilibria as it essentially imitates CFR-D iterations in conjunction with a might not be (푘, 휖)-sound. neural network at a test time. 6 REGRET AND (푘, 휖)-SOUNDNESS Theorem 5. In perfect information games, an algorithm that is locally consistent with a perfect equilibrium is sound. Finally, the introduced notion of soundness has tight connection to the popular concept of regret when a regret minimizer is used as A particularly interesting example of an algorithm that is only the online algorithm. locally consistent is Online Sampling [7] (OOS). In the full paper version [9] we show that although this algorithm was Corollary 12. Any regret minimizer with a regret bound of 푅푘 ( 푅푘 ) previously considered to be sound, it can produce highly exploitable is 푘, 2푘 -sound. strategies in imperfect information games. ACKNOWLEDGMENTS 5.2 Global Consistency Computational resources were supplied by the project "e-Infrastruktura Local consistency guaranteed consistency only for individual states. CZ" (e-INFRA LM2018140). This work was supported by Czech sci- Global consistency is a stronger criterion that guarantees consis- ence foundation grant no. 18-27483Y. tency with some equilibria for all the states in combination.

1675 Extended Abstract AAMAS 2021, May 3-7, 2021, Online

REFERENCES [1] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. 2020. Combining Deep Reinforcement Learning and Search for Imperfect-Information Games. arXiv preprint arXiv:2007.13544 (2020). [2] Noam Brown and Tuomas Sandholm. 2017. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems. 689–699. [3] Noam Brown and Tuomas Sandholm. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 6374 (2018), 418–424. [4] Noam Brown and Tuomas Sandholm. 2019. Superhuman AI for multiplayer poker. Science 365, 6456 (2019), 885–890. [5] Ian Frank and David Basin. 1998. Search in games with incomplete information: A case study using bridge card play. 100, 1-2 (1998), 87–123. [6] Vojtěch Kovařík, Martin Schmid, Neil Burch, Michael Bowling, and Viliam Lisý. 2019. Rethinking formal models of partially observable multiagent decision making. arXiv preprint arXiv:1906.11110 (2019). [7] Viliam Lisý, Marc Lanctot, and Michael Bowling. 2015. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multia- gent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 27–36. [8] Matej Moravcik, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 6337 (2017), 508–513. [9] Michal Šustr, Martin Schmid, Matej Moravčík, Neil Burch, Marc Lanctot, and Michael Bowling. 2020. Sound search in imperfect information games. arXiv preprint arXiv:2006.08740 (2020).

1676