
Stable-Predictive Optimistic Counterfactual Regret Minimization

Gabriele Farina 1, Christian Kroer 2, Noam Brown 1, Tuomas Sandholm 1 3 4 5

1 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213. 2 IEOR Department, Columbia University, New York, NY 10027. 3 Strategic Machine, Inc. 4 Strategy Robot, Inc. 5 Optimized Markets, Inc. Correspondence to: Gabriele Farina <[email protected]>, Christian Kroer <[email protected]>, Noam Brown <[email protected]>, Tuomas Sandholm <[email protected]>.

Abstract

The CFR framework has been a powerful tool for solving large-scale extensive-form games in practice. However, the theoretical rate at which past CFR-based algorithms converge to a Nash equilibrium is on the order of O(T^{-1/2}), where T is the number of iterations. In contrast, first-order methods can be used to achieve a O(T^{-1}) dependence on iterations, yet these methods have been less successful in practice. In this work we present the first CFR variant that breaks the square-root dependence on iterations. By combining and extending recent advances on predictive and stable regret minimizers for the matrix-game setting, we show that it is possible to leverage "optimistic" regret minimizers to achieve a O(T^{-3/4}) convergence rate within CFR. This is achieved by introducing a new notion of stable-predictivity, and by setting the stability of each counterfactual regret minimizer relative to its location in the decision tree. Experiments show that this method is faster than the original CFR algorithm, although not as fast as newer variants, in spite of their worst-case O(T^{-1/2}) dependence on iterations.

1. Introduction

Counterfactual regret minimization (CFR) (Zinkevich et al., 2007) and later variants such as Monte-Carlo CFR (Lanctot et al., 2009), CFR+ (Tammelin et al., 2015), and Discounted CFR (Brown & Sandholm, 2019) have been the practical state of the art in solving large-scale zero-sum extensive-form games (EFGs) for the last decade. These algorithms were an essential ingredient for all recent milestones in the benchmark domain of poker (Bowling et al., 2015; Moravčík et al., 2017; Brown & Sandholm, 2017b). Despite this practical success, all known CFR variants have a significant theoretical drawback: their worst-case convergence rate is on the order of O(T^{-1/2}), where T is the number of iterations. In contrast, there exist first-order methods that converge at a rate of O(T^{-1}) (Hoda et al., 2010; Kroer et al., 2015; 2018b). However, these methods have been found to perform worse in practice than newer CFR algorithms such as CFR+, in spite of their theoretical advantage (Kroer et al., 2018b;a).

In this paper we present the first CFR variant that breaks the square-root dependence on the number of iterations. By leveraging recent theoretical breakthroughs on "optimistic" regret minimizers for the matrix-game setting, we show how to set up optimistic counterfactual regret minimizers at each information set such that the overall algorithm retains the properties needed to accelerate convergence. In particular, this leads to a predictive and stable variant of CFR that converges at a rate of O(T^{-3/4}).

The typical analysis of regret minimization leads to a convergence rate of O(T^{-1/2}) for solving zero-sum matrix games. However, by leveraging the idea of optimistic learning (Chiang et al., 2012; Rakhlin & Sridharan, 2013a;b; Syrgkanis et al., 2015; Wang & Abernethy, 2018), Rakhlin and Sridharan show in a series of papers that it is possible to converge at a rate of O(T^{-1}) by exploiting cancellations that occur under the optimistic mirror descent (OMD) algorithm (Rakhlin & Sridharan, 2013a;b). Syrgkanis et al. (2015) build on this idea and introduce the optimistic follow-the-regularized-leader (OFTRL) algorithm; they show that even when the players do not employ the same algorithm, a rate of O(T^{-3/4}) can be achieved as long as each player's algorithm belongs to a class of algorithms that satisfy a stability criterion and leverage predictability of the loss inputs. We build on this latter generalization. Because we can only perform the optimistic updates locally with respect to counterfactual regrets, we cannot achieve the cancellations that lead to a rate of O(T^{-1}); instead, we show that by carefully instantiating each counterfactual regret minimizer it is possible to maintain predictability and stability with respect to the overall decision-tree structure, thus leading to a convergence rate of O(T^{-3/4}).
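To make the OFTRL building block concrete, here is a minimal sketch (ours, not from the paper) of the vanilla OFTRL update of Syrgkanis et al. (2015) on a single probability simplex with the entropy regularizer, the same combination used in our experiments; the full algorithm described later additionally tunes the stability of each local minimizer according to its depth in the decision tree. The function name oftrl_simplex, the step size eta, and the standard prediction choice m_{t+1} = l_t are illustrative assumptions.

```python
import numpy as np

def oftrl_simplex(losses, eta=0.1):
    """Illustrative sketch of optimistic FTRL on the simplex.

    At step t the iterate is
        x_t = argmin_x  <x, L_{t-1} + m_t> + R(x) / eta,
    where L_{t-1} is the cumulative loss observed so far, m_t is a
    prediction of the upcoming loss, and R is the negative entropy.
    With this regularizer the argmin is the softmax computed below.
    """
    n = len(losses[0])
    cumulative = np.zeros(n)            # L_{t-1}
    prediction = np.zeros(n)            # m_1 = 0: nothing observed yet
    iterates = []
    for loss in losses:
        logits = -eta * (cumulative + prediction)
        logits -= logits.max()          # guard against overflow in exp
        x = np.exp(logits)
        iterates.append(x / x.sum())
        cumulative += loss
        prediction = np.asarray(loss)   # m_{t+1} = l_t (standard choice)
    return iterates
```

Setting the prediction to zero throughout recovers plain FTRL; the cancellation arguments cited above hinge on m_t being an accurate guess of the incoming loss l_t.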
In order to achieve these results we introduce a new variant of stable-predictivity, and show that each local counterfactual regret minimizer must have its stability set relative to its location in the overall strategy space, with regret minimizers deeper in the decision tree requiring more stability.

In addition to our theoretical results, we investigate the practical performance of our algorithm on several poker subgames from the Libratus AI, which beat top poker professionals (Brown & Sandholm, 2017b). We find that our CFR variant, coupled with the OFTRL algorithm and the entropy regularizer, converges faster than the vanilla CFR algorithm with regret matching, while it does not outperform the newer state-of-the-art algorithm Discounted CFR (DCFR) (Brown & Sandholm, 2019). This latter fact is not too surprising, as it has repeatedly been observed that CFR+, and the newer and faster DCFR, converge at a rate better than O(T^{-1}) for many practical games of interest, in spite of the worst-case rate of O(T^{-1/2}).

The reader may wonder why we care about breaking the square-root barrier within the CFR framework. It is well known that a convergence rate of O(T^{-1}) can be achieved outside the CFR framework. As mentioned previously, this can be done with first-order methods such as the excessive gap technique (Nesterov, 2005) or mirror prox (Nemirovski, 2004) combined with a dilated distance-generating function (Hoda et al., 2010; Kroer et al., 2015; 2018b). Despite this, there has been repeated interest in optimistic regret minimization within the CFR framework, due to the strong practical performance of CFR algorithms. Burch (2017) tries to implement CFR-like features in the context of O(T^{-1}) first-order methods and regret minimizers, while Brown & Sandholm (2019) experimentally try optimistic variants of regret minimizers in CFR. We stress that these prior results are only experimental; our results are the first to rigorously incorporate optimistic regret minimization into CFR, and the first to achieve a theoretical speedup.

Notation. Throughout the paper, we use the following notation when dealing with R^n. We use ⟨x, y⟩ to denote the dot product x^T y of two vectors x and y. We assume that a pair of dual norms ‖·‖, ‖·‖_* has been chosen; these norms need not be induced by inner products.

2. Sequential Decision Making

A sequential decision process can be viewed as a tree made up of two kinds of nodes: decision nodes, where the agent acts, and observation nodes, where the agent receives a signal from the process. At each decision node j ∈ J, the agent chooses a strategy from the simplex Δ^{n_j} of all probability distributions over the set A_j of n_j = |A_j| actions available at that decision node. An action is sampled according to the chosen distribution, and the agent then waits to play again. While waiting, the agent might receive a signal (observation) from the process; this possibility is represented with an observation node. At a generic observation point k ∈ K, the agent might receive n_k signals; the set of signals that the agent can observe is denoted as S_k. The observation node that is reached by the agent after picking action a ∈ A_j at decision point j ∈ J is denoted by ρ(j, a). Likewise, the decision node reached by the agent after observing signal s ∈ S_k at observation point k ∈ K is denoted by ρ(k, s). The set of all observation points reachable from j ∈ J is denoted as C_j := {ρ(j, a) : a ∈ A_j}. Similarly, the set of all decision points reachable from k ∈ K is denoted as C_k := {ρ(k, s) : s ∈ S_k}. To ease the notation, we will sometimes write C_{ja} to mean C_{ρ(j,a)}. A concrete example of a decision process is given in the next subsection.

At each decision point j ∈ J in a sequential decision process, the decision x̂_j ∈ Δ^{n_j} of the agent incurs an (expected) linear loss ⟨ℓ_j, x̂_j⟩. The expected loss throughout the whole process is therefore Σ_{j∈J} π_j ⟨ℓ_j, x̂_j⟩, where π_j is the probability of the agent reaching decision point j, defined as the product of the probabilities with which the agent plays each action on the path from the root of the process to j.
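As a concrete aid, the sketch below (our illustration, not part of the paper) encodes this tree structure and evaluates the expected loss Σ_{j∈J} π_j ⟨ℓ_j, x̂_j⟩ by a single recursive traversal that accumulates the reach probability π_j along the way. All class and attribute names are our own, and the loss vectors ℓ_j are taken as given inputs, since in a game they are induced by the opponents and the chance player.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ObservationNode:
    """Observation point k in K; children[s] is rho(k, s) for each signal s in S_k."""
    children: Dict[str, "DecisionNode"] = field(default_factory=dict)

@dataclass
class DecisionNode:
    """Decision point j in J with action set A_j and loss vector l_j.

    children[a] is rho(j, a); None marks an action after which the
    process ends.  policy is the agent's distribution x_j over A_j.
    """
    actions: List[str]
    loss: List[float]
    policy: List[float]
    children: Dict[str, Optional[ObservationNode]] = field(default_factory=dict)

def expected_loss(j: DecisionNode, reach: float = 1.0) -> float:
    """Returns the sum over decision points of pi_j * <l_j, x_j>.

    `reach` carries pi_j: the product of the agent's own action
    probabilities on the path from the root to j.  Matching the
    definition in the text, signal probabilities are not multiplied
    in; they are assumed to be folded into the loss vectors.
    """
    total = reach * sum(p * l for p, l in zip(j.policy, j.loss))
    for prob, a in zip(j.policy, j.actions):
        k = j.children.get(a)
        if k is None:                       # process ends after action a
            continue
        for child in k.children.values():   # every signal s in S_k
            total += expected_loss(child, reach * prob)
    return total
```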
In extensive-form games where all players have perfect recall (that is, they never forget their past moves or their observations), every player faces a sequential decision process. The loss vectors {ℓ_j} are defined based on the strategies of the opponent(s) as well as the chance player. However, as already observed by Farina et al. (2019), sequential decision processes are more general and can model other settings as well, such as POMDPs and MDPs in which the decision maker conditions on the entire history of observations and actions.

2.1. Example: Sequential Decision Process for the First Player in Kuhn Poker

As an illustration, consider the game of Kuhn poker (Kuhn, 1950).
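The paper's own walkthrough of this example is not reproduced here, so the sketch below is a hedged reconstruction based only on the standard rules of Kuhn poker, reusing the DecisionNode and ObservationNode classes from the earlier sketch: the first player observes one of three cards, then checks or bets; after a check met by an opponent bet, the player must fold or call. The zero loss vectors and uniform policies are placeholders.

```python
# Our reconstruction of player 1's decision process in Kuhn poker,
# reusing DecisionNode / ObservationNode from the sketch above.
# Losses are placeholder zeros; policies are placeholder uniforms.

def uniform(n):
    return [1.0 / n] * n

def player_one_subtree() -> DecisionNode:
    """Player 1 holds a card and may check or bet.

    After a bet, player 2 folds or calls and the hand ends, so there
    is no further player-1 decision node.  After a check, player 2
    either checks (the hand ends) or bets, and player 1 must then
    fold or call.
    """
    fold_or_call = DecisionNode(
        actions=["fold", "call"], loss=[0.0, 0.0], policy=uniform(2),
        children={"fold": None, "call": None},
    )
    # Only the "opponent bets" signal leads to another decision point;
    # the "opponent checks" signal ends the process, so it is omitted.
    after_check = ObservationNode(children={"opponent bets": fold_or_call})
    return DecisionNode(
        actions=["check", "bet"], loss=[0.0, 0.0], policy=uniform(2),
        children={"check": after_check, "bet": None},
    )

# The root of the process: player 1 observes the dealt card (3 signals).
root = ObservationNode(
    children={card: player_one_subtree() for card in ("jack", "queen", "king")}
)
```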