Stable-Predictive Optimistic Counterfactual Regret Minimization

Gabriele Farina^1   Christian Kroer^2   Noam Brown^1   Tuomas Sandholm^{1,3,4,5}

Abstract

The CFR framework has been a powerful tool for solving large-scale extensive-form games in practice. However, the theoretical rate at which past CFR-based algorithms converge to the Nash equilibrium is on the order of $O(T^{-1/2})$, where $T$ is the number of iterations. In contrast, first-order methods can be used to achieve a $O(T^{-1})$ dependence on iterations, yet these methods have been less successful in practice. In this work we present the first CFR variant that breaks the square-root dependence on iterations. By combining and extending recent advances on predictive and stable regret minimizers for the matrix-game setting we show that it is possible to leverage "optimistic" regret minimizers to achieve a $O(T^{-3/4})$ convergence rate within CFR. This is achieved by introducing a new notion of stable-predictivity, and by setting the stability of each counterfactual regret minimizer relative to its location in the decision tree. Experiments show that this method is faster than the original CFR algorithm, although not as fast as newer variants, in spite of their worst-case $O(T^{-1/2})$ dependence on iterations.

^1 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213. ^2 IEOR Department, Columbia University, New York, NY 10027. ^3 Strategic Machine, Inc. ^4 Strategy Robot, Inc. ^5 Optimized Markets, Inc. Correspondence to: Gabriele Farina <[email protected]>, Christian Kroer, Noam Brown, Tuomas Sandholm.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Counterfactual regret minimization (CFR) (Zinkevich et al., 2007) and later variants such as Monte-Carlo CFR (Lanctot et al., 2009), CFR+ (Tammelin et al., 2015), and Discounted CFR (Brown & Sandholm, 2019), have been the practical state-of-the-art in solving large-scale zero-sum extensive-form games (EFGs) for the last decade. These algorithms were used as an essential ingredient for all recent milestones in the benchmark domain of poker (Bowling et al., 2015; Moravčík et al., 2017; Brown & Sandholm, 2017b). Despite this practical success, all known CFR variants have a significant theoretical drawback: their worst-case convergence rate is on the order of $O(T^{-1/2})$, where $T$ is the number of iterations. In contrast to this, there exist first-order methods that converge at a rate of $O(T^{-1})$ (Hoda et al., 2010; Kroer et al., 2015; 2018b). However, these methods have been found to perform worse than newer CFR algorithms such as CFR+, in spite of their theoretical advantage (Kroer et al., 2018b;a).

In this paper we present the first CFR variant which breaks the square-root dependence on the number of iterations. By leveraging recent theoretical breakthroughs on "optimistic" regret minimizers for the matrix-game setting, we show how to set up optimistic counterfactual regret minimizers at each information set such that the overall algorithm retains the properties needed in order to accelerate convergence. In particular, this leads to a predictive and stable variant of CFR that converges at a rate of $O(T^{-3/4})$.

Typical analysis of regret minimization leads to a convergence rate of $O(T^{-1/2})$ for solving zero-sum matrix games. However, by leveraging the idea of optimistic learning (Chiang et al., 2012; Rakhlin & Sridharan, 2013a;b; Syrgkanis et al., 2015; Wang & Abernethy, 2018), Rakhlin and Sridharan show in a series of papers that it is possible to converge at a rate of $O(T^{-1})$ when leveraging cancellations that occur due to the optimistic mirror descent (OMD) algorithm (Rakhlin & Sridharan, 2013a;b). Syrgkanis et al. (2015) build on this idea, and introduce the optimistic follow-the-regularized-leader (OFTRL) algorithm; they show that a rate of $O(T^{-3/4})$ can be achieved even when the players do not employ the same algorithm, as long as each algorithm belongs to a class of algorithms that satisfy a stability criterion and leverage predictability of loss inputs. We build on this latter generalization. Because we can only perform the optimistic updates locally with respect to counterfactual regrets, we cannot achieve the cancellations that lead to a rate of $O(T^{-1})$. Instead, we show that by appropriately instantiating each counterfactual regret minimizer it is possible to maintain predictability and stability with respect to the overall decision-tree structure, thus leading to a convergence rate of $O(T^{-3/4})$. In order to achieve these results we introduce a new variant of stable-predictivity, and show that each local counterfactual regret minimizer must have its stability set relative to its location in the overall strategy space, with regret minimizers deeper in the decision tree requiring more stability.

In addition to our theoretical results we investigate the practical performance of our algorithm on several poker subgames from the Libratus AI, which beat top poker professionals (Brown & Sandholm, 2017b). We find that our CFR variant coupled with the OFTRL algorithm and the entropy regularizer leads to a better convergence rate than the vanilla CFR algorithm with regret matching, while it does not outperform the newer state-of-the-art algorithm Discounted CFR (DCFR) (Brown & Sandholm, 2019). This latter fact is not too surprising, as it has repeatedly been observed that CFR+, and the newer and faster DCFR, converge at a rate better than $O(T^{-1})$ for many practical games of interest, in spite of the worst-case rate of $O(T^{-1/2})$.

The reader may wonder why we care about breaking the square-root barrier within the CFR framework. It is well known that a convergence rate of $O(T^{-1})$ can be achieved outside the CFR framework. As mentioned previously, this can be done with first-order methods such as the excessive gap technique (Nesterov, 2005) or mirror prox (Nemirovski, 2004) combined with a dilated distance-generating function (Hoda et al., 2010; Kroer et al., 2015; 2018b). Despite this, there has been repeated interest in optimistic regret minimization within the CFR framework, due to the strong practical performance of CFR algorithms. Burch (2017) tries to implement CFR-like features in the context of $O(T^{-1})$ FOMs and regret minimizers, while Brown & Sandholm (2019) experimentally try optimistic variants of regret minimizers in CFR. We stress that these prior results are only experimental; our results are the first to rigorously incorporate optimistic regret minimization in CFR, and the first to achieve a theoretical speedup.

Notation. Throughout the paper, we use the following notation when dealing with $\mathbb{R}^n$. We use $\langle x, y\rangle$ to denote the dot product $x^\top y$ of two vectors $x$ and $y$. We assume that a pair of dual norms $\|\cdot\|, \|\cdot\|_*$ has been chosen. These norms need not be induced by inner products. Common examples of such norm pairs are the $\ell_2$ norm, which is self-dual, and the $\ell_1, \ell_\infty$ norms, which are dual to each other. We will make explicit use of the 2-norm: $\|x\|_2 := \sqrt{\langle x, x\rangle}$.

2. Sequential Decision Making and EFG Strategy Spaces

A sequential decision process can be thought of as a tree consisting of two types of nodes: decision nodes and observation nodes. The set of all decision nodes is denoted as $\mathcal{J}$, and the set of all observation nodes with $\mathcal{K}$. At each decision node $j \in \mathcal{J}$, the agent chooses a strategy from the simplex $\Delta^{n_j}$ of all probability distributions over the set $A_j$ of $n_j = |A_j|$ actions available at that decision node. An action is sampled according to the chosen distribution, and the agent then waits to play again. While waiting, the agent might receive a signal (observation) from the process; this possibility is represented with an observation node. At a generic observation point $k \in \mathcal{K}$, the agent might receive $n_k$ signals; the set of signals that the agent can observe is denoted as $S_k$. The observation node that is reached by the agent after picking action $a \in A_j$ at decision point $j \in \mathcal{J}$ is denoted by $\rho(j, a)$. Likewise, the decision node reached by the agent after observing signal $s \in S_k$ at observation point $k \in \mathcal{K}$ is denoted by $\rho(k, s)$. The set of all observation points reachable from $j \in \mathcal{J}$ is denoted as $C_j := \{\rho(j, a) : a \in A_j\}$. Similarly, the set of all decision points reachable from $k \in \mathcal{K}$ is denoted as $C_k := \{\rho(k, s) : s \in S_k\}$. To ease the notation, sometimes we will use the notation $C_{ja}$ to mean $C_{\rho(j,a)}$. A concrete example of a decision process is given in the next subsection.

At each decision point $j \in \mathcal{J}$ in a sequential decision process, the decision $\hat{x}_j \in \Delta^{n_j}$ of the agent incurs an (expected) linear loss $\langle \ell_j, \hat{x}_j\rangle$. The expected loss throughout the whole process is therefore $\sum_{j\in\mathcal{J}} \pi_j \langle \ell_j, \hat{x}_j\rangle$, where $\pi_j$ is the probability of the agent reaching decision point $j$, defined as the product of the probabilities with which the agent plays each action on the path from the root of the process to $j$.

In extensive-form games where all players have perfect recall (that is, they never forget about their past moves or their observations), all players face a sequential decision process. The loss vectors $\{\ell_j\}$ are defined based on the strategies of the opponent(s) as well as the chance player. However, as already observed by Farina et al. (2019), sequential decision processes are more general and can model other settings as well, such as POMDPs and MDPs when the decision maker conditions on the entire history of observations and actions.

2.1. Example: Sequential Decision Process for the First Player in Kuhn Poker

As an illustration, consider the game of Kuhn poker (Kuhn, 1950). Kuhn poker consists of a three-card deck: king, queen, and jack. Each player first has to put a payment of 1 into the pot. Each player is then dealt one of the three cards, and the third is put aside unseen. A single round of betting then occurs. The sequential decision process for Player 1 is shown in Figure 1, where ⊗ denotes an observation point. In that example, we have: $\mathcal{J} = \{X_0, X_1, X_2, X_3, X_4, X_5, X_6\}$; $n_0 = 1$; $n_j = 2$ for all $j \in \mathcal{J}\setminus\{X_0\}$; $A_{X_0} = \{\text{start}\}$, $A_{X_1} = A_{X_2} = A_{X_3} = \{\text{check}, \text{raise}\}$, $A_{X_4} = A_{X_5} = A_{X_6} = \{\text{fold}, \text{call}\}$; $C_{\rho(X_0,\text{start})} = \{X_1, X_2, X_3\}$, $C_{\rho(X_1,\text{raise})} = \emptyset$, $C_{\rho(X_3,\text{check})} = \{X_6\}$; etc.
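For illustration only, the following minimal sketch (not from the original paper) shows one way such a decision process could be represented in code. The `DecisionNode`/`ObservationNode` classes, the node names, and the exact branching after "check" (which opponent-raise leads to which of $X_4, X_5, X_6$) are illustrative assumptions consistent with the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ObservationNode:
    """An observation point k: maps each signal s to the next decision node rho(k, s)."""
    children: Dict[str, "DecisionNode"] = field(default_factory=dict)

@dataclass
class DecisionNode:
    """A decision point j: maps each action a in A_j to the observation point rho(j, a)."""
    name: str
    children: Dict[str, ObservationNode] = field(default_factory=dict)

    @property
    def actions(self) -> List[str]:
        return list(self.children.keys())  # A_j, so n_j = len(self.actions)

# Sketch of the Figure 1 tree: X0 --start--> {X1, X2, X3} (one per dealt card);
# X1..X3 offer {check, raise}; X4..X6 (reached if the opponent raises after a check) offer {fold, call}.
X4, X5, X6 = (DecisionNode(n, {"fold": ObservationNode(), "call": ObservationNode()})
              for n in ("X4", "X5", "X6"))
X1 = DecisionNode("X1", {"check": ObservationNode({"opponent raises": X4}), "raise": ObservationNode()})
X2 = DecisionNode("X2", {"check": ObservationNode({"opponent raises": X5}), "raise": ObservationNode()})
X3 = DecisionNode("X3", {"check": ObservationNode({"opponent raises": X6}), "raise": ObservationNode()})
X0 = DecisionNode("X0", {"start": ObservationNode({"jack": X1, "queen": X2, "king": X3})})
```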

Figure 1. The sequential decision process for the first player in the game of Kuhn poker. ⊗ denotes an observation point; small dots represent the end of the decision process. (The figure shows the root $X_0$ with action "start", the jack/queen/king observation branches leading to $X_1, X_2, X_3$ with actions check/raise, and the nodes $X_4, X_5, X_6$ with actions fold/call.)

2.2. Sequence Form for Sequential Decision Processes

The expected loss for a given strategy, as defined in Section 2, is non-linear in the vector of decision variables $(\hat{x}_j)_{j\in\mathcal{J}}$. This non-linearity is due to the product $\pi_j$ of the probabilities of all actions on the path from the root to $j$. We now present a well-known alternative representation of this decision space which preserves linearity.

The alternative formulation is called the sequence form. In the sequence-form representation, the simplex strategy space at a generic decision point $j \in \mathcal{J}$ is scaled by the value of the decision variable of the last action on the path from the root of the process to $j$. In this formulation, the value of a particular action represents the probability of playing the whole sequence of actions from the root to that action. This allows each term in the expected loss to be weighted only by the sequence ending in the corresponding action. The sequence form has been used to instantiate linear programming (von Stengel, 1996) and first-order methods (Hoda et al., 2010; Kroer et al., 2015; 2018b) for computing Nash equilibria of zero-sum EFGs. There is a straightforward mapping between a vector of decisions $(\hat{x}_j)_{j\in\mathcal{J}}$, one for each decision point, and its corresponding sequence form: simply assign each sequence the product of the probabilities of all actions in the sequence. We will let $X^\triangle$ denote the sequence-form representation of a vector of decisions $(\hat{x}_j)_{j\in\mathcal{J}}$. Likewise, going from a sequence-form strategy $x^\triangle \in X^\triangle$ to a corresponding vector of decisions $(\hat{x}_j)_{j\in\mathcal{J}}$ can be done by dividing each entry (sequence) in $x^\triangle$ by the value $x^\triangle_{p_j}$, where $p_j$ is the entry in $x^\triangle$ corresponding to the unique last action that the agent took before reaching $j$.

Formally, the sequence-form representation $X^\triangle$ of a sequential decision process can be obtained recursively, as follows:

• At every observation point $k \in \mathcal{K}$, we let
$$X^\triangle_k := X^\triangle_{j_1} \times X^\triangle_{j_2} \times \cdots \times X^\triangle_{j_{n_k}}, \qquad (1)$$
where $\{j_1, j_2, \ldots, j_{n_k}\} = C_k$ are the children decision points of $k$.

• At every decision point $j \in \mathcal{J}$, we let
$$X^\triangle_j := \Big\{(\lambda_1, \ldots, \lambda_{n_j}, \lambda_1 x_{k_1}, \ldots, \lambda_{n_j} x_{k_{n_j}}) : (\lambda_1, \ldots, \lambda_{n_j}) \in \Delta^{n_j},\ x_{k_1} \in X^\triangle_{k_1},\ x_{k_2} \in X^\triangle_{k_2},\ \ldots,\ x_{k_{n_j}} \in X^\triangle_{k_{n_j}}\Big\}, \qquad (2)$$
where $\{k_1, k_2, \ldots, k_{n_j}\} = C_j$ are the children observation points of $j$.

The sequence-form strategy space for the whole sequential decision process is then $X^\triangle_r$, where $r$ is the root of the process. Crucially, $X^\triangle$ is a convex and compact set, and the expected loss of the process is a linear function over $X^\triangle$.

With the sequence-form representation, the problem of computing a Nash equilibrium in an EFG can be formulated as a bilinear saddle-point problem (BSPP). A BSPP has the form
$$\min_{x\in\mathcal{X}}\ \max_{y\in\mathcal{Y}}\ x^\top A y, \qquad (3)$$
where $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact sets. In the case of extensive-form games, $\mathcal{X} = X^\triangle$ and $\mathcal{Y} = Y^\triangle$ are the sequence-form strategy spaces of the sequential decision processes faced by the two players, and $A$ is a sparse matrix encoding the leaf payoffs of the game.

2.3. Notation when Dealing with the Extensive Form

In the rest of the paper, we will make heavy use of the sequence form and its inductive construction given in (12) and (13). We will consistently denote sequence-form strategies with a triangle superscript. As we have already observed, vectors that pertain to the sequence form have one entry for each sequence of the decision process, that is, one entry for each pair $(j, a)$ where $j \in \mathcal{J}$, $a \in A_j$. Sometimes, we will need to slice a vector $v$ and isolate only those entries that refer to all decision points $j'$ and actions $a' \in A_{j'}$ that are at or below some $j \in \mathcal{J}$; we will denote such operation as $[v]_{\downarrow j}$. Similarly, we introduce the syntax $[v]_j$ to denote the subset of $n_j = |A_j|$ entries of $v$ that pertain to all actions $a \in A_j$ at decision point $j \in \mathcal{J}$.
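As an illustration of the recursive construction (1)–(2), the following sketch (ours, not from the paper) converts a dictionary of local simplex strategies $\hat{x}_j$ into sequence-form values, assigning each sequence $(j, a)$ the product of the probabilities of the actions on its path. It reuses the hypothetical `DecisionNode`/`ObservationNode` classes sketched in Section 2.1.

```python
def to_sequence_form(root, behavioral, parent_prob=1.0, out=None):
    """Fill out[(j, a)] with the product of action probabilities on the path to (j, a).

    `root` is a DecisionNode, `behavioral` maps a node name j to {action: probability}
    (the local simplex strategy at j), and `parent_prob` is the sequence-form value of
    the last sequence on the path to `root` (1.0 at the root of the process).
    """
    if out is None:
        out = {}
    for action, obs_node in root.children.items():
        seq_value = parent_prob * behavioral[root.name][action]
        out[(root.name, action)] = seq_value
        for child_decision in obs_node.children.values():
            to_sequence_form(child_decision, behavioral, seq_value, out)
    return out

# Example with a uniform behavioral strategy for the Kuhn tree sketched earlier:
# behavioral = {"X0": {"start": 1.0}, "X1": {"check": 0.5, "raise": 0.5}, ...}
# x_seq = to_sequence_form(X0, behavioral)   # e.g. x_seq[("X4", "call")] = 1.0 * 0.5 * 0.5
```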
3. Stable-Predictive Regret Minimizers

In this paper, we operate within the online learning framework called online convex optimization (Zinkevich, 2003). In particular, we restrict our attention to a modern subtopic: predictive (also often called optimistic) regret minimization (Chiang et al., 2012; Rakhlin & Sridharan, 2013a;b). As usual in this setting, a decision maker repeatedly plays against an unknown environment by making a sequence of decisions $x^1, x^2, \ldots \in \mathcal{X} \subseteq \mathbb{R}^n$, where the set $\mathcal{X}$ of feasible decisions for the decision maker is convex and compact. The quality of each decision $x^t$ is evaluated by the linear loss $\langle \ell^t, x^t\rangle$, where $\ell^t$ is a loss vector, unknown to the decision maker until after the decision is made. The peculiarity of predictive regret minimization is that we also assume that the decision maker has access to predictions $m^1, m^2, \ldots$ of what the loss vectors $\ell^1, \ell^2, \ldots$ will be. In summary, by predictive regret minimizer we mean a device that supports the following two operations:

• it provides the next decision $x^{t+1} \in \mathcal{X}$ given a prediction $m^{t+1}$ of the next loss vector; and

• it receives/observes the loss vector $\ell^t$ used to evaluate decision $x^t$.

The learning is online in the sense that the decision maker's (that is, the device's) next decision, $x^{t+1}$, is based only on the previous decisions $x^1, \ldots, x^t$, the observed loss vectors $\ell^1, \ldots, \ell^t$, and the predictions of the past loss vectors as well as the next one, $m^1, \ldots, m^{t+1}$.

Just as in the case of a regular (that is, non-predictive) regret minimizer, the quality metric for the predictive regret minimizer is its cumulative regret, which is the difference between the loss cumulated by the sequence of decisions $x^1, \ldots, x^T$ and the loss that would have been cumulated by playing the best-in-hindsight time-independent decision $\tilde{x}$. Formally, the cumulative regret up to time $T$ is
$$R^T := \sum_{t=1}^T \langle \ell^t, x^t\rangle - \min_{\tilde{x}\in\mathcal{X}} \left\{\sum_{t=1}^T \langle \ell^t, \tilde{x}\rangle\right\}. \qquad (4)$$

We introduce a new class of predictive regret minimizers whose cumulative regret decomposes into a constant term plus a measure of the prediction quality, while maintaining stability in the sense that the iterates $x^1, \ldots, x^T$ change slowly.

Definition 1 (Stable-predictive regret minimizer). A predictive regret minimizer is $(\kappa, \alpha, \beta)$-stable-predictive if the following two conditions are met:

• Stability. The decisions produced change slowly:
$$\|x^{t+1} - x^t\| \le \kappa \quad \forall\, t \ge 1. \qquad (5)$$

• Prediction bound. For all $T$, the cumulative regret up to time $T$ is bounded according to
$$R^T \le \frac{\alpha}{\kappa} + \beta\kappa \sum_{t=1}^T \|\ell^t - m^t\|_*^2. \qquad (6)$$

In other words, small prediction errors only minimally affect the regret accumulated by the device. If, in particular, the prediction $m^t$ matches the loss vector $\ell^t$ perfectly for all $t$, the cumulative regret remains asymptotically constant.

Our notion of stable-predictivity is similar to the Regret bounded by Variation in Utilities (RVU) property given by Syrgkanis et al. (2015), which asserts that
$$R^T \le \alpha' + \beta'\sum_{t=1}^T \|\ell^t - \ell^{t-1}\|_*^2 - \gamma'\sum_{t=1}^T \|x^t - x^{t-1}\|^2. \qquad \text{(RVU)}$$
However, there are several important differences:

• Syrgkanis et al. (2015) assume that $m^t = \ell^{t-1}$; this explains the term $\|\ell^t - \ell^{t-1}\|_*^2$ in (RVU) instead of $\|\ell^t - m^t\|_*^2$ in (6). One of the reasons why we do not make assumptions on $m^t$ is that, unlike in matrix games, we will need to use modified predictions for each local regret minimizer, since we need to predict the local counterfactual loss.

• Our notion ignores the cancellation term $-\gamma'\sum_t \|x^t - x^{t-1}\|^2$; instead, we require the stability property (5).

• The coefficients in the regret bound (6) are forced to be inversely proportional, and tied to the choice of the stability parameter $\kappa$. Syrgkanis et al. (2015) show that the same correlation holds for optimistic follow-the-regularized-leader, but they do not require it in their definition of the RVU property.

Syrgkanis et al. (2015) show that their optimistic follow-the-regularized-leader (OFTRL) algorithm, as well as the variant of the mirror descent algorithm presented by Rakhlin & Sridharan (2013a), satisfy (RVU). In Section 3.2 we show that OFTRL also satisfies stable-predictivity.

3.1. Relationship with Bilinear Saddle-Point Problems

In this subsection we show how stable-predictive regret minimization can be used to solve a BSPP such as a Nash equilibrium problem in two-player zero-sum extensive-form games with perfect recall (Sections 2 and 2.2). The solutions of (3) are called saddle points. The saddle-point residual (or gap) $\xi$ of a point $(\bar{x}, \bar{y}) \in \mathcal{X}\times\mathcal{Y}$, defined as
$$\xi := \max_{\hat{y}\in\mathcal{Y}} \bar{x}^\top A\hat{y} - \min_{\hat{x}\in\mathcal{X}} \hat{x}^\top A\bar{y},$$
measures how close $(\bar{x}, \bar{y})$ is to being a saddle point (the lower the residual, the closer).

It is known that regular (non-predictive) regret minimization yields an anytime algorithm that produces a sequence of points $(\bar{x}^T, \bar{y}^T) \in \mathcal{X}\times\mathcal{Y}$ whose residuals are $\xi = O(T^{-1/2})$. Syrgkanis et al. (2015) observe that in the context of matrix games (i.e., when $\mathcal{X}$ and $\mathcal{Y}$ are simplexes), RVU minimizers that also satisfy the stability condition (5) can be used in place of regular regret minimizers to improve the convergence rate to $O(T^{-3/4})$. In what follows, we show how to extend the argument to stable-predictive regret minimizers and general bilinear saddle-point problems beyond Nash equilibria in two-player zero-sum matrix games.

A folk theorem explains the tight connection between low regret and low residual (Cesa-Bianchi & Lugosi, 2006). Specifically, by setting up two regret minimizers (one for $\mathcal{X}$ and one for $\mathcal{Y}$) that observe loss vectors given by $\ell^t_\mathcal{X} := -Ay^t$ and $\ell^t_\mathcal{Y} := A^\top x^t$, the profile of average decisions
$$\left(\frac{1}{T}\sum_{t=1}^T x^t,\ \frac{1}{T}\sum_{t=1}^T y^t\right) \in \mathcal{X}\times\mathcal{Y} \qquad (7)$$
has residual $\xi$ bounded from above according to
$$\xi \le \frac{1}{T}\left(R^T_\mathcal{X} + R^T_\mathcal{Y}\right).$$
Hence, by letting the predictions be defined as $m^t_\mathcal{X} := \ell^{t-1}_\mathcal{X}$, $m^t_\mathcal{Y} := \ell^{t-1}_\mathcal{Y}$, and assuming that the predictive regret minimizers are $(\kappa, \alpha, \beta)$-stable-predictive, we obtain that the residual $\xi$ of the average decisions (7) satisfies
$$T\xi \le \frac{2\alpha}{\kappa} + \beta\kappa\sum_{t=1}^T \|{-A}y^t + Ay^{t-1}\|_*^2 + \beta\kappa\sum_{t=1}^T \|A^\top x^t - A^\top x^{t-1}\|_*^2 \le \frac{2\alpha}{\kappa} + \beta\|A\|_{\mathrm{op}}^2\,\kappa\left(\sum_{t=1}^T \|x^t - x^{t-1}\|^2 + \sum_{t=1}^T \|y^t - y^{t-1}\|^2\right) \le \frac{2\alpha}{\kappa} + 2\beta T\|A\|_{\mathrm{op}}^2\,\kappa^3,$$
where the first inequality holds by (6), the second by noting that the operator norm $\|\cdot\|_{\mathrm{op}}$ of a linear function is equal to the operator norm of its transpose, and the third inequality by the stability condition (5). This shows that if the stability parameter $\kappa$ of the two stable-predictive regret minimizers is $\Theta(T^{-1/4})$, then the saddle-point residual is $\xi = O(T^{-3/4})$, an improvement over the bound $\xi = O(T^{-1/2})$ obtained with regular (that is, non-predictive) regret minimizers.

3.2. Optimistic Follow the Regularized Leader

Optimistic follow-the-regularized-leader (OFTRL) is a regret minimizer introduced by Syrgkanis et al. (2015). At each time $t$, OFTRL outputs the decision
$$x^t = \operatorname*{argmin}_{\tilde{x}\in\mathcal{X}} \left\{\left\langle \tilde{x},\ m^t + \sum_{\tau=1}^{t-1}\ell^\tau\right\rangle + \frac{1}{\eta}R(\tilde{x})\right\}, \qquad (8)$$
where $\eta > 0$ is a free constant and $R(\cdot)$ is a 1-strongly convex regularizer with respect to the norm $\|\cdot\|$. Furthermore, let $\Delta_R := \max_{x,y\in\mathcal{X}}\{R(x) - R(y)\}$ denote the diameter of the range of $R$, and let $\Delta_\ell := \max_t \max\{\|\ell^t\|_*, \|m^t\|_*\}$ be the maximum (dual) norm of any loss vector or prediction thereof.
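To make the update (8) concrete, here is a minimal sketch (ours, not from the original paper) of OFTRL over a probability simplex with the entropy regularizer, the regularizer used in the experiments of Section 6. With this choice the argmin in (8) has a closed-form softmax solution; the class and method names are illustrative.

```python
import numpy as np

class OFTRL:
    """Optimistic FTRL over the simplex with the entropy regularizer (a sketch of Eq. (8))."""

    def __init__(self, n_actions: int, eta: float):
        self.eta = eta
        self.cumulative_loss = np.zeros(n_actions)  # sum_{tau <= t-1} of ell^tau

    def next_decision(self, prediction: np.ndarray) -> np.ndarray:
        # argmin_x <x, m^t + L^{t-1}> + (1/eta) R(x), with R = negative entropy,
        # has the closed form  x  proportional to  exp(-eta * (m^t + L^{t-1})).
        logits = -self.eta * (prediction + self.cumulative_loss)
        logits -= logits.max()                       # numerical stability
        x = np.exp(logits)
        return x / x.sum()

    def observe_loss(self, loss: np.ndarray) -> None:
        self.cumulative_loss += loss
```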

A theorem similar to that of Syrgkanis et al. (2015, Proposition 7), which was obtained in the context of the RVU property, can be shown for the stable-predictive framework:

Theorem 1. OFTRL is a $(3\eta\Delta_\ell,\ \Delta_R,\ 1)$-stable-predictive regret minimizer.

We give a proof of Theorem 1 in the appendix. When the loss vectors are further assumed to be non-negative, it can be shown that OFTRL is $(2\eta\Delta_\ell,\ \Delta_R,\ 1)$-stable-predictive, that is, with a factor of 2 substituted for the factor of 3 in Theorem 1.

4. CFR as Regret Decomposition

In this section we offer some insights into CFR, and discuss what changes need to be made in order to leverage the power of predictive regret minimization. CFR is a framework for constructing a (non-predictive) regret minimizer $\mathcal{R}^\triangle$ that operates over the sequence-form strategy space $X^\triangle$ of a sequential decision process. In accordance with Section 2.3, we denote the decision produced by $\mathcal{R}^\triangle$ at time $t$ as $x^{\triangle,t}$; the corresponding loss function is denoted as $\ell^{\triangle,t}$.

One central idea in CFR is to define a localized notion of loss: for all $j \in \mathcal{J}$, CFR constructs the following linear counterfactual loss function $\hat{\ell}^{t,\circ}_j : \Delta^{n_j} \to \mathbb{R}$. Intuitively, the counterfactual loss $\hat{\ell}^{t,\circ}_j(x_j)$ of a local strategy $x_j \in \Delta^{n_j}$ measures the loss that the agent would face were the agent allowed to change the strategy at decision point $j$ only. In particular, $\hat{\ell}^{t,\circ}_j(x_j)$ is the loss of an agent that follows the strategy $x_j$ instead of $x^{\triangle,t}_j$ at decision point $j$, but otherwise follows the strategy $x^{\triangle,t}$ everywhere else. Formally,
$$\hat{\ell}^{t,\circ}_j : x_j = (x_{ja_1}, \ldots, x_{ja_{n_j}}) \mapsto \langle [\ell^{\triangle,t}]_j, x_j\rangle + \sum_{a\in A_j} x_{ja} \sum_{j'\in C_{ja}} \langle [\ell^{\triangle,t}]_{\downarrow j'}, [x^{\triangle,t}]_{\downarrow j'}\rangle. \qquad (9)$$
Since $\hat{\ell}^{t,\circ}_j$ is a linear function, it has a unique representation as a counterfactual loss vector $\hat{\ell}^t_j$, defined as
$$\hat{\ell}^{t,\circ}_j(x_j) = \langle \hat{\ell}^t_j, x_j\rangle \quad \forall\, x_j \in \Delta^{n_j}. \qquad (10)$$

With this local notion of loss function, a corresponding local notion of regret for a sequence of decisions $\hat{x}^1_j, \ldots, \hat{x}^T_j$, called the counterfactual regret, is defined for each decision point $j \in \mathcal{J}$:
$$\hat{R}^T_j := \sum_{t=1}^T \langle \hat{\ell}^t_j, \hat{x}^t_j\rangle - \min_{\tilde{x}_j\in\Delta^{n_j}} \sum_{t=1}^T \langle \hat{\ell}^t_j, \tilde{x}_j\rangle.$$
Intuitively, $\hat{R}^T_j$ represents the difference between the loss that was suffered for picking $\hat{x}^t_j \in \Delta^{n_j}$ and the minimum loss that could be secured by choosing a different strategy at decision point $j$ only. This is conceptually different from the definition of regret of $\mathcal{R}^\triangle$, which instead measures the difference between the loss suffered and the best loss that could have been obtained, in hindsight, by picking any strategy from the whole strategy space, with no extra constraints.

With this notion of regret, CFR instantiates one (non-stable-predictive) regret minimizer $\hat{\mathcal{R}}_j$ for each decision point $j \in \mathcal{J}$. Each local regret minimizer $\hat{\mathcal{R}}_j$ operates on the domain $\Delta^{n_j}$, that is, the space of strategies at decision point $j$ only. At each time $t$, $\mathcal{R}^\triangle$ prescribes the strategy that, at each information set $j$, behaves according to the decision of $\hat{\mathcal{R}}_j$. Similarly, any loss vector $\ell^{\triangle,t}$ input to $\mathcal{R}^\triangle$ is processed as follows: (i) first, the counterfactual loss vectors $\{\hat{\ell}^t_j\}_{j\in\mathcal{J}}$, one for each decision point $j \in \mathcal{J}$, are computed; (ii) then, each $\hat{\mathcal{R}}_j$ observes its corresponding counterfactual loss vector $\hat{\ell}^t_j$.
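The bottom-up computation in step (i) can be sketched as follows (our illustration; the tree classes are the same hypothetical ones from the earlier sketches, and the sequence-form loss `loss[(j, a)]` and strategy `x_seq[(j, a)]` are assumed to be stored per sequence).

```python
def counterfactual_loss_vectors(root, loss, x_seq, out=None):
    """Fill out[j] with the counterfactual loss vector of Eq. (9) for every decision node j.

    Returns <[loss]_{down root}, [x_seq]_{down root}>, i.e. the loss accumulated in the
    subtree rooted at `root` under the current sequence-form strategy.
    """
    if out is None:
        out = {}
    cf_vector = {}
    subtree_inner_product = 0.0
    for action in root.actions:
        below = 0.0  # sum over child decision points j' of <[loss]_{down j'}, [x_seq]_{down j'}>
        for child in root.children[action].children.values():
            below += counterfactual_loss_vectors(child, loss, x_seq, out)
        cf_vector[action] = loss[(root.name, action)] + below
        subtree_inner_product += loss[(root.name, action)] * x_seq[(root.name, action)] + below
    out[root.name] = cf_vector
    return subtree_inner_product

# cf = {}; counterfactual_loss_vectors(X0, loss, x_seq, cf)
# cf[j] is then fed to the local regret minimizer at j, per step (ii).
```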

Another way to look at CFR and counterfactual losses is as an inductive construction over subtrees. When a loss function relative to the whole sequential decision process is received by the root node, inductively each node of the sequential decision process does the following:

• If the node receiving the loss vector is an observation node, the incoming loss vector is partitioned and forwarded to each child decision node. The partition of the loss vector is done so as to ensure that only entries relevant to each subtree are received down the tree.

• If the node receiving the loss vector is a decision node, the incoming loss vector is first forwarded as-is to each of the child observation points, and then it is used to construct the counterfactual loss vector $\hat{\ell}^t_j$, which is input into $\hat{\mathcal{R}}_j$.

This alternative point of view differs from the original one, but has been recently used by Farina et al. (2018; 2019) to simplify the analysis of the algorithm. When viewed from the above point of view, CFR is recursively building—in a bottom-up fashion—regret minimizers for each subtree starting from child subtrees.

In accordance with our convention (Section 2.3), we denote by $\mathcal{R}^\triangle_v$, for $v \in \mathcal{J}\cup\mathcal{K}$, the regret minimizer that operates on $X^\triangle_v$ by only considering the local regret minimizers in the subtree rooted at vertex $v$ of the sequential decision process. Analogously, we will denote with $R^{\triangle,T}_v$ the regret of $\mathcal{R}^\triangle_v$ up to time $T$, and with $\ell^{\triangle,t}_v$ the loss function entering $\mathcal{R}^\triangle_v$ at time $t$. In accordance with the above construction, for all $k \in \mathcal{K}$ and $j \in \mathcal{J}$, we have that
$$\ell^{\triangle,t}_j = [\ell^{\triangle,t}_k]_{\downarrow j}\ \ \forall\, j \in C_k, \qquad \ell^{\triangle,t}_k = [\ell^{\triangle,t}_j]_{\downarrow k}\ \ \forall\, k \in C_j. \qquad (11)$$
Finally, we denote the decisions produced by $\mathcal{R}^\triangle_v$ at time $t$ as $x^{\triangle,t}_v$. As per our discussion above, the decisions produced by $\mathcal{R}^\triangle$ are tied together inductively according to
$$x^{\triangle,t}_k = \big(x^{\triangle,t}_{j_1}, \ldots, x^{\triangle,t}_{j_{n_k}}\big) \quad \forall\, k \in \mathcal{K}, \qquad (12)$$
where $\{j_1, \ldots, j_{n_k}\} = C_k$, and
$$x^{\triangle,t}_j = \big(\hat{x}^t_j,\ \hat{x}^t_{ja_1}x^{\triangle,t}_{\rho(j,a_1)},\ \ldots,\ \hat{x}^t_{ja_{n_j}}x^{\triangle,t}_{\rho(j,a_{n_j})}\big) \quad \forall\, j \in \mathcal{J}, \qquad (13)$$
where $\{a_1, \ldots, a_{n_j}\} = A_j$.

The following two lemmas can be easily extracted from Farina et al. (2018). A proof is presented in the appendix.

Lemma 1. For all $k \in \mathcal{K}$, $\displaystyle R^{\triangle,T}_k = \sum_{j\in C_k} R^{\triangle,T}_j$.

Lemma 2. For all $j \in \mathcal{J}$, $\displaystyle R^{\triangle,T}_j \le \hat{R}^T_j + \max_{k\in C_j} R^{\triangle,T}_k$.

The two lemmas above do not make any assumption about the nature of the (localized) regret minimizers $\hat{\mathcal{R}}_j$, and therefore they are applicable even when the $\hat{\mathcal{R}}_j$ are predictive or, specifically, stable-predictive.

5. Stable-Predictive Counterfactual Regret Minimization

Our proposed algorithm behaves exactly like CFR, with the notable difference that our local regret minimizers $\hat{\mathcal{R}}_j$ are stable-predictive and chosen to have specific stability parameters. Furthermore, the predictions $m^t_j$ for each local regret minimizer $\hat{\mathcal{R}}_j$ are chosen so as to leverage the predictivity property of the regret minimizers. Given a desired value of $\kappa^* > 0$, by choosing the stability parameters and predictions as we will detail later, we can guarantee that $\mathcal{R}^\triangle$ is a $(\kappa^*, O(1), O(1))$-stable-predictive regret minimizer.¹

¹Throughout the paper, our asymptotic notation is always with respect to the number of iterations $T$.

5.1. Choice of Stability Parameters

We use the following scheme to pick the stability parameter of $\hat{\mathcal{R}}_j$. First, we associate a scalar $\gamma_v$ to each node $v \in \mathcal{J}\cup\mathcal{K}$ of the sequential decision process. The value $\gamma_r$ of the root decision node is set to $\kappa^*$, and the value for each other node $v$ is set relative to the value $\gamma_u$ of its parent $u$:
$$\gamma_v := \frac{\gamma_u}{2\sqrt{n_u}}\ \text{ if } u \in \mathcal{J}, \qquad \gamma_v := \frac{\gamma_u}{\sqrt{n_u}}\ \text{ if } u \in \mathcal{K}. \qquad (14)$$
The stability parameter of each decision point $j \in \mathcal{J}$ is
$$\kappa_j := \frac{\gamma_j}{2\sqrt{n_j}\,B_j}, \qquad (15)$$
where $B_j$ is an upper bound on the 2-norm of any vector in $X^\triangle_j$. A suitable value of $B_j$ can be found by recursively using the following rules: for all $k \in \mathcal{K}$ and $j \in \mathcal{J}$,
$$B_k = \sqrt{\sum_{j'\in C_k} B_{j'}^2}, \qquad B_j = \sqrt{1 + \max_{k'\in C_j} B_{k'}^2}. \qquad (16)$$

At each decision point $j$, any stable-predictive regret minimizer that is able to guarantee the above stability parameter can be used. For example, one can use OFTRL where the stepsize is chosen appropriately. As an example, assuming that all loss vectors involved have (dual) norm bounded by 1/3, we can simply set the stepsize $\eta$ of the local OFTRL regret minimizer $\hat{\mathcal{R}}_j$ at decision point $j$ to be $\eta = \kappa_j$.
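The recursions (14)–(16) can be sketched as follows (ours, with illustrative names): a bottom-up pass computes the norm bounds $B_j$, a top-down pass assigns $\gamma_v$, and (15) combines them into the local stability parameter $\kappa_j$, which can serve as the OFTRL stepsize under the bounded-loss assumption above.

```python
import math

def norm_bound(j):
    """B_j from Eq. (16): bottom-up bound on the 2-norm of any vector in X_j (sequence form)."""
    best_obs = 0.0
    for obs in j.children.values():                               # observation points k' in C_j
        b_k = math.sqrt(sum(norm_bound(j2) ** 2 for j2 in obs.children.values()))
        best_obs = max(best_obs, b_k ** 2)
    return math.sqrt(1.0 + best_obs)

def stability_parameters(j, gamma_j, out=None):
    """Top-down pass for gamma (Eq. (14)) and kappa_j (Eq. (15)); gamma at the root is kappa*."""
    if out is None:
        out = {}
    n_j = len(j.actions)
    out[j.name] = gamma_j / (2.0 * math.sqrt(n_j) * norm_bound(j))   # kappa_j
    gamma_obs = gamma_j / (2.0 * math.sqrt(n_j))                     # child of a decision node
    for obs in j.children.values():
        n_k = max(len(obs.children), 1)
        gamma_child = gamma_obs / math.sqrt(n_k)                     # child of an observation node
        for j2 in obs.children.values():
            stability_parameters(j2, gamma_child, out)
    return out

# kappas = stability_parameters(X0, gamma_j=0.25)   # e.g. kappa* = T**(-1/4) with T = 256
```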

5.2. Prediction of Counterfactual Loss Vectors

Let $m^{\triangle,t}$ be the prediction received by $\mathcal{R}^\triangle$, concerning the future loss vector $\ell^{\triangle,t}$. We will show how to process the prediction and produce counterfactual prediction vectors $\hat{m}^t_j$ (one for each decision point $j \in \mathcal{J}$) for each local stable-predictive regret minimizer $\hat{\mathcal{R}}_j$.

Following the construction of the counterfactual loss functions defined in (9), for each decision point $j \in \mathcal{J}$ we define the counterfactual prediction function $\hat{m}^{t,\circ}_j : \Delta^{n_j} \to \mathbb{R}$ as
$$\hat{m}^{t,\circ}_j : \Delta^{n_j} \ni x_j = (x_{ja_1}, \ldots, x_{ja_{n_j}}) \mapsto \langle [m^{\triangle,t}]_j, x_j\rangle + \sum_{a\in A_j} x_{ja} \sum_{j'\in C_{ja}} \langle [m^{\triangle,t}]_{\downarrow j'}, [x^{\triangle,t}]_{\downarrow j'}\rangle.$$

Observation. It is important to observe that the counterfactual prediction function $\hat{m}^{t,\circ}_j$ depends on the decisions produced at time $t$ in the subtree rooted at $j$. In other words, in order to construct the prediction for what loss $\hat{\mathcal{R}}_j$ will observe after producing the decision $x^t_j$, we use the "future" decisions $x^t_{ja}$ from the subtrees below $j \in \mathcal{J}$.

Similarly to what is done for the counterfactual loss function, we define the counterfactual loss prediction vector $\hat{m}^t_j$ as the (unique) vector in $\mathbb{R}^{n_j}$ such that
$$\hat{m}^{t,\circ}_j(x_j) = \langle \hat{m}^t_j, x_j\rangle \quad \forall\, x_j \in \Delta^{n_j}. \qquad (17)$$

5.3. Proof of Correctness

We will prove that our choice of stability parameters (14) and (localized) counterfactual loss predictions (17) guarantee that $\mathcal{R}^\triangle$ is a $(\kappa^*, O(1), O(1))$-stable-predictive regret minimizer. Our proof is by induction on the sequential decision process structure: we prove that our choices yield a $(\gamma_v, O(1), O(1))$-stable-predictive regret minimizer in the sub-sequential decision process rooted at each possible node $v \in \mathcal{J}\cup\mathcal{K}$. For observation nodes $v \in \mathcal{K}$ the inductive step is performed via Lemma 3, while for decision nodes $v \in \mathcal{J}$ the inductive step is performed via Lemma 4. The proof of both lemmas can be found in the appendix.

Lemma 3. Let $k \in \mathcal{K}$ be an observation node, and assume that $\mathcal{R}^\triangle_j$ is a $(\gamma_j, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_j$ for each $j \in C_k$. Then, $\mathcal{R}^\triangle_k$ is a $(\gamma_k, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_k$.

Lemma 4. Let $j \in \mathcal{J}$ be a decision node, and assume that $\mathcal{R}^\triangle_k$ is a $(\gamma_k, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_k$ for each $k \in C_j$. Suppose further that $\hat{\mathcal{R}}_j$ is a $(\kappa_j, O(1), O(1))$-stable-predictive regret minimizer over the simplex $\Delta^{n_j}$. Then, $\mathcal{R}^\triangle_j$ is a $(\gamma_j, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_j$.

Putting together Lemma 3 and Lemma 4, and using induction on the sequential decision process structure, we obtain the following formal statement.

Corollary 1. Let $\kappa^* > 0$. If:

1. each localized regret minimizer $\hat{\mathcal{R}}_j$ is $(\kappa_j, O(1), O(1))$-stable-predictive and produces decisions over the local (simplex) action space $\Delta^{n_j}$, where $\kappa_j$ is as in (15);

2. $\hat{\mathcal{R}}_j$ observes the counterfactual loss prediction $\hat{m}^t_j$ as defined in (17); and

3. $\hat{\mathcal{R}}_j$ observes the counterfactual loss vectors $\hat{\ell}^t_j$ as defined in (10),

then $\mathcal{R}^\triangle$ is a $(\kappa^*, O(1), O(1))$-stable-predictive regret minimizer that acts over the sequence-form strategy space $X^\triangle$.

By combining the above result with the arguments of Section 3.1, we conclude that by constructing two $(\Theta(T^{-1/4}), O(1), O(1))$-stable-predictive regret minimizers, one per player, using the construction above, we obtain an algorithm that can approximate a Nash equilibrium: at time $T$ the average strategy profile is an $O(T^{-3/4})$-Nash equilibrium of the two-player zero-sum game.
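As a summary of how the pieces fit together, the following high-level sketch (ours; the regret-minimizer objects and their methods are assumed to exist as in the earlier sketches) shows the resulting self-play loop: each player runs the stable-predictive CFR construction, predictions are set to the previous loss vector, and the average sequence-form strategies are returned, following the loss convention $\ell^t_\mathcal{X} := -Ay^t$, $\ell^t_\mathcal{Y} := A^\top x^t$ of Section 3.1.

```python
import numpy as np

def stable_predictive_cfr_selfplay(A, player_x, player_y, T):
    """Self-play sketch: A is the payoff matrix over sequence pairs; player_x / player_y
    are regret minimizers over the two sequence-form spaces, each exposing
    next_decision(prediction) and observe_loss(loss)."""
    x_avg, y_avg = 0.0, 0.0
    loss_x = np.zeros(A.shape[0])        # prediction used at t = 1 (an arbitrary choice)
    loss_y = np.zeros(A.shape[1])
    for t in range(1, T + 1):
        x = player_x.next_decision(prediction=loss_x)   # m^t := ell^{t-1}
        y = player_y.next_decision(prediction=loss_y)
        loss_x, loss_y = -(A @ y), A.T @ x              # ell_X^t := -A y^t,  ell_Y^t := A^T x^t
        player_x.observe_loss(loss_x)
        player_y.observe_loss(loss_y)
        x_avg, y_avg = x_avg + (x - x_avg) / t, y_avg + (y - y_avg) / t
    return x_avg, y_avg   # average strategies; their saddle-point gap is O(T^{-3/4})
```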
6. Experiments

Our techniques are evaluated in the benchmark domain of heads-up no-limit Texas hold'em poker (HUNL).

In HUNL, two players P1 and P2 each start the game with $20,000. The position of the players switches after each hand. The players alternate taking turns and may choose to either fold, call, or raise on their turn. Folding results in the player losing and the money in the pot being awarded to the other player. Calling means the player places a number of chips in the pot equal to the opponent's share. Raising means the player adds more chips to the pot than the opponent's share. There are four betting rounds in the game. A round ends when both players have acted at least once and the most recent player has called. Players cannot raise beyond the $20,000 they start with. All raises must be at least $100 and at least as large as any previous raise in that round.

At the start of the game P1 must place $100 in the pot and P2 must place $50 in the pot. Both players are then dealt two cards that only they observe from a 52-card deck. A round of betting then occurs starting with P2. P1 will be the first to act in all subsequent betting rounds. Upon completion of the first betting round, three community cards are dealt face up. After the second betting round is over, another community card is dealt face up. Finally, after that betting round one more community card is revealed and a final betting round occurs. If no player has folded, then the player with the best five-card poker hand, out of their two private cards and the five community cards, wins the pot. The pot is split evenly if there is a tie.

The most competitive agents for HUNL solve portions of the game (referred to as subgames) in real time during play (Brown & Sandholm, 2017a; Moravčík et al., 2017; Brown & Sandholm, 2017b; Brown et al., 2018). For example, Libratus solved in real time the remainder of HUNL starting on the third betting round. We conduct our experiments on four open-source subgames solved by Libratus in real time during its competition against top humans in HUNL.² Following prior convention, we use the bet sizes of 0.5× the size of the pot, 1× the size of the pot, and the all-in bet for the first bet of each round. For subsequent bets in a round, we consider 1× the pot and the all-in bet.

²https://github.com/CMU-EM/LibratusEndgames

Subgames 1 and 2 occur over the third and fourth betting rounds. Subgame 1 has $500 in the pot at the start of the game while Subgame 2 has $4,780. Subgames 3 and 4 occur over only the fourth betting round. Subgame 3 has $500 in the pot at the start of the game while Subgame 4 has $3,750. We measure exploitability in terms of the standard metric: milli big blinds per game (mbb/g), which is the number of big blinds (P1's original contribution to the pot) lost per hand of poker, multiplied by 1,000, and is the standard measurement of win rate in the related literature.

We compare the performance of three algorithms: vanilla CFR (i.e., CFR with regret matching; labeled CFR in plots), the current state-of-the-art algorithm in practice, Discounted CFR (Brown & Sandholm, 2019) (labeled DCFR in plots), and our stable-predictive variant of CFR with OFTRL at each decision point (labeled OFTRL in plots). DCFR was set up with parameters $(\alpha, \beta, \gamma) = (1.5, 0, 2)$, as recommended in the original DCFR paper. For OFTRL we use the stepsize that the theory suggests in our experiments on subgames 3 and 4 (labeled OFTRL theory). For subgames 1 and 2 we found that the theoretically correct stepsize is much too conservative, so we also show results with a less conservative parameter found by dividing the stepsize by 10, 100, and 1000, and picking the best among those (labeled OFTRL tuned). For all games we show two plots: one where all algorithms use simultaneous updates, as CFR traditionally uses, and one where all algorithms use alternating updates, a practical change that usually leads to better performance.

Figure 2 shows the results for simultaneous updates on subgames 2 and 4, while Figure 4 in the appendix shows subgames 1 and 3. In the smaller subgames 3 and 4 we find that OFTRL with the stepsize set according to our theory outperforms CFR: in subgame 4 almost immediately and significantly, in subgame 3 only after roughly 800 iterations. In contrast to this, we find that in the larger subgames 1 and 2 the OFTRL stepsize is much too conservative, and the algorithm barely starts to make progress within the number of iterations that we run. With a moderately hand-tuned stepsize OFTRL beats CFR somewhat significantly. In all games DCFR performs better than OFTRL, and also significantly better than its theory predicts. This is not too surprising, as both CFR+ and the improved DCFR are known to significantly outperform their theoretical convergence rate in practice.

Figure 2. Convergence rate with iterations on the x-axis, and the exploitability in mbb. All algorithms use simultaneous updates. (Panels: Subgame 2 and Subgame 4, simultaneous updates; y-axis: saddle point gap (mbb/g) against iteration.)

Figure 3 shows the results for alternating updates on subgames 2 and 4, while subgames 1 and 3 are given in the appendix in Figure 5. In the alternating-updates setting OFTRL performs worse relative to CFR and DCFR. In subgame 1 OFTRL with stepsizes set according to the theory slightly outperforms CFR, but in subgame 2 they have near-identical performance. In subgames 3 and 4 even the manually tuned variant performs worse than CFR, although we suspect that it is possible to improve on this with a better choice of stepsize parameter. In the alternating setting DCFR performs significantly better than all other algorithms.

Figure 3. Convergence rate with iterations on the x-axis, and the exploitability in mbb. All algorithms use alternating updates. (Panels: Subgame 2 and Subgame 4, alternating updates; y-axis: saddle point gap (mbb/g) against iteration.)
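The exploitability reported on the y-axes is the saddle-point gap of Section 3.1 expressed in mbb/g, using the $100 big blind that P1 posts. Below is a small sketch (ours, with assumed helper names) of how such a measurement could be taken for a matrix-form description of a game; for the sequence-form subgames used in the experiments the two inner optimizations would instead be best-response tree traversals.

```python
import numpy as np

def saddle_point_gap(A, x_avg, y_avg):
    """Gap xi = max_y x_avg^T A y - min_x x^T A y_avg when x and y range over simplexes."""
    best_response_payoff_for_y = np.max(A.T @ x_avg)   # maximum is attained at a simplex vertex
    best_response_loss_for_x = np.min(A @ y_avg)
    return best_response_payoff_for_y - best_response_loss_for_x

def gap_in_mbb_per_game(gap_in_dollars, big_blind=100.0):
    """Convert a gap measured in dollars per game to milli big blinds per game (mbb/g)."""
    return 1000.0 * gap_in_dollars / big_blind
```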

7. Conclusions

We developed the first variant of CFR that converges at a rate better than $O(T^{-1/2})$. In particular, we extend the ideas of predictability and stability for optimistic regret minimization on matrix games to the setting of EFGs. In doing so we showed that stable-predictive simplex regret minimizers can be aggregated to form a stable-predictive variant of CFR for sequential decision making, and we showed that this leads to a convergence rate of $O(T^{-3/4})$ for solving two-player zero-sum EFGs. Our result makes a first step towards reconciling the gap between the theoretical rate at which CFR converges and the $O(T^{-1})$ rate at which first-order methods converge.

Experimentally we showed that our CFR variant can outperform CFR on some games, but that the choice of stepsize is important, while we find that DCFR is faster in practice. An important direction for future work is to find variants of our algorithm that still satisfy the theoretical guarantee and perform even better in practice.

Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, and CCF-1733556, and the ARO under award W911NF-17-1-0082. Noam Brown is also sponsored by an Open Philanthropy Project AI Fellowship and a Tencent AI Lab Fellowship.

References

Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold'em poker is solved. Science, 347(6218), January 2015.

Brown, N. and Sandholm, T. Safe and nested subgame solving for imperfect-information games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pp. 689–699, 2017a.

Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, Dec. 2017b.

Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

Brown, N., Sandholm, T., and Amos, B. Depth-limited solving for imperfect-information games. In Neural Information Processing Systems, NeurIPS 2018, 2018.

Burch, N. Time and Space: Why Imperfect Information Games are Hard. PhD thesis, University of Alberta, 2017.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Conference on Learning Theory, pp. 6–1, 2012.

Farina, G., Kroer, C., and Sandholm, T. Composability of regret minimizers. arXiv e-prints, art. arXiv:1811.02540, November 2018.

Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

Hoda, S., Gilpin, A., Peña, J., and Sandholm, T. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.

Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

Kroer, C., Farina, G., and Sandholm, T. Solving large sequential games with the excessive gap technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2018a.

Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, pp. 1–33, 2018b.

Kuhn, H. W. A simplified two-person poker. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games, volume 1 of Annals of Mathematics Studies, 24, pp. 97–103. Princeton University Press, Princeton, New Jersey, 1950.

Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In COLT (Conference on Learning Theory) workshop on Online Learning with Limited Feedback, 2009.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), May 2017.

Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1), 2004.

Nesterov, Y. Excessive gap technique in nonsmooth convex minimization. SIAM Journal of Optimization, 16(1), 2005.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory, pp. 993–1019, 2013a.

Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pp. 3066–3074, 2013b.

Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pp. 2989–2997, 2015.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

von Stengel, B. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.

Wang, J.-K. and Abernethy, J. D. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems, pp. 3828–3838, 2018.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928–936, Washington, DC, USA, 2003.

Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.

A. Proofs

A.1. Optimistic Follow-the-Regularized-Leader

We offer a proof of Theorem 1. First, we introduce the following argmin-function:

$$\tilde{x} : L \mapsto \operatorname*{argmin}_{x\in\mathcal{X}} \left\{\langle x, L\rangle + \frac{1}{\eta}R(x)\right\}. \qquad (18)$$

Furthermore, let $L^t := \sum_{\tau=1}^t \ell^\tau$. With this notation, the decisions produced by OFTRL, as defined in (8), can be expressed as $x^t = \tilde{x}(L^{t-1} + m^t)$.

Continuity of the argmin-function. The first step in the proof is to study the continuity of the argmin-function $\tilde{x}$. Intuitively, the role of the regularizer $R$ is to smooth out the linear objective function $\langle \cdot, L\rangle$. So, it seems only reasonable to expect that, the higher the constant that multiplies $R$, the less the argmin $\tilde{x}(L)$ is affected by small changes of $L$. In fact, the following holds:

Lemma 5. The argmin-function $\tilde{x}$ is $\eta$-Lipschitz continuous with respect to the dual norm, that is,
$$\|\tilde{x}(L) - \tilde{x}(L')\| \le \eta\|L - L'\|_*.$$

Proof. The variational inequality for the optimality of $\tilde{x}(L)$ implies
$$\left\langle L + \frac{1}{\eta}\nabla R(\tilde{x}(L)),\ \tilde{x}(L') - \tilde{x}(L)\right\rangle \ge 0. \qquad (19)$$
Symmetrically for $\tilde{x}(L')$, we find that
$$\left\langle L' + \frac{1}{\eta}\nabla R(\tilde{x}(L')),\ \tilde{x}(L) - \tilde{x}(L')\right\rangle \ge 0. \qquad (20)$$
Summing inequalities (19) and (20), we obtain
$$\frac{1}{\eta}\left\langle \nabla R(\tilde{x}(L)) - \nabla R(\tilde{x}(L')),\ \tilde{x}(L) - \tilde{x}(L')\right\rangle \le \left\langle L' - L,\ \tilde{x}(L) - \tilde{x}(L')\right\rangle.$$
Using strong convexity of $R(\cdot)$ on the left-hand side and the generalized Cauchy-Schwarz inequality on the right-hand side, we obtain
$$\frac{1}{\eta}\|\tilde{x}(L) - \tilde{x}(L')\|^2 \le \|\tilde{x}(L) - \tilde{x}(L')\|\,\|L - L'\|_*,$$
and dividing by $\|\tilde{x}(L) - \tilde{x}(L')\|$ we obtain the Lipschitz continuity of the argmin-function $\tilde{x}$.

A direct consequence of Lemma 5 is the following corollary, which measures the stability (small step size) of the decisions output by OFTRL:

Corollary 2. At each time $t$, the iterates produced by OFTRL satisfy $\|x^t - x^{t-1}\| \le 3\eta\Delta_\ell$.

Proof.
$$\|x^t - x^{t-1}\| = \left\|\tilde{x}(L^{t-1} + m^t) - \tilde{x}(L^{t-2} + m^{t-1})\right\| \le \eta\,\|\ell^{t-1} + m^t - m^{t-1}\|_* \le 3\eta\Delta_\ell,$$
where the first inequality holds by Lemma 5 and the second one by the definition of $\Delta_\ell$ and the triangle inequality.

The rest of the proof, specifically the predictivity parameters $\alpha$ and $\beta$ of OFTRL, follows directly from the proof of Theorem 19 in the appendix of Syrgkanis et al. (2015).
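As a quick numeric sanity check of Lemma 5 and Corollary 2 (ours, not part of the paper), the sketch below instantiates the argmin-function (18) on the simplex with the entropy regularizer, for which the argmin is a softmax, and verifies the $\eta$-Lipschitz bound with the $(\ell_1, \ell_\infty)$ dual norm pair on random inputs.

```python
import numpy as np

def argmin_fn(L, eta):
    """x~(L) of Eq. (18) on the simplex with R = negative entropy: a softmax of -eta * L."""
    z = np.exp(-eta * (L - L.max()))
    return z / z.sum()

rng = np.random.default_rng(0)
eta, n = 0.1, 5
worst_ratio = 0.0
for _ in range(1000):
    L, Lp = rng.normal(size=n), rng.normal(size=n)
    lhs = np.abs(argmin_fn(L, eta) - argmin_fn(Lp, eta)).sum()   # ell_1 norm of the change
    rhs = eta * np.abs(L - Lp).max()                             # eta * ell_infinity (dual) norm
    worst_ratio = max(worst_ratio, lhs / rhs)
print(worst_ratio <= 1.0)   # expected to print True, consistent with Lemma 5
```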

A.2. Regret Bounds

Lemma 1. For all $k \in \mathcal{K}$, $\displaystyle R^{\triangle,T}_k = \sum_{j\in C_k} R^{\triangle,T}_j$.

Proof. By definition of $R^{\triangle,T}_k$,
$$R^{\triangle,T}_k = \sum_{t=1}^T \langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle - \min_{\tilde{x}^\triangle_k\in X^\triangle_k} \sum_{t=1}^T \langle \ell^{\triangle,t}_k, \tilde{x}^\triangle_k\rangle.$$
By using (12) and (11), we can break the dot products and the minimization problem into independent parts, one for each $j \in C_k$:
$$R^{\triangle,T}_k = \sum_{j\in C_k}\sum_{t=1}^T \langle \ell^{\triangle,t}_j, x^{\triangle,t}_j\rangle - \min_{\tilde{x}^\triangle_j\in X^\triangle_j} \sum_{j\in C_k}\sum_{t=1}^T \langle \ell^{\triangle,t}_j, \tilde{x}^\triangle_j\rangle = \sum_{j\in C_k}\left(\sum_{t=1}^T \langle \ell^{\triangle,t}_j, x^{\triangle,t}_j\rangle - \min_{\tilde{x}^\triangle_j\in X^\triangle_j}\sum_{t=1}^T \langle \ell^{\triangle,t}_j, \tilde{x}^\triangle_j\rangle\right) = \sum_{j\in C_k} R^{\triangle,T}_j,$$
as we wanted to show.

Lemma 2. For all $j \in \mathcal{J}$, $\displaystyle R^{\triangle,T}_j \le \hat{R}^T_j + \max_{k\in C_j} R^{\triangle,T}_k$.

Proof. By definition of $R^{\triangle,T}_j$,
$$R^{\triangle,T}_j = \sum_{t=1}^T \langle \ell^{\triangle,t}_j, x^{\triangle,t}_j\rangle - \min_{\tilde{x}^\triangle_j\in X^\triangle_j} \sum_{t=1}^T \langle \ell^{\triangle,t}_j, \tilde{x}^\triangle_j\rangle.$$
By combining (13) and (11), we can break the dot products and the minimization problem into independent parts, one for each $k \in C_j$, as well as a part that depends solely on $\hat{x}_j$:
$$R^{\triangle,T}_j = \sum_{t=1}^T\Bigg(\langle [\ell^{\triangle,t}_j]_j, \hat{x}^t_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \hat{x}^t_{ja}\langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle\Bigg) - \min_{\tilde{x}_j\in\Delta^{n_j}}\Bigg\{\sum_{t=1}^T \langle [\ell^{\triangle,t}_j]_j, \tilde{x}_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \tilde{x}_{ja}\ \min_{\tilde{x}^\triangle_k\in X^\triangle_k}\sum_{t=1}^T \langle \ell^{\triangle,t}_k, \tilde{x}^\triangle_k\rangle\Bigg\}$$
$$= \sum_{t=1}^T\Bigg(\langle [\ell^{\triangle,t}_j]_j, \hat{x}^t_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \hat{x}^t_{ja}\langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle\Bigg) - \min_{\tilde{x}_j\in\Delta^{n_j}}\Bigg\{\sum_{t=1}^T \langle [\ell^{\triangle,t}_j]_j, \tilde{x}_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \tilde{x}_{ja}\bigg(-R^{\triangle,T}_k + \sum_{t=1}^T \langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle\bigg)\Bigg\}$$
$$\le \sum_{t=1}^T\Bigg(\langle [\ell^{\triangle,t}_j]_j, \hat{x}^t_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \hat{x}^t_{ja}\langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle\Bigg) - \min_{\tilde{x}_j\in\Delta^{n_j}}\Bigg\{\sum_{t=1}^T \langle [\ell^{\triangle,t}_j]_j, \tilde{x}_j\rangle + \sum_{\substack{a\in A_j\\ k=\rho(j,a)}} \tilde{x}_{ja}\sum_{t=1}^T \langle \ell^{\triangle,t}_k, x^{\triangle,t}_k\rangle\Bigg\} + \max_{\tilde{x}_j\in\Delta^{n_j}}\sum_{a\in A_j}\tilde{x}_{ja}\,R^{\triangle,T}_{\rho(j,a)},$$
where the equality follows by the definition of $R^{\triangle,T}_k$, and the inequality follows from breaking the minimization of a sum into a sum of minimization problems. By identifying the difference between the first two terms as the counterfactual regret $\hat{R}^T_j$ (that is, the regret of $\hat{\mathcal{R}}_j$ up to time $T$), we obtain
$$R^{\triangle,T}_j \le \hat{R}^T_j + \max_{\tilde{x}_j\in\Delta^{n_j}}\sum_{a\in A_j}\tilde{x}_{ja}\,R^{\triangle,T}_{\rho(j,a)} = \hat{R}^T_j + \max_{k\in C_j} R^{\triangle,T}_k,$$
as we wanted to show.

A.3. Stable-Predictive Regret Minimizer

We will prove both Lemma 3 and Lemma 4 with respect to the 2-norm. This does not come at the cost of generality, since all norms are equivalent on finite-dimensional vector spaces, that is, for every choice of norm $\|\cdot\|$, there exist constants $m, M > 0$ such that for all $x$, $m\|x\| \le \|x\|_2 \le M\|x\|$.

Lemma 3. Let $k \in \mathcal{K}$ be an observation node, and assume that $\mathcal{R}^\triangle_j$ is a $(\gamma_j, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_j$ for each $j \in C_k$. Then, $\mathcal{R}^\triangle_k$ is a $(\gamma_k, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_k$.

Proof. By hypothesis, for all $j \in C_k$ we have
$$R^{\triangle,T}_j \le \frac{O(1)}{\gamma_j} + O(1)\,\gamma_j\sum_{t=1}^T \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2 \qquad (21)$$
and
$$\|x^{\triangle,t}_j - x^{\triangle,t-1}_j\|_2 \le \gamma_j, \qquad (22)$$
where $x^{\triangle,t}_j$ is the decision output by $\mathcal{R}^\triangle_j$ at time $t$. Substituting (21) into the regret bound of Lemma 1:
$$R^{\triangle,T}_k \le O(1)\sum_{j\in C_k}\frac{1}{\gamma_j} + O(1)\sum_{j\in C_k}\gamma_j\sum_{t=1}^T \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2 \le O(1)\frac{n_k^{3/2}}{\gamma_k} + O(1)\frac{\gamma_k}{\sqrt{n_k}}\sum_{t=1}^T\sum_{j\in C_k} \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2 = \frac{O(1)}{\gamma_k} + O(1)\,\gamma_k\sum_{t=1}^T \|\ell^{\triangle,t}_k - m^{\triangle,t}_k\|_2^2, \qquad (23)$$
where the second inequality comes from substituting the value $\gamma_j = \gamma_k/\sqrt{n_k}$ as per (14), and the equality comes from the fact that the $\ell^{\triangle,t}_j$ and $m^{\triangle,t}_j$ form a partition of the vectors $\ell^{\triangle,t}_k$ and $m^{\triangle,t}_k$, respectively.

We now analyze the stability properties of $\mathcal{R}^\triangle_k$:
$$\|x^{\triangle,t}_k - x^{\triangle,t-1}_k\|_2 = \sqrt{\sum_{j\in C_k}\|x^{\triangle,t}_j - x^{\triangle,t-1}_j\|_2^2} \le \sqrt{\sum_{j\in C_k}\gamma_j^2} = \gamma_k,$$
where the first equality follows from (1), the inequality holds by (22), and the second equality holds by substituting the value $\gamma_j = \gamma_k/\sqrt{n_k}$ as per (14). This shows that $\mathcal{R}^\triangle_k$ is $\gamma_k$-stable. Combining this with the predictivity bound (23) above, we obtain the claim.

Lemma 4. Let $j \in \mathcal{J}$ be a decision node, and assume that $\mathcal{R}^\triangle_k$ is a $(\gamma_k, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_k$ for each $k \in C_j$. Suppose further that $\hat{\mathcal{R}}_j$ is a $(\kappa_j, O(1), O(1))$-stable-predictive regret minimizer over the simplex $\Delta^{n_j}$. Then, $\mathcal{R}^\triangle_j$ is a $(\gamma_j, O(1), O(1))$-stable-predictive regret minimizer over the sequence-form strategy space $X^\triangle_j$.

Proof. By hypothesis, for all $k \in C_j$ we have
$$R^{\triangle,T}_k \le \frac{O(1)}{\gamma_k} + O(1)\,\gamma_k\sum_{t=1}^T \|\ell^{\triangle,t}_k - m^{\triangle,t}_k\|_2^2 \qquad (24)$$
and
$$\|x^{\triangle,t}_k - x^{\triangle,t-1}_k\|_2 \le \gamma_k. \qquad (25)$$
We substitute (24) into the regret bound of Lemma 2. The key observation is that the loss vectors—and their predictions—entering the subtree rooted at $k$ ($k \in C_j$) are simply forwarded from $j$; with this, we obtain:
$$R^{\triangle,T}_j \le \hat{R}^T_j + \frac{O(1)}{\gamma_k} + O(1)\,\gamma_k\sum_{t=1}^T \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2. \qquad (26)$$
On the other hand, by hypothesis $\hat{\mathcal{R}}_j$ is a $(\kappa_j, O(1), O(1))$-stable-predictive regret minimizer. Hence,
$$\hat{R}^T_j \le \frac{O(1)}{\kappa_j} + O(1)\,\kappa_j\sum_{t=1}^T \|\hat{\ell}^t_j - \hat{m}^t_j\|_2^2 = \frac{O(1)}{\gamma_j} + O(1)\,\gamma_j\sum_{t=1}^T \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2, \qquad (27)$$
where the equality comes from the definition of $\kappa_j$ (Equation (15)) and the fact that
$$\|\hat{\ell}^t_j - \hat{m}^t_j\|_2^2 \le \sum_{k\in C_j}\|x^{\triangle,t}_k\|_2^2\cdot\|\ell^{\triangle,t}_k - m^{\triangle,t}_k\|_2^2 \le \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2\sum_{k\in C_j}B_k^2 = O(1)\,\|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2.$$
By substituting (27) into (26) and noting that $\gamma_k = O(1)\,\gamma_j$, we obtain
$$R^{\triangle,T}_j \le \frac{O(1)}{\gamma_j} + O(1)\,\gamma_j\sum_{t=1}^T \|\ell^{\triangle,t}_j - m^{\triangle,t}_j\|_2^2,$$
which establishes the predictivity of $\mathcal{R}^\triangle_j$.

To conclude the proof, we show that $\mathcal{R}^\triangle_j$ has stability parameter $\gamma_j$. To this end, note that by (2)
$$\|x^{\triangle,t}_j - x^{\triangle,t-1}_j\|_2^2 = \left\|\big(\hat{x}^t_{ja}x^{\triangle,t}_{ja}\big)_{a\in A_j} - \big(\hat{x}^{t-1}_{ja}x^{\triangle,t-1}_{ja}\big)_{a\in A_j}\right\|_2^2 + \|\hat{x}^t_j - \hat{x}^{t-1}_j\|_2^2 \le \|\hat{x}^t_j - \hat{x}^{t-1}_j\|_2^2\left(1 + 2\sum_{k\in C_j}\|x^{\triangle,t}_k\|_2^2\right) + 2\sum_{k\in C_j}\|x^{\triangle,t}_k - x^{\triangle,t-1}_k\|_2^2 \le 2n_jB_j^2\,\|\hat{x}^t_j - \hat{x}^{t-1}_j\|_2^2 + 2\sum_{k\in C_j}\|x^{\triangle,t}_k - x^{\triangle,t-1}_k\|_2^2,$$
where we have used the Cauchy-Schwarz inequality and the definition of $B_j$ (Equation (16)). By using the stability of $\hat{\mathcal{R}}_j$, that is $\|\hat{x}^t_j - \hat{x}^{t-1}_j\|_2^2 \le \kappa_j^2 = \gamma_j^2/(4n_jB_j^2)$, as well as the hypothesis (25) and (14):
$$\|x^{\triangle,t}_j - x^{\triangle,t-1}_j\|_2^2 \le \frac{\gamma_j^2}{2} + 2\sum_{k\in C_j}\left(\frac{\gamma_j}{2\sqrt{n_j}}\right)^2 = \frac{\gamma_j^2}{2} + 2n_j\left(\frac{\gamma_j}{2\sqrt{n_j}}\right)^2 = \gamma_j^2.$$
Hence, $\mathcal{R}^\triangle_j$ has stability parameter $\gamma_j$ as we wanted to show.

B. Experiments

[Figure 4 panels: Subgame 1 (left) and Subgame 3 (right), simultaneous updates; y-axis: saddle point gap (mbb/g); x-axis: iteration; curves: CFR, DCFR, OFTRL theory, OFTRL tuned.]

Figure 4. Convergence rate with iterations on the x-axis, and the exploitability in mbb. All algorithms use simultaneous updates.

[Figure 5 panels: Subgame 1 (left) and Subgame 3 (right), alternating updates; y-axis: saddle point gap (mbb/g); x-axis: iteration; curves: CFR, DCFR, OFTRL theory, OFTRL tuned.]

Figure 5. Convergence rate with iterations on the x-axis, and the exploitability in mbb. All algorithms use alternating updates.