Solving imperfect-information games via exponential counterfactual regret minimization

Huale Li, Xuan Wang, Shuhan Qi, Jiajia Zhang, Yang Liu, Yulin Wu, Fengwei Jia

Abstract—In general, two-agent decision-making problems can be modeled as a two-player game, and a typical solution is to find a Nash equilibrium in such a game. Counterfactual regret minimization (CFR) is a well-known method for finding a Nash equilibrium in a two-player zero-sum game with imperfect information. The CFR method applies a regret matching algorithm iteratively to reduce regret values progressively, enabling the average strategy to approach a Nash equilibrium. Although CFR-based methods have achieved significant success in the field of imperfect information games, there is still scope for improvement in the efficiency of convergence. To address this challenge, we propose a novel CFR-based method named exponential counterfactual regret minimization (ECFR). In ECFR, an exponential weighting technique is used to reweight the instantaneous regret values during the iteration process. A theoretical proof is provided to guarantee the convergence of the ECFR algorithm. The results of an extensive set of experiments demonstrate that ECFR converges faster than the current state-of-the-art CFR-based methods.

Index Terms—Decision-making, Counterfactual regret minimization, Nash equilibrium, Zero-sum games, Imperfect information.

1 INTRODUCTION

Game theory has often been regarded as a touchstone to verify the theory of artificial intelligence [1]. In general, games can be divided into perfect information games (PIGs) and imperfect information games (IIGs), according to whether each player has complete knowledge of the game state of all the other players. Examples of PIGs are Go and chess, whereas games like poker are IIGs, in which players cannot observe the complete game state of the other players. In other words, compared with PIGs, IIGs usually involve certain private information. Recently, IIGs have attracted a great deal of attention from researchers, since IIGs are much more challenging than PIGs. In this paper, we mainly focus on the study of decision-making for IIGs [2].

In order to solve a two-player zero-sum game with imperfect information, a typical solution is to find a Nash equilibrium strategy [3]. As a popular method of computing the Nash equilibrium strategy, counterfactual regret minimization (CFR) [4] has attracted the attention of many researchers due to its sound theoretical guarantee of convergence. Over the past decade, many variants of CFR have been developed [4], [5], [6], [7], [8]. For example, Monte Carlo counterfactual regret minimization (MCCFR) is a sample-based CFR algorithm, which combines Monte Carlo methods with standard (vanilla) CFR to compute approximate equilibrium strategies in IIGs [5]. CFR+ uses a variant of regret matching (regret matching+) in which regrets are constrained to be non-negative [6]. Double neural CFR [9] and Deep CFR [10] combine deep neural networks with vanilla CFR and linear CFR (LCFR), respectively. Moreover, there are many other CFR variants, such as public chance sampling CFR (PCCFR) [7], variance reduction in MCCFR (VR-MCCFR) [11] and discounted CFR (DCFR) [8].

Moreover, CFR-based methods have achieved notable success in the domain of IIGs, especially in poker games [6], [12], [13], [14]. Libratus was the first agent to beat top human players in heads-up no-limit Texas Hold'em poker [12]. DeepStack also defeated professional poker players, and its method has been demonstrated to be sound with a theoretical proof [13]. Pluribus, the latest computer program based on CFR, defeated top poker players in six-player no-limit Texas Hold'em poker, which can be recognized as a milestone in the field of artificial intelligence and game theory [14].

Although agents based on the CFR method have achieved conspicuous success, they do not directly improve the vanilla CFR method itself. In other words, these agents combine CFR with other techniques (abstraction, neural networks, etc.) to solve for the game strategy. Moreover, vanilla CFR is an iterative strategy-solving method that relies on increasing the number of training iterations to improve the accuracy of the strategy. However, this means that finding a robust strategy may require a large number of iterations, which leads to a significant amount of solving time. This makes CFR-based methods difficult to apply in some real situations. Therefore, it is worth investigating how to accelerate the convergence of CFR and obtain a robust strategy more efficiently.

Towards this goal, we propose an exponential CFR (ECFR), which speeds up CFR by focusing more attention on the actions with greater advantage. The regret value of an action represents the advantage of this action compared with the expected action of the current strategy. Therefore, we utilize the average regret value as a threshold to filter the advantageous actions. Then, actions with different advantages are weighted differently, i.e., actions with a higher advantage are given a higher weight, and vice versa. In this way, the updated strategy has a tendency to select the advantageous actions, and the convergence of CFR is accelerated.

(School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China, 518055. E-mail: [email protected])

To be specific, we adopt the regret value as the metric to evaluate the advantage of actions. We propose an exponential weighting technique that applies an exponential weight to actions with a higher regret. This causes the iterative algorithm to pay more attention to actions with higher regret values and so accelerates the training process. Also, in contrast to traditional methods, actions with negative regret values are still considered, instead of having their probabilities set to zero. We summarize our contributions as follows:

1) We present an exponential weighting technique that is applied to vanilla CFR, called exponential CFR (ECFR), which makes the convergence of CFR more efficient.
2) We give a proof of convergence for our ECFR, which, like CFR, provides it with a solid theoretical basis.
3) Three different games (Kuhn poker, Leduc poker and Royal poker) are used to evaluate our ECFR. Extensive experimental results show that ECFR converges faster in the three kinds of games compared with the current state-of-the-art CFR-based methods.

Fig. 1. The game tree of the Coin Toss game. ("C" represents a chance node. P1 and P2 are game players. A coin is flipped and lands either Heads or Tails with equal probability, but only player P1 can see the outcome. The information set of P2 is the dotted line between the two P2 nodes, which means that player P2 cannot distinguish between the two states.)

The rest of our paper is organized as follows. In Sect. 2 we introduce the concept of the extensive-form game, and describe the Nash equilibrium and CFR. Our proposed method is described in Sect. 3, which includes details of the ECFR and its proof of convergence. In Sect. 4, we evaluate the performance of the ECFR on three varieties of IIG. Finally, in Sect. 5, we present the conclusions of the paper.

2 NOTATION AND PRELIMINARIES

In this section, some notation and definitions of extensive-form games are introduced. Then, the concept of the Nash equilibrium is described. Finally, an overview of CFR is provided.

2.1 Extensive-Form Game Model

The extensive-form game is a classical model for sequential decision-making. It is particularly useful in the field of IIGs, in which some decision-makers hold private information from each other. Before presenting the formal definition, we consider a specific example, the Coin Toss game. Fig. 1 represents a game tree for the game of Coin Toss.

As shown in Fig. 1, each node represents a game state within the game tree. A leaf node, which is also known as a terminal node, indicates that the game has ended, and the corresponding payoff is returned after the game ends. In addition, the edge between two nodes represents an action or decision taken by a game player. In the Coin Toss game, a coin is flipped and lands either Heads or Tails with equal probability, but only player P1 knows the outcome. In Fig. 1, player P1 can choose between the actions Left and Right, with the action Left leading directly to a payoff. If the action Right is selected by player P1, then player P2 has an opportunity to guess how the coin landed. If P2 guesses correctly, P1 will receive a reward of -1 and P2 will receive a reward of 1 [15].

Generally, a finite extensive-form game with imperfect information has six components, represented as ⟨N, H, P, f_c, I, u⟩ [16]: N represents the game players. H is a limited set of sequences that represent the possible historical actions. P is the player function; P(h) is the player who takes action a after history h, and if P(h) = c, then chance determines the action after history h. f_c is a function that associates a probability with every history h at which chance acts, and f_c(a | h) is the probability that action a occurs given history h. I is the information set; for any information set I_i belonging to player i, all nodes h, h' ∈ I_i are indistinguishable to player i. u is a utility function for every terminal state. Due to the number of symbols in this paper, we give a brief list of the variables for further reference in Table 1.

TABLE 1
Description of variables

Variable    Meaning
u           the utility function, where u_i represents the utility of player i
H           a limited set of sequences, which is the possible set of historical actions
σ           the strategy, where σ_i is the strategy of player i and σ_{-i} is the strategy of the other player
I_i         the information set of player i
π^σ(h)      the joint probability of reaching h if all players play according to σ; π_i^σ(h) is the probability of reaching h if player i operates according to σ
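To make these components concrete, the following Python sketch (added here for illustration; it is not part of the original paper) encodes the Coin Toss game tree of Fig. 1 as nested dictionaries and evaluates an expected payoff. The node labels and the ±0.5 payoffs of the Left action, together with their assignment to the Heads/Tails branches, are assumptions read off the figure; only P2's shared information set and the ±1 payoffs after a guess are stated explicitly in the text.

    # Hypothetical encoding of the Coin Toss game tree of Fig. 1.
    # Payoffs are for P1; the game is zero-sum, so P2 receives the negative.
    COIN_TOSS = {
        "player": "chance",
        "actions": {
            "Heads": {
                "player": "P1", "infoset": "P1-after-Heads",
                "actions": {
                    "Left":  {"terminal": True, "payoff_p1": 0.5},   # assumed from the figure
                    "Right": {
                        "player": "P2", "infoset": "P2-guess",       # shared by both P2 nodes
                        "actions": {
                            "GuessHeads": {"terminal": True, "payoff_p1": -1},  # correct guess
                            "GuessTails": {"terminal": True, "payoff_p1": 1},
                        },
                    },
                },
            },
            "Tails": {
                "player": "P1", "infoset": "P1-after-Tails",
                "actions": {
                    "Left":  {"terminal": True, "payoff_p1": -0.5},  # assumed from the figure
                    "Right": {
                        "player": "P2", "infoset": "P2-guess",       # P2 cannot tell the branches apart
                        "actions": {
                            "GuessHeads": {"terminal": True, "payoff_p1": 1},
                            "GuessTails": {"terminal": True, "payoff_p1": -1},  # correct guess
                        },
                    },
                },
            },
        },
    }

    def expected_value_p1(node, strategies, chance_prob=0.5):
        """u_1(sigma): expected payoff of P1 when all players follow `strategies`,
        a dict mapping each information set to a dict of action probabilities."""
        if node.get("terminal"):
            return node["payoff_p1"]
        if node["player"] == "chance":
            return sum(chance_prob * expected_value_p1(child, strategies)
                       for child in node["actions"].values())
        policy = strategies[node["infoset"]]
        return sum(policy[a] * expected_value_p1(child, strategies)
                   for a, child in node["actions"].items())

    # Example: P1 always plays Right, P2 guesses uniformly -> expected payoff 0 for P1.
    sigma = {"P1-after-Heads": {"Left": 0.0, "Right": 1.0},
             "P1-after-Tails": {"Left": 0.0, "Right": 1.0},
             "P2-guess": {"GuessHeads": 0.5, "GuessTails": 0.5}}
    print(expected_value_p1(COIN_TOSS, sigma))   # 0.0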

2.2 Nash Equilibrium

The Nash equilibrium is a fundamental concept in game theory, which lays the theoretical foundation for many studies. A Nash equilibrium is usually used to compute the strategy of a two-player extensive-form game in the field of IIGs, and it can also be called a non-cooperative game equilibrium (NE) [3]. To better understand the Nash equilibrium, we first introduce the concept of a strategy as follows.

A strategy σ is a probability vector over actions in the extensive-form game, where σ_i is the strategy of player i. σ_i(I, a) represents the probability of player i taking action a under the information set I. σ_{-i} refers to all the strategies in σ except player i's strategy σ_i. π^σ(h) is the probability of history h occurring if the game players take the legal actions according to strategy σ. π^σ_{-i}(h) of a history h is the contribution of chance and all players other than player i. u_i(σ_i, σ_{-i}) is the expected payoff for player i if all players play according to the strategy profile (σ_i, σ_{-i}).

A best response to σ_{-i} is a strategy of player i, BR(σ_{-i}), such that u_i(BR(σ_{-i}), σ_{-i}) = max_{σ'_i} u_i(σ'_i, σ_{-i}). A Nash equilibrium σ* is a strategy profile in which everyone plays a best response: ∀i, u_i(σ*_i, σ*_{-i}) = max_{σ'_i} u_i(σ'_i, σ*_{-i}) [3].

2.3 Counterfactual Regret Minimization

Counterfactual regret minimization (CFR) is a popular iterative method to find the Nash equilibrium strategy in two-player zero-sum games with imperfect information [4]. In general, CFR can be divided into two steps: regret calculation and regret matching. The former calculates the regret values of the actions in each iteration, and the latter updates the global strategy. We provide an overview of vanilla CFR as follows.

Step 1: Let σ^t be the strategy at iteration t. The instant regret r^t(I, a) on iteration t for the action a in the information set I is formally defined as follows:

    r^t(I, a) = v^{σ^t}(I, a) − v^{σ^t}(I)    (1)

where v^σ(I) is the counterfactual value, which represents the expected payoff of player i, weighted by the probability that player i would reach I if the player tried to do so on that iteration. The counterfactual value v^σ(I) is defined as:

    v^σ(I) = Σ_{h∈I} π^σ_{−i}(h) ( Σ_{z∈Z} π^σ(h, z) u_i(z) )    (2)

where u_i(z) is the payoff of player i in the leaf node z. The counterfactual value of the action a is defined as:

    v^σ(I, a) = Σ_{h∈I} π^σ_{−i}(h) ( Σ_{z∈Z} π^σ(h · a, z) u_i(z) )    (3)

The counterfactual regret R^T(I, a) for the action a in the information set I over T iterations is defined as:

    R^T(I, a) = Σ_{t=1}^{T} r^t(I, a)    (4)

Step 2: We define R^T_+(I, a) = max(R^T(I, a), 0). The CFR algorithm updates its strategy iteratively through the regret matching (RM) algorithm on each information set. In RM, a player picks a distribution over the actions in an information set in proportion to the positive regret of those actions. Formally, on iteration T + 1, the player selects actions a ∈ A(I) according to the probabilities:

    σ^{T+1}(I, a) = R^T_+(I, a) / Σ_{a'∈A(I)} R^T_+(I, a'),   if Σ_{a'∈A(I)} R^T_+(I, a') > 0
    σ^{T+1}(I, a) = 1 / |A(I)|,                               otherwise    (5)

The regret matching algorithm guarantees that the level of regret decreases over time, so that the average strategy eventually achieves the same effect as a Nash equilibrium strategy.

3 OUR METHOD

In this section, firstly, the exponential weighting technique is presented in Sect. 3.1. Secondly, the process of ECFR is introduced in Sect. 3.2. Thirdly, the proof of convergence of ECFR is given in Sect. 3.3. Finally, the differences between our method and other CFR-based methods are discussed.

3.1 Exponential Weighting Technique

In recent years, there have been many improved methods based on vanilla CFR. For instance, Discounted CFR (DCFR) [8], Linear CFR (LCFR) [10] and dynamic thresholding for CFR [17] aim at speeding up the convergence of vanilla CFR. Among these methods, both LCFR and DCFR primarily balance the weight of the regret generated in the early and later iterations. For LCFR [10] and DCFR [8], the improvement over vanilla CFR lies in their way of weighting the regret value. LCFR uses the number of iterations to modify the weight of the regret value. DCFR also reweights regret values, but this weight is different for positive and negative regret, being t^α/(t^α + 1) and t^β/(t^β + 1) respectively (where t is the number of the iteration). [17] introduced a dynamic thresholding for CFR in which a threshold is set at each iteration such that the probability of actions below the threshold is set to zero.

Actually, the essence of CFR is that it is an iterative strategy, so it becomes progressively more accurate as the number of iterations increases. The strategy affects the regret value, which decreases as the number of iterations increases. Therefore, the ultimate goal of both vanilla CFR and several of its improved variants is to accelerate convergence, that is, to improve the speed of convergence. In this paper, with the same purpose, we propose an exponential weighting technique, which tries to make the strategy converge faster by reweighting the regret value.

In CFR-based methods, different regret values indicate different levels of importance. The regret value of an action represents the advantage of this action compared with the expected action of the current strategy. This means it would be beneficial to focus on actions with higher regret values by giving them higher weights.

In our method, an exponential weighting technique is proposed, which applies an exponential weight to actions with a higher level of regret. This makes the iterative algorithm pay more attention to actions that incur a higher regret and so improves the strategy accordingly. Its formal description is as follows:

    f(x) = e^α · x,  if x > 0
    f(x) = e^α · β,  if x ≤ 0    (6)

where α is a parameter that is closely related to the variable x, β is a parameter with a small value that will be discussed in the Experiment section below, and f(x) is the output.
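As a small illustration of how the exponential weight differs from plain regret matching, consider the following Python sketch (ours, not the authors' code; the default value of β is only an example and its setting is studied in Sect. 4.2.2):

    import math

    def regret_matching(cum_regrets):
        """Vanilla regret matching (Eq. 5): play each action in proportion to its
        positive cumulative regret, or uniformly if no regret is positive."""
        positive = [max(r, 0.0) for r in cum_regrets]
        total = sum(positive)
        if total > 0:
            return [r / total for r in positive]
        return [1.0 / len(cum_regrets)] * len(cum_regrets)

    def exponential_weight(x, alpha, beta=-1e-4):
        """Eq. 6: f(x) = e^alpha * x if x > 0, else e^alpha * beta, so actions with
        non-positive regret keep a small non-zero weight instead of being zeroed out."""
        return math.exp(alpha) * x if x > 0 else math.exp(alpha) * beta

    # Toy example with three actions.
    instant_regrets = [2.0, -0.5, 0.3]
    ev = sum(instant_regrets) / len(instant_regrets)   # average regret EV_I (see Sect. 3.2)
    weighted = [exponential_weight(r, alpha=r - ev) for r in instant_regrets]
    print(regret_matching(instant_regrets))  # approx. [0.87, 0.0, 0.13]
    print(weighted)                          # above-average actions receive amplified weights

Note that in Sect. 3.2 the parameter α is instantiated as the L1 loss, i.e., the gap between an action's instantaneous regret and the information-set average, which is what the toy example above uses.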

To be more specific, the variable x of Eq. 6 may take a negative value in the process of solving games with CFR. In contrast to conventional methods that set negative values to 0, our method sets a variable with a negative value to a new minimum value e^α · β. This is because the strategy obtained in the early stages is not yet accurate enough, and some actions with a negative regret value are still worth considering. In the early stages of the strategy iteration, it is unreasonable to ignore actions with a negative regret when updating strategies. Therefore, the variable with a negative value is set to a new minimum value in our method.

3.2 Exponential Counterfactual Regret Minimization

We propose a novel CFR-based variant, known as Exponential Counterfactual Regret Minimization (ECFR). Our method is based on vanilla CFR, and it redistributes the weight of the instantaneous regret values through the exponential weighting technique introduced in the last section.

As indicated in Eq. 6, the parameter α in the exponential weighting method is closely related to the variable x. Specifically, in ECFR we define a loss function, which can be regarded as the parameter α in Eq. 6. In addition, we regard the instantaneous regret value on each iteration as the variable x in the exponential weighting technique. The instant regret r_i^t(I, a) is filtered with the mean value EV_I, which makes the strategy of the next iteration focus on the more advantageous actions by giving them higher weights. The loss function can then be defined as follows:

    L1 = r_i^t(I, a) − EV_I    (7)

where r_i^t(I, a) has the same definition as Eq. 1 in Sect. 2.3, that is, the immediate regret value of action a for player i on iteration t. EV_I is the average counterfactual regret value on each iteration, EV_I = (1/|A(I)|) Σ_{a∈A(I)} r(I, a), and A(I) represents the legal actions on information set I.

Following the approach of vanilla CFR, at each iteration ECFR aims to minimize the total regret value by minimizing the regret on each information set. In contrast to vanilla CFR, ECFR uses a particular weight for the calculation of the immediate regret value. As the iterations increase, ECFR pays more attention to the actions with higher instant regret values by introducing the L1 loss, which is weighted in exponential form. The regret R^T_{i,ECFR}(I, a) for all actions a ∈ A(I) on each information set I can be calculated as follows:

    R^T_{i,ECFR}(I, a) = Σ_{t=1}^{T} e^{L1} r_i^t(I, a),  if r_i^t(I, a) > 0
    R^T_{i,ECFR}(I, a) = Σ_{t=1}^{T} e^{L1} β,            if r_i^t(I, a) ≤ 0    (8)

where β is a parameter that will be set in Sect. 4.2.2 below.

Then the strategy for iteration T + 1 can be computed with a regret matching algorithm (RM) as follows:

    σ_i^{T+1}(I, a) = e^{L1} R^T_{i,ECFR}(I, a) / Σ_{a'∈A(I)} e^{L1} R^T_{i,ECFR}(I, a')    (9)

In vanilla CFR, the total regret satisfies R_i^T ≤ Σ_{I∈I_i} R_i^T(I) if player i plays according to CFR on each iteration. Thus, as T → ∞, R_i^T / T → 0 [4]. If the average regret of both players satisfies R_i^T / T ≤ ε, then their average strategies σ̄_1^T, σ̄_2^T form a 2ε-Nash equilibrium in a two-player zero-sum game [18], where the average strategy is updated as σ̄_i^T(I) = Σ_{t=1}^{T} π_i^{σ^t}(I) σ_i^t(I) / Σ_{t=1}^{T} π_i^{σ^t}(I).

For regret matching, [19] proved that if Σ_{t=1}^{∞} w_t = ∞, then the weighted average regret, which is defined as R_i^{w,T} = max_{a∈A} Σ_{t=1}^{T} w_t r^t(a) / Σ_{t=1}^{T} w_t, is bounded by:

    R_i^{w,T} ≤ Δ √|A| √(Σ_{t=1}^{T} w_t²) / Σ_{t=1}^{T} w_t    (10)

The work of [20] has shown that the weighted average strategy σ_i^{w,T}(I) = Σ_{t∈T} w_t π_i^{σ^t}(I) σ_i^t(I) / Σ_{t∈T} w_t π_i^{σ^t}(I) is a 2ε-Nash equilibrium if the weighted average regret is ε in two-player zero-sum games. Our method obtains a similar theoretical guarantee: the average strategy of the players computed with ECFR will eventually converge to a Nash equilibrium. The average strategy of ECFR is as follows:

    σ̄^T(I, a) = Σ_{t=1}^{T} e^{L1} π_i^{σ^t}(I) σ_i^t(I, a) / Σ_{t=1}^{T} e^{L1} π_i^{σ^t}(I)    (11)

The detailed proof of convergence is given in the next section. The ECFR procedure is described in Algorithm 1 and Algorithm 2.

Algorithm 1 The ECFR
Input: The game G, the strategy σ_i^t for each player, the regrets r_i^t(I, a) and R_i^t(I, a), β, iteration T.
Output: The strategy σ_i^{t+1}(I) of the next iteration, the average strategy σ̄_i^T(I).
Initialize each player's strategy σ_i^1, the regrets r_i^1(I, a) and R_i^1(I, a).
1: for ECFR iteration t = 1 to T do
2:   Env = G
3:   for each player i do
4:     Traverse(∅, i, σ_1^t(I), σ_2^t(I), t)
5:     Compute the strategy of iteration t + 1, σ^{t+1}(I), with regret matching, Eq. 9.
6:     Compute the average strategy σ̄_i^T(I) with Eq. 11.
7: return σ_i^{t+1}(I), σ̄_i^T(I)

Algorithm 2 Traverse the game tree with ECFR
function Traverse(h, i, σ_1^t(I), σ_2^t(I), t)
Input: The history h, player i, strategies σ_1^t(I) and σ_2^t(I), ECFR iteration t.
1: if h is a terminal node then
2:   return the payoff of the player i
3: else if h is a chance node then
4:   a ∼ σ(h)
5:   return Traverse(h · a, i, σ_1^t(I), σ_2^t(I), t)
6: else if i is the acting player then
7:   for a ∈ A(h) do
8:     v(I, a) ← Traverse(h · a, i, σ_1^t(I), σ_2^t(I), t)    ▷ Traverse each action
9:   for a ∈ A(h) do
10:    r^t(I, a) ← v(I, a) − Σ_{a'∈A(h)} σ^t(I, a') · v(a')
11:    Compute the regret R^T_{i,ECFR}(I, a) with Eq. 8
12: else    ▷ If the opponent is the acting player
13:   Compute the regret R_{−i,ECFR}(I, a) with Eq. 8
14: return Traverse(h, i, σ_1^t(I), σ_2^t(I), t)
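As a complement to the pseudocode above, the following Python sketch (ours; function and variable names such as ecfr_update are illustrative, and the zero-clipping in the strategy update is our practical reading of Eq. 9) shows the per-information-set bookkeeping corresponding to Eqs. 7, 8, 9 and 11:

    import math

    def ecfr_update(cum_regrets, avg_num, avg_den, instant_regrets,
                    reach_prob, strategy, beta=-1e-4):
        # One bookkeeping step of ECFR for a single information set I (a sketch,
        # not the authors' implementation). instant_regrets[a] = r^t(I,a),
        # strategy[a] = sigma^t(I,a), reach_prob = pi_i^{sigma^t}(I),
        # beta is the small parameter of Eq. 8.
        actions = list(instant_regrets)
        ev = sum(instant_regrets.values()) / len(actions)          # EV_I
        weights = {}
        for a in actions:
            L1 = instant_regrets[a] - ev                            # Eq. 7
            w = math.exp(L1)
            # Eq. 8: exponentially weighted cumulative regret; non-positive
            # instantaneous regret contributes w * beta instead of being dropped.
            cum_regrets[a] += w * (instant_regrets[a] if instant_regrets[a] > 0 else beta)
            # Eq. 11: numerator and denominator of the weighted average strategy.
            avg_num[a] += w * reach_prob * strategy[a]
            avg_den[a] += w * reach_prob
            # Eq. 9 (using the current iteration's L1, which is our reading):
            # weight of action a in the next-iteration strategy, clipped at zero
            # as in regret matching so that the result is a valid distribution.
            weights[a] = max(w * cum_regrets[a], 0.0)
        total = sum(weights.values())
        next_strategy = ({a: weights[a] / total for a in actions} if total > 0
                         else {a: 1.0 / len(actions) for a in actions})
        avg_strategy = {a: (avg_num[a] / avg_den[a] if avg_den[a] > 0 else 1.0 / len(actions))
                        for a in actions}
        return next_strategy, avg_strategy

    # Toy usage with two actions at one information set.
    cum_r = {"call": 0.0, "fold": 0.0}
    num = {"call": 0.0, "fold": 0.0}
    den = {"call": 0.0, "fold": 0.0}
    nxt, avg = ecfr_update(cum_r, num, den,
                           instant_regrets={"call": 0.4, "fold": -0.2},
                           reach_prob=1.0, strategy={"call": 0.5, "fold": 0.5})
    print(nxt)   # the advantageous action ("call") receives all of the probability here
    print(avg)   # weighted average strategy so far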

3.3 Proof of Convergence

A brief but sufficient theoretical proof of convergence for ECFR is given in this section. As described in the previous section, our ECFR method uses an RM-based algorithm to adjust the average strategy, in which the regret values are given different weights. Here we prove that there is a convergence bound for the ECFR average strategy, and that this bound is never higher than that of vanilla CFR.

Theorem 1. Assume that the number of iterations is T and that ECFR is conducted in a two-player zero-sum game. Then the weighted average strategy profile is a (Δ|I_i|√|A| e^{2T−T²}/√T)-Nash equilibrium.

Proof. The lowest amount of instantaneous regret on any iteration is −Δ. Consider the weighted sequence of iterations σ'^1, ..., σ'^T, where σ'^t is identical to σ^t, but the weight is w_{a,t} = ∏_{i=t}^{T−1} e^{i} = e^{(T+t−1)(T−t)/2} rather than w_{a,t} = ∏_{i=t}^{T−1} e^{L1}. R'^t(I, a) is the regret of action a on information set I at iteration t for this new sequence.

In addition, for regret matching, [21] proves that if Σ_{t=1}^{∞} w_t = ∞, then the weighted average regret, defined as R_i^{w,T} = max_{a∈A} Σ_{t=1}^{T} w_t r^t(a) / Σ_{t=1}^{T} w_t, is bounded by Eq. 10 in Sect. 3.2.

We find that R^t(I, a) ≤ Δ√|A| e^{2T−T²}/√T for player i performing action a on information set I, from Lemma 3. We can use Lemma 1, which uses the weight w_{a,t} for iteration t, with B = Δ√|A| e^{2T−T²}/√T and C = 0. This means that R'^t(I, a) ≤ w_T(B − C) ≤ Δ√|A| e^{2T−T²}/√T from Lemma 1. Furthermore, the weighted average regret is at most Δ|I_i|√|A| e^{2T−T²}/√T from Lemma 3, because, for the information sets, |I_1| + |I_2| = |I|. The weighted average strategies therefore form a (Δ|I|√|A| e^{2T−T²}/√T)-Nash equilibrium (as the number of iterations increases, Δ|I|√|A| e^{2T−T²}/√T approaches zero).

Lemma 1. Call a sequence x_1, ..., x_T of bounded real values BC-plausible if B > 0, C ≤ 0, Σ_{t=1}^{i} x_t ≥ C for all i, and Σ_{t=1}^{T} x_t ≤ B. For any BC-plausible sequence and any sequence of non-decreasing weights w_t ≥ 0, Σ_{t=1}^{T} (w_t x_t) ≤ w_T(B − C).

Lemma 2. Given a group of actions A and any sequence of rewards v^t, such that |v^t(a) − v^t(b)| ≤ Δ for all t and all a, b ∈ A, after conducting a set of strategies decided by regret matching and applying the regret-like value Q^t(a) instead of R^t(a), Q^T(a) ≤ Δ√(|A|T) for all a ∈ A.

Proof. Lemma 2 closely resembles Lemma 1, and both are from [22]; thus we do not give the detailed proofs of these two lemmas here.

Lemma 3. Suppose player i conducts T iterations based on ECFR. Then the weighted regret for player i is at most Δ√(|I_i||A|T), and the weighted average regret for player i is at most Δ|I_i|√|A| e^{2T−T²}/√T.

Proof. The weight on iteration t < T is w_t = ∏_{i=t}^{T−1} e^{L1} and w_T = 1. Therefore, for all iterations t, Σ_{t=1}^{T} w_t² ≤ T e^{4T}. In addition, Σ_{t=1}^{T} w_t ≥ T e^{T(T+1)/2} ≥ T e^{T²}.

Through Eq. 10 and Lemma 2, we find that Q_i^{w,T} ≤ Δ√|A| √(Σ_{t=1}^{T} w_t²) / Σ_{t=1}^{T} w_t ≤ Δ√|A| e^{2T−T²}/√T. Due to R_i^T ≤ Σ_{I∈I_i} R_i^T(I) [21], we find that Q_i^{w,T} ≤ Δ|I_i|√|A| e^{2T−T²}/√T. Since R_i^{w,T} ≤ Q_i^{w,T}, it follows that R_i^{w,T} ≤ Δ|I_i|√|A| e^{2T−T²}/√T.

It can be seen that the regret bound of our method is never higher than that of vanilla CFR. We give a brief analysis of the regret bound. Δ√|A| e^{2T−T²}/√T is the regret bound of our method ECFR, and Δ|I_i|√|A_i|/√T is the regret bound of vanilla CFR [4]. Let R_ECFR and R_CFR represent these two regret bounds. Then R_ECFR/R_CFR = (Δ√|A| e^{2T−T²}/√T) / (Δ|I_i|√|A_i|/√T) = e^{2T}/(e^{T²}|I_i|), and e^{2T}/(e^{T²}|I_i|) ≤ e^{2T}/e^{T²}, since |I_i| is the number of information sets of player i, which is an integer not less than 1. For e^{2T}/e^{T²}, we have e^{2T}/e^{T²} < 1 when T > 2, where T is the number of iterations, an integer not less than 1. In general, the number of iterations is much greater than 2. Thus, we can conclude that the regret bound of our method is never higher than that of vanilla CFR.

3.4 Differences between ECFR and other CFR-based Methods

In this section we analyze the differences between our ECFR method and the three other major CFR-based methods that also aim to speed up the convergence of vanilla CFR. These are LCFR [10], DCFR [8] and dynamic thresholding for CFR [17], which we will call dynamic CFR.

Both LCFR and DCFR improve upon CFR by applying a reweighting strategy. DCFR [8] is implemented by discounting the immediate regret value of each iteration. A positive immediate regret value is multiplied by a weight of t^α/(t^α + 1); a negative immediate regret value is multiplied by a weight of t^β/(t^β + 1). In addition, in DCFR the average strategy is multiplied by (t/(t + 1))^γ to obtain the final strategy. After these three forms of discounting, the regret value at each iteration and the final strategy are updated with the reallocated weights. LCFR [10] uses the iteration t to weight the immediate regret value; that is to say, the regret value is weighted by the iteration t at each iteration as the number of iterations increases. Dynamic CFR [17] speeds up the convergence by pruning parts of the decision tree using dynamic thresholding.

Although our ECFR also reweights the regret value on each iteration, it is different from DCFR and LCFR in both the overall concept and the practical implementation. First of all, our approach comes from the intuitive idea that, whichever form of weight is given to the regret values, the action with a higher regret value should be given a larger probability, where the ultimate goal is to further accelerate the convergence of the strategy by reweighting the regret value. Secondly, in terms of implementation details, our method reallocates the weight via the exponential form of the L1 loss, which differs from DCFR, which discounts based on the number of iterations. Therefore, our method is distinct from DCFR and LCFR. A sketch contrasting the three weighting schemes is given below.

In dynamic CFR, the exponential weight is adopted in calculating the strategy of the next iteration, and a hedging algorithm is used to minimize regret. However, our ECFR approach is quite different from dynamic CFR in six aspects. Firstly, our approach uses the exponential weight to calculate both the cumulative regret and the next-iteration strategy, while dynamic CFR only uses the exponential weight in calculating the strategy of the next iteration. Secondly, our approach gives a small value to negative immediate regret, because we believe that the actions that give rise to negative regret in the early stages are also relevant, while dynamic CFR only deals with positive regret. Thirdly, the exponential weight is only used in dynamic CFR when applying the hedging algorithm to minimize the regret; conversely, our approach uses a regret matching algorithm as the regret minimization method for the strategy iteration. Fourthly, dynamic CFR does not traverse all nodes in the game tree, but prunes the nodes below a threshold to accelerate the convergence; our approach traverses all the nodes in the game tree, and accelerates convergence by redistributing the weight of the regret. Fifthly, the parameter settings for the exponential weighting are different in the two methods: in dynamic CFR the weight parameter is set to √(ln(|A(I)|)) / (3 √(VAR(I)_t) √t) (where VAR(I)_t is the observed variance of v_i(I) up to iteration t), while in our approach it is set to L1 (L1 = r(I, a) − EV_I). Finally, from the theoretical analysis, the regret bounds of the two methods are different: the regret bound of dynamic CFR is R^T(I) ≤ C√2 Δ(I) √(ln(|A(I)|) T) (with C ≥ 1 on every iteration t), while the regret bound of our method is R^T(I) ≤ Δ√|A| e^{2T−T²}/√T.

4 EXPERIMENT

4.1 Experimental Setup

In recent years, the game of poker has been widely used to test the performance of CFR-based methods, because it contains all the elements of an IIG. In this paper, we compare our ECFR with other CFR-based methods in the context of three different poker games: Kuhn, Leduc and Royal. These three poker games are all simplified versions of Texas Hold'em poker and popular benchmarks for the IIG. We chose them as the test platform because they are large enough to be highly nontrivial but small enough to be solvable, which makes it convenient to evaluate the performance of the algorithms.

It is worth noting that the three kinds of poker used in the experiments are two-player games. Among these three kinds of poker, Kuhn poker is the simplest: it has three cards in total, only one round, each player has one hand, and there are no public cards. Leduc poker has six cards and operates over two rounds; each player has a private hand in the first round, and a public card is revealed in the second round. There are eight cards in Royal poker, and there are three rounds: in the first round, each player receives a private card, and a public card is issued in each of the second and third rounds. Tab. 2 gives some more details of the three types of poker.

4.2 Experimental Results

Exploitability is a standard evaluation metric used to measure the effectiveness of a strategy solved by an algorithm. The exploitability of a strategy is defined as e(σ_i) = u_i(σ*_i, BR(σ*_i)) − u_i(σ_i, BR(σ_i)), which determines how close σ is to an equilibrium; a lower exploitability indicates a better strategy. When the exploitability is zero, the strategy cannot be beaten by any other strategy.

TABLE 2
Details of the three kinds of poker

Poker   Total cards   Public cards   Private cards   Rounds   Ante   Bet size
Kuhn    3             0              1               1        1      1
Leduc   6             1              1               2        1      2; 4
Royal   8             2              1               3        1      2; 4; 4

[Fig. 2 plots: exploitability (Y-axis) versus number of iterations (X-axis) for ECFR, CFR, CFR+, LCFR and DCFR, tested on (a) Kuhn, (b) Leduc and (c) Royal poker; see the caption below.]

Fig. 2. Comparison with state-of-the-art methods. (The Y-axis represents the exploitability, and the X-axis represents the number of iterations. The lower the exploitability, the better the strategy. The subfigures in the left-hand column show the overall curves of the experimental results. The subfigures in the right-hand column are a more detailed representation of parts of the subfigures in the left column, in order to better display the experimental results.)

Two groups of experiments were conducted. The first group aimed to verify the effectiveness of our method on the three different games. The second group was an ablation study, which analyzed the sensitivity of the results to the parameter settings.

4.2.1 Comparison with State-of-the-art Methods

We conducted the first group of experiments with four state-of-the-art methods: CFR [4], CFR+ [6], LCFR [10] and DCFR [8]. We found that, for all the methods, when the number of iterations reached 10,000, the reduction in exploitability became very small. Thus, we limited the tests to 10,000 iterations, which we believe is sufficient to demonstrate the effectiveness of each method. The experimental results are shown in Fig. 2. Note that the sub-figures in the left-hand column show the overall progress of the experiments, while the sub-figures in the right-hand column highlight some details of the sub-figures in the other column.

As shown in Fig. 2, four baseline methods, CFR [4], CFR+ [6], LCFR [10] and DCFR [8], were compared with our ECFR approach. CFR [4] was the first method to solve the strategy through regret matching in IIGs. CFR+ uses regret matching+ to update the strategy of the next iteration, which converges faster than CFR. LCFR and DCFR are both CFR-based methods that reweight the regrets from iterations in various ways. For DCFR, we chose the parameters α = 1.5, β = 0 and γ = 2, which were the optimal parameters given in [8].

First, we analyzed the convergence of our approach. In Fig. 2 (especially in the subfigures on the left), we found that our ECFR approach (indicated by the solid red line) always ended up close to zero in the three test games. In addition, from the overall trend of the curves, we also find that, with increasing iterations, ECFR shows a similar trend to the other four methods; that is, the curves present a downward trend. Moreover, the convergence of ECFR was proven theoretically in Sect. 3.3. Therefore, the convergence of our approach has been verified from both experimental and theoretical perspectives.

Fig. 2a shows the experimental results for Kuhn poker. It can be seen that ECFR performs better than the other methods overall, although DCFR performs better than ECFR over some ranges of iterations, such as t = 82~88, 114~120, 182~185 and 241~248. The convergence of LCFR appears to be unstable and fluctuates a lot. In contrast, our approach performs better than the other methods in most iterations, and its performance is relatively stable.

Fig. 2b shows the experimental results for Leduc poker. It can be seen that in iterations t = 73~76 and 118~122, the exploitability of ECFR (the red curve) converges slightly more slowly than that of DCFR and LCFR. However, apart from this limited number of iterations, our method converges faster than the other methods.

Fig. 2c shows the experimental results for Royal poker. It can be seen clearly from the figure that the performance of our method is better than that of the other comparison methods. The red curve is always at the bottom compared with the other curves, which indicates that it converges earlier.

To sum up, three games were used to test the effectiveness of our method. In terms of convergence, the experimental results show that our method converges reliably. In terms of the rate of convergence, the experimental results show that our method can speed up the convergence of the strategy, showing better performance than the other comparison methods. Therefore, the results fully verify the effectiveness of our ECFR method.

4.2.2 Ablation Study

The ablation study was conducted in the second group of experiments, which includes two aspects. First, we analyzed the sensitivity of the results to the different parameter settings. Secondly, we verified the effect of the parameter β by running ECFR with and without β.

For β in Eq. 6, several different settings were tested, specifically β = ±1, ±0.1, ±0.01, ±0.001, ±0.0001, ±0.00001; r, r², r³; ±1/t, ±1/t², ±1/t³; tr, tr², tr³, where r is the instantaneous regret of each action and t is the number of iterations. The results are shown in Fig. 3a. We started with a rough selection over these groups of values and then made a further, more detailed selection. For the fine-tuning selection, the value of β was set to −0.008, −0.009, −0.0001, −0.00011, −0.00012, −r², −r²/t and −r²/t², and the results are shown in Fig. 3b. Considering that we are only choosing the optimal parameter settings here, two games (Kuhn and Leduc) were used to test the settings, and the number of iterations was set to 1,000.

As shown in Fig. 3a, we find that β = −0.0001, β = −r², β = −0.00001 and β = r³ perform well in the Kuhn poker game. In Leduc, β = −0.0001 and β = −r² produce good performance, but β = −0.00001 and β = r³ perform worse than β = −0.0001 and β = −r².

On the basis of these results, β = −0.0001 and β = −r² were chosen for further optimization. In order to fine-tune the appropriate value for β, we set β = −0.0001, −0.008, −0.009, −0.00011, −0.00012. Also, β = −r², −r²/t and −r²/t² were added for comparison. The results are shown in Fig. 3b. In the Kuhn poker game, we found that β = −r² performs the best compared with the other settings in Fig. 3b. In the Leduc poker game, although β = −0.0001, −0.00011 and −0.00012 perform better in the first 400 iterations, the performance of β = −r² gradually exceeds the others after that point. Therefore, β = −r² was selected as the final setting in the comparison experiments.

In addition, we also analyzed the effect of β on the algorithm's performance. Based on the analysis in Sect. 3.1, we believe that β can contribute to the convergence of the strategy. Here we conducted an experiment to verify the effect of β by running ECFR with and without β. To better reflect fairness, we set the number of iterations to 10,000. The results are shown in Fig. 4. In the Kuhn poker game, we found that the performance of 'with β' completely exceeds that of 'without β' after the first 100 iterations. In the Leduc poker game, although 'without β' performs better in the first 750 iterations, the performance of 'with β' gradually exceeds it after that point. The experimental results fully verify that the setting of the parameter β is effective in ECFR.

5 CONCLUSION

In this paper we proposed an exponential counterfactual regret minimization algorithm named ECFR. It can be used to build an approximate Nash equilibrium strategy for an extensive-form imperfect information game. We introduce an exponential weighting technique for the regret in the process of iteration, and provide a detailed theoretical proof of convergence. Extensive experiments were then conducted on three kinds of game. Under the same number of iterations, the exploitability of the strategy obtained by our method is the lowest. The results demonstrate that our method not only has good convergence, but also converges faster than the current state-of-the-art methods.

REFERENCES

[1] D. Fudenberg and D. K. Levine, "The theory of learning in games," MIT Press Books, vol. 1, 1998.
[2] R. B. Myerson, Game Theory: Analysis of Conflict. Harvard University Press, 1997.
[3] J. Nash, "Non-cooperative games," Annals of Mathematics, pp. 286–295, 1951.
[4] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione, "Regret minimization in games with incomplete information," in Advances in Neural Information Processing Systems, 2008, pp. 1729–1736.
[5] M. Lanctot, K. Waugh, M. Zinkevich, and M. Bowling, "Monte Carlo sampling for regret minimization in extensive games," in Advances in Neural Information Processing Systems, 2009, pp. 1078–1086.
[6] M. Bowling, N. Burch, M. Johanson, and O. Tammelin, "Heads-up limit hold'em poker is solved," Science, vol. 347, no. 6218, pp. 145–149, 2015.
[7] M. Johanson, N. Bard, M. Lanctot, R. Gibson, and M. Bowling, "Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization," in Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 2012, pp. 837–846.
[8] N. Brown and T. Sandholm, "Solving imperfect-information games via discounted regret minimization," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1829–1836.
[9] H. Li, K. Hu, Z. Ge, T. Jiang, Y. Qi, and L. Song, "Double neural counterfactual regret minimization," arXiv preprint arXiv:1812.10607, 2018.
[10] N. Brown, A. Lerer, S. Gross, and T. Sandholm, "Deep counterfactual regret minimization," in International Conference on Machine Learning, 2019, pp. 793–802.
[11] M. Schmid, N. Burch, M. Lanctot, M. Moravcik, R. Kadlec, and M. Bowling, "Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 2157–2164.
[12] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, p. 1733, 2017.
[13] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling, "DeepStack: Expert-level artificial intelligence in heads-up no-limit poker," Science, vol. 356, no. 6337, pp. 508–513, 2017.
[14] N. Brown and T. Sandholm, "Superhuman AI for multiplayer poker," Science, vol. 365, no. 6456, pp. 885–890, 2019.
[15] N. Brown and T. Sandholm, "Safe and nested solving for imperfect-information games," in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 295–303.
[16] M. J. Osborne and A. Rubinstein, A Course in Game Theory. The MIT Press, 1994.
[17] N. Brown, C. Kroer, and T. Sandholm, "Dynamic thresholding and pruning for regret minimization," in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 421–429.
[18] K. Waugh, "Abstraction in large extensive games," Master's thesis, University of Alberta, 2009.
[19] S. Hart and A. Mas-Colell, "A simple adaptive procedure leading to correlated equilibrium," Econometrica, 2000.
[20] N. Brown and T. Sandholm, "Regret transfer and parameter optimization," in Proceedings of the AAAI Conference on Artificial Intelligence, 2014.
[21] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.

[Fig. 3 plots: exploitability (Y-axis) versus number of iterations (X-axis) on Kuhn and Leduc for (a) multiple groups of different settings of β and (b) fine-tuned settings of β; see the caption below.]

Fig. 3. Ablation study on the different parameter settings. (The Y-axis represents the exploitability, and the X-axis represents the number of iterations. The smaller the exploitability, the better. For the parameter settings, we conducted two groups of experiments: first, we selected a group of optimal settings from several groups of parameters, and then further refined the selected values.)

[22] O. Tammelin, N. Burch, M. Johanson, and M. Bowling, "Solving heads-up limit Texas hold'em," in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[Fig. 4 plots: exploitability (Y-axis) versus number of iterations (X-axis) with and without β, tested on (a) Kuhn and (b) Leduc; see the caption below.]

Fig. 4. Ablation study with and without the parameter β. (The Y-axis represents the exploitability, and the X-axis represents the number of iterations. The smaller the exploitability, the better.)