Solving Imperfect-Information Games Via Exponential Counterfactual Regret Minimization
Huale Li, Xuan Wang, Shuhan Qi, Jiajia Zhang, Yang Liu, Yulin Wu, Fengwei Jia
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China, 518055. E-mail: [email protected]

Abstract—In general, two-agent decision-making problems can be modeled as a two-player game, and a typical solution is to find a Nash equilibrium of such a game. Counterfactual regret minimization (CFR) is a well-known method for finding a Nash equilibrium strategy in a two-player zero-sum game with imperfect information. The CFR method applies a regret-matching algorithm iteratively to reduce regret values progressively, enabling the average strategy to approach a Nash equilibrium. Although CFR-based methods have achieved significant success in the field of imperfect-information games, there is still scope for improving their convergence efficiency. To address this challenge, we propose a novel CFR-based method named exponential counterfactual regret minimization (ECFR). ECFR uses an exponential weighting technique to reweight the instantaneous regret values during iteration. A theoretical proof is provided to guarantee the convergence of the ECFR algorithm, and the results of an extensive set of experiments demonstrate that ECFR converges faster than current state-of-the-art CFR-based methods.

Index Terms—Decision-making, Counterfactual regret minimization, Nash equilibrium, Zero-sum games, Imperfect information.

1 INTRODUCTION

Game theory has often been regarded as a touchstone for verifying theories of artificial intelligence [1]. In general, games can be divided into perfect-information games (PIGs) and imperfect-information games (IIGs), according to whether each player has complete knowledge of the game state of all the other players. Examples of PIGs are Go and Chess, whereas games like poker are IIGs, in which players cannot observe the complete game state of the other players. In other words, compared with PIGs, IIGs usually involve private information. IIGs have recently attracted a great deal of attention from researchers, since they are much more challenging than PIGs. In this paper, we mainly focus on decision-making for IIGs [2].

To solve a two-player zero-sum game with imperfect information, a typical approach is to find a Nash equilibrium strategy [3]. As a popular method for computing a Nash equilibrium strategy, counterfactual regret minimization (CFR) [4] has attracted the attention of many researchers due to its sound theoretical guarantee of convergence. Over the past decade, many variants of CFR have been developed [4], [5], [6], [7], [8]. For example, Monte Carlo counterfactual regret minimization (MCCFR) is a sample-based CFR algorithm that combines Monte Carlo methods with standard (vanilla) CFR to compute approximate equilibrium strategies in IIGs [5]. CFR+ uses a variant of regret matching (regret matching+) in which regrets are constrained to be non-negative [6]. Double neural CFR [9] and Deep CFR [10] combine deep neural networks with vanilla CFR and linear CFR (LCFR), respectively. There are also many other CFR variants, such as public chance sampling CFR (PCCFR) [7], variance reduction in MCCFR (VR-MCCFR) [11], and discounted CFR (DCFR) [8].
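All of these variants build on the same core update of vanilla CFR, namely regret matching: at each information set, the next strategy plays each action with probability proportional to its positive cumulative regret [4]. The sketch below is a minimal illustration of that update only; the function name, the NumPy representation, and the uniform fallback are our own choices, not code from the paper.

```python
import numpy as np

def regret_matching(cumulative_regrets):
    """Regret matching: play each action with probability proportional
    to its positive cumulative regret; fall back to the uniform strategy
    when no action has positive regret."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

# Example: cumulative regrets [2, -1, 1] yield the strategy [2/3, 0, 1/3].
print(regret_matching(np.array([2.0, -1.0, 1.0])))
```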
Moreover, CFR-based methods have achieved notable success in the domain of IIGs, especially in poker games [6], [12], [13], [14]. Libratus was the first agent to beat top human players in heads-up no-limit Texas Hold'em poker [12]. DeepStack also defeated professional poker players, and its method has been shown to be sound by a theoretical proof [13]. Pluribus, the latest CFR-based computer program, defeated top poker players in six-player no-limit Texas Hold'em poker, which can be regarded as a milestone in the fields of artificial intelligence and game theory [14].

Although agents based on the CFR method have achieved conspicuous success, they do not directly improve the vanilla CFR method itself; rather, they combine CFR with other techniques (abstraction, neural networks, etc.) to solve for the game strategy. Moreover, vanilla CFR is an iterative strategy-solving method that relies on increasing the number of training iterations to improve the accuracy of the strategy. This means that finding a robust strategy may require a large number of iterations, which leads to long solving times and makes CFR-based methods difficult to apply in some real-world situations. It is therefore worth investigating how to accelerate the convergence of CFR and obtain a robust strategy more efficiently.

Towards this goal, we propose exponential CFR (ECFR), which speeds up CFR by focusing more attention on the actions with greater advantage. The regret value of an action represents the advantage of that action over the expected action of the current strategy. We therefore use the average regret value as a threshold to identify advantageous actions. Actions with different advantages are then weighted differently: actions with higher advantage are given higher weight, and vice versa. In this way, the updated strategy tends to select advantageous actions, and the convergence of CFR is accelerated.

To be specific, we adopt the regret value as the metric for evaluating the advantage of actions. We propose an exponential weighting technique that applies an exponential weight to actions with higher regret. This causes the iterative algorithm to pay more attention to actions with higher regret values, and so accelerates training. In addition, in contrast to traditional methods, actions with negative regret values are still considered, instead of having their probabilities set to zero.
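The paragraphs above describe the exponential weighting idea only in words; the exact update rule is given later, in Sect. 3. The following sketch is merely one illustrative reading of that description, assuming a weight of the form exp(r - avg) centered on the average regret; the function name and the weight form are our assumptions, not ECFR's actual formula.

```python
import numpy as np

def exponential_reweight(instant_regrets):
    """Illustrative sketch in the spirit of ECFR (not the paper's exact
    rule): regrets above the average receive exponentially larger weights,
    while below-average and negative regrets are damped rather than zeroed."""
    avg = instant_regrets.mean()              # average regret as the threshold
    weights = np.exp(instant_regrets - avg)   # assumed exponential weight form
    return weights * instant_regrets          # reweighted instantaneous regrets

# Example: the highest-regret action is amplified relative to the others.
print(exponential_reweight(np.array([3.0, 0.5, -1.0])))
```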
We summarize our contributions as follows:

1) We present an exponential weighting technique applied to vanilla CFR, called exponential CFR (ECFR), which makes the convergence of CFR more efficient.
2) We give a proof of convergence for ECFR, which, like CFR, provides it with a solid theoretical basis.
3) Three different games (Kuhn poker, Leduc poker, and Royal poker) are used to evaluate ECFR. Extensive experimental results show that ECFR converges faster than current state-of-the-art CFR-based methods in all three games.

The rest of the paper is organized as follows. In Sect. 2 we introduce the extensive-form game model and describe Nash equilibrium and CFR. Our proposed method is described in Sect. 3, including the details of ECFR and its proof of convergence. In Sect. 4, we evaluate the performance of ECFR on four varieties of IIG. Finally, Sect. 5 presents the conclusions of the paper.

2 NOTATION AND PRELIMINARIES

In this section, some notation and definitions for extensive-form games are introduced. Then the concept of Nash equilibrium is described. Finally, an overview of CFR is provided.

2.1 Extensive-Form Game Model

The extensive-form game is a classical model of sequential decision-making. It is particularly useful in the field of IIGs, in which some decision-makers hold private information unknown to the others. Before presenting the formal definition, we consider a specific example, the game of Coin Toss. Fig. 1 shows a game tree for Coin Toss.

Fig. 1. The game tree of the Coin Toss game. ("C" represents a chance node. P1 and P2 are the game players. A coin is flipped and lands either Heads or Tails with equal probability, but only player P1 can see the outcome. The information set of P2 is the dotted line between the two P2 nodes, which means that player P2 cannot distinguish between the two states.)

As shown in Fig. 1, each node represents a game state within the game tree. A leaf node, also known as a terminal node, indicates that the game has ended, and the corresponding payoff is returned when the game ends. The edge between two nodes represents an action or decision taken by a game player.

Formally, an extensive-form game consists of the following components. $H$ is a limited set of sequences that represent the possible historical actions. $P$ is the player function: $P(h)$ is the player who takes an action after history $h$, and if $P(h) = c$ then chance determines the action after history $h$. $f_c$ is the chance function, which associates with every history $h$ for which $P(h) = c$ a probability distribution over actions; $f_c(a \mid h)$ is the probability that action $a$ occurs given history $h$. $I$ is the information set: for any information set $I_i$ belonging to player $i$, all nodes $h, h' \in I_i$ are indistinguishable to player $i$. $u$ is a utility function defined on the terminal states. Owing to the number of symbols in this paper, we give a brief list of variables for further reference in Table 1.

TABLE 1
Description of variables

Variable          Meaning
$u$               the utility function, where $u_i$ is the utility of player $i$
$H$               a limited set of sequences, the set of possible historical actions
$\sigma$          the strategy profile, where $\sigma_i$ is the strategy of player $i$ and $\sigma_{-i}$ is the strategy of the other player
$I_i$             the information set of player $i$
$\pi^\sigma(h)$   the joint probability of reaching $h$ if all players play according to $\sigma$; $\pi_i^\sigma(h)$ is the probability of reaching $h$ if player $i$ plays according to $\sigma$

2.2 Nash Equilibrium