Computer Poker and Imperfect Information: Papers from the AAAI-14 Workshop

Self-Play Monte-Carlo Tree Search in Computer Poker

Johannes Heinrich and David Silver
University College London, Gower Street, London, UK
{j.heinrich, d.silver}@cs.ucl.ac.uk

Abstract

Self-play reinforcement learning has proved to be successful in many two-player games. However, research carrying over its theoretical guarantees and practical success to games of imperfect information has been lacking. In this paper, we evaluate self-play Monte-Carlo Tree Search (MCTS) in limit Texas Hold'em and Kuhn poker. We introduce a variant of the established UCB algorithm and provide first empirical results demonstrating its ability to find approximate Nash equilibria.

Introduction

Reinforcement learning has traditionally focused on stationary single-agent environments. Its applicability to fully observable multi-agent Markov games has been explored by (Littman 1996). Backgammon and computer Go are two examples of fully observable two-player games where reinforcement learning methods have achieved outstanding performance (Tesauro 1992; Gelly et al. 2012). Computer poker provides a diversity of stochastic imperfect information games of different sizes and has proved to be a fruitful research domain for game theory and artificial intelligence (Sandholm 2010; Rubin and Watson 2011). It is therefore an ideal research subject for self-play reinforcement learning in partially observable multi-agent domains.

Game-theoretic approaches have played a dominant role in furthering algorithmic performance in computer poker. The game is usually formulated as an extensive-form game with imperfect information, and most attention has been directed towards the two-player Texas Hold'em variant. Abstracting the game (Billings et al. 2003; Gilpin and Sandholm 2006), by combining similar nodes, reduces the size of the game tree and has allowed various approaches to compute approximate Nash equilibria. Common game-theoretic methods in computer poker are based on linear programming (Billings et al. 2003; Gilpin and Sandholm 2006), non-smooth convex optimization (Hoda et al. 2010), fictitious play (Ganzfried and Sandholm 2008) or counterfactual regret minimization (CFR) (Zinkevich et al. 2007). Except for Monte-Carlo counterfactual regret minimization (MCCFR) (Lanctot et al. 2009), these are generally full-width methods and are therefore particularly exposed to the curse of dimensionality. This might prevent them from scaling to larger games, e.g. multi-player poker. In addition, some methods are undefined for games with more than two players, e.g. the excessive gap technique (Hoda et al. 2010), while other approaches lose their theoretical guarantees, e.g. CFR. In spite of this, CFR has achieved strong performance in three-player limit Texas Hold'em (Risk and Szafron 2010). The MCTS methods discussed in this paper are well defined for any number of players but do not yet have any established theoretical convergence guarantees, even for two-player zero-sum imperfect information games.

MCTS is a simulation-based search algorithm that has been successful in high-dimensional domains, e.g. computer Go (Gelly et al. 2012). Selective sampling of trajectories of the game enables it to prioritise the most promising regions of the search space. Furthermore, it only requires a black box simulator and might therefore plan in complex environments, e.g. with approximate generative models of real world applications. Finally, it is built on principles of reinforcement learning and can therefore leverage its well-developed machinery. For example, function approximation and bootstrapping are two ideas that can dramatically improve the efficiency of learning algorithms (Tesauro 1992; Silver, Sutton, and Müller 2012).

(Ponsen, de Jong, and Lanctot 2011) compared the quality of policies found by MCTS and MCCFR in computer poker. They concluded that MCTS quickly finds a good but suboptimal policy, while MCCFR initially learns more slowly but converges to the optimal policy over time.

In this paper, we extend the evaluation of MCTS-based methods in computer poker. We introduce a variant of UCT, Smooth UCT, that combines MCTS with elements of fictitious play, and provide first empirical results of its convergence to Nash equilibria in Kuhn poker. This algorithm might address the inability of UCT to converge close to a Nash equilibrium over time, while retaining UCT's fast initial learning rate. Both UCT and Smooth UCT achieve high performance against competitive approximate Nash equilibria in limit Texas Hold'em. This demonstrates that strong policies can be learned from UCT-based self-play methods in partially observable environments.

Background

Extensive-Form Games

Extensive-form games are a rich model of multi-agent interaction. The representation is based on a game tree. Introducing partitions of the tree that constrain the players' information about the exact position in the game tree makes it possible to add versatile structure to the game, e.g. simultaneous moves and partial observability. Formally, based on the definition of (Selten 1975),

Definition 1. An extensive-form game with imperfect information consists of
• A set of players N = {1, ..., n} and a chance player c.
• A set of states S corresponding to nodes in a rooted game tree with initial state s_0. S_Z ⊂ S is the set of terminal states.
• A partition of the state space S into player sets P^i, i ∈ N ∪ {c}. Each state s ∈ P^i represents a decision node of player i, and A(s) is the set of actions available to this player in s. In particular, P^c is the set of states at which chance determines the successor state.
• For each player, a set of information states O^i and a surjective information function I^i that assigns each state s ∈ P^i to an information state o ∈ O^i. For any information state o ∈ O^i, the information set (I^i)^{-1}(o) is the set of states s ∈ P^i that are indistinguishable for player i.
• A chance function T^c(s, a) = P(a^c_t = a | s_t = s), s ∈ P^c, that determines chance events at chance nodes. Transitions at players' decision nodes are described by their policies.
• For each player, a reward function R^i that maps terminal states to payoffs.

For each player i ∈ N, the sequence of his information states and own actions in an episode forms a history h^i_t = {o^i_1, a^i_1, o^i_2, a^i_2, ..., o^i_t}. A game has perfect recall if each player's current information state o^i_t implies knowledge of his whole history h^i_t of the episode. If a player only knows some strict subset of his history h^i_t, then the game has imperfect recall.

The following definition is similar to (Waugh et al. 2009) but makes use of information functions.

Definition 2. Let Γ be an extensive-form game. An abstraction of Γ is a collection of surjective functions f^i_A : O^i → Õ^i that map the information states of Γ to some alternative information state spaces Õ^i. By composing the abstraction with the information functions I^i of Γ, we obtain alternative information functions, Ĩ^i = f^i_A ∘ I^i, that induce an abstracted extensive-form game.

Mapping multiple information states to a single alternative information state results in the affected player not being able to distinguish between the mapped states.

Each player's behavioural strategy is determined by his policy π^i(o, a) = P(a^i_t = a | o^i_t = o), which is a probability distribution over actions given an information state, and Δ^i is the set of all policies of player i. A policy profile π = (π^1, ..., π^n) is a collection of policies for all players. π^{-i} refers to all policies in π except π^i. R^i(π) is the expected reward of player i if all players follow the policy profile π. The set of best responses of player i to his opponents' policies π^{-i} is

    b^i(π^{-i}) = argmax_{π^i ∈ Δ^i} R^i(π^i, π^{-i}).

For ε > 0,

    b^i_ε(π^{-i}) = { π^i ∈ Δ^i : R^i(π^i, π^{-i}) ≥ R^i(b^i(π^{-i}), π^{-i}) − ε }

defines the set of ε-best responses to the policy profile π^{-i}.

Definition 3. A Nash equilibrium of an extensive-form game is a policy profile π such that π^i ∈ b^i(π^{-i}) for all i ∈ N. An ε-Nash equilibrium is a policy profile π such that π^i ∈ b^i_ε(π^{-i}) for all i ∈ N.
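The algorithm sketches later in this section access a game only through a small interface that mirrors the components of Definition 1 and the black box simulators introduced below. The following Python protocol is our own illustrative assumption, not part of the paper's formalism; a concrete game such as Kuhn poker would implement it, and chance nodes are folded into the successor sampling for simplicity.

```python
from typing import Any, Dict, Hashable, List, Protocol

class ExtensiveFormGame(Protocol):
    """Minimal game interface mirroring Definition 1: players, states, actions,
    transition/chance sampling, terminal payoffs, and information functions."""

    def players(self) -> List[int]: ...                        # N = {1, ..., n}
    def initial_state(self) -> Any: ...                        # s_0
    def is_terminal(self, state: Any) -> bool: ...             # s in S_Z?
    def current_player(self, state: Any) -> int: ...           # i with s in P^i
    def actions(self, state: Any) -> List[Any]: ...            # A(s)
    def sample_successor(self, state: Any, action: Any) -> Any: ...   # transition simulator G (chance included)
    def payoffs(self, state: Any) -> Dict[int, float]: ...     # reward simulator R: player -> payoff
    def info_state(self, state: Any, player: int) -> Hashable: ...    # information function I^i(s)
```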
MCTS

MCTS (Coulom 2006) is a simulation-based search algorithm. It is able to plan in high-dimensional environments by sampling episodes through Monte-Carlo simulation. These simulations are guided by an action selection mechanism that explores the most promising regions of the state space. This guided search results in efficient, asymmetric search trees. MCTS converges to optimal policies in fully observable Markovian environments.

An MCTS algorithm requires the following components. Given a state and action, a black box simulator of the game samples a successor state and reward. A learning algorithm uses simulated trajectories and outcomes to update some statistics of the visited nodes in the search tree. A tree policy is defined by an action selection mechanism that chooses actions based on a node's statistics. A rollout policy determines default behaviour for states that are out of the scope of the search tree.

For a specified amount of planning time the algorithm repeats the following. It starts each Monte-Carlo simulation at the root node and follows its tree policy until either reaching a terminal state or the boundary of the search tree. Once the simulation leaves the scope of the search tree, the rollout policy is used to play out the simulation until reaching a terminal state. In this case, we expand our tree by a state node at the point where we left the tree. This approach selectively grows the tree in areas that are frequently encountered in simulations. After reaching a terminal state, the rewards are propagated back so that each visited state node can update its statistics.

Common MCTS keeps track of the following node values. N(s) is the number of visits by a Monte-Carlo simulation to node s. N(s, a) counts the number of times action a has been chosen at node s. Q(s, a) is the estimated value of choosing action a at node s. The action value estimates are usually updated by Monte-Carlo evaluation, Q(s, a) = (1 / N(s, a)) Σ_{i=1}^{N(s,a)} R_s(i), where R_s(i) is the cumulative discounted reward achieved after visiting state s in the i-th Monte-Carlo simulation that has encountered s.
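A minimal sketch of these node statistics in Python, maintaining Q(s, a) as an incremental running mean of observed returns; the class name and structure are our own illustration, not the paper's code.

```python
from collections import defaultdict

class MCTSNode:
    """Node statistics tracked by common MCTS: N(s), N(s, a) and Q(s, a)."""

    def __init__(self):
        self.n = 0                        # N(s): number of simulations that visited this node
        self.n_a = defaultdict(int)       # N(s, a): times action a was chosen here
        self.q = defaultdict(float)       # Q(s, a): Monte-Carlo estimate of a's value here

    def update(self, action, reward):
        """Incremental Monte-Carlo evaluation: keeps Q(s, a) equal to the mean
        of the returns observed after choosing `action` at this node."""
        self.n += 1
        self.n_a[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n_a[action]
```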

(Kocsis and Szepesvári 2006) suggested using the bandit-based algorithm UCB (Auer, Cesa-Bianchi, and Fischer 2002) to select actions in the Monte-Carlo search tree. The resulting MCTS method, Upper Confidence Trees (UCT), selects greedily between action values that have been enhanced by an exploration bonus,

    π_tree(s) = argmax_a [ Q(s, a) + c √( log N(s) / N(s, a) ) ].    (1)

The exploration bonus parameter c adjusts the balance between exploration and exploitation. For suitable c the probability of choosing a suboptimal action has been shown to converge to 0 (Kocsis and Szepesvári 2006). The appropriate scale of c depends on the size of the rewards.

Other MCTS methods have been proposed in the literature. (Silver and Veness 2010) adapt MCTS to partially observable Markov decision processes (POMDPs) and prove convergence given a true initial belief state. (Auger 2011) and (Cowling, Powley, and Whitehouse 2012) study MCTS in imperfect information games and use the bandit method EXP3 (Auer et al. 1995) for action selection.

Self-Play MCTS in Extensive-Form Games

Our goal is to use MCTS-based planning to learn an optimal policy profile of an extensive-form game with imperfect information.

Planning in a multi-player game requires simulation of players' behaviour. One approach might be to assume explicit models of player behaviour and then use a planning algorithm to learn a best response against these models. However, the quality of the policy profile learned with this approach will depend on the assumed player models. In particular, to learn an optimal policy profile, we would most likely need a player model that produces close to optimal behaviour in the first place. We therefore pursue an alternative approach that does not require this type of prior knowledge. Self-play planning does not assume any explicit models of player behaviour. Instead, it samples players' actions from the current policy profile suggested by the planning process. E.g. self-play MCTS samples player behaviour from the players' tree policies.

The asymmetry of information in an extensive-form game with imperfect information generally does not allow us to span a single collective search tree. We therefore describe a general algorithm that uses a separate search tree for each player. For each player we grow a tree T^i over his information states O^i. T^i(o^i) is the node in player i's tree that represents his information state o^i. In a game with perfect recall, a player remembers his sequence of previous information states and actions h^i_t = {o^i_1, a^i_1, o^i_2, a^i_2, ..., o^i_t}. By definition of an extensive-form game, this knowledge is also represented in the information state o^i_t. Therefore, T^i is a proper tree. An imperfect recall extensive-form game yields a recombining tree.

Algorithm 1 describes MCTS adapted to the multi-player, imperfect information setting of extensive-form games. Transitions and rewards are sampled from simulators G and R. The transition simulator takes a state and action as inputs and generates a sample of a successor state, s_{t+1} ~ G(s_t, a^i_t), where the action a^i_t belongs to the player i who makes decisions at state s_t. The reward simulator generates all players' payoffs at terminal states, i.e. r_T ~ R(s_T). The information function, I^i(s), determines the acting player i's information state. OUT-OF-TREE keeps track of which player has left the scope of his search tree in the current episode.

The basic idea of searching trees over information states is similar to (Auger 2011; Cowling, Powley, and Whitehouse 2012). However, algorithm 1 defines a more general class of MCTS-based algorithms that can be specified by their action selection and node updating functions. These functions are responsible for sampling from and updating the tree policy. In this work, we focus on UCT-based methods.

Algorithm 1 Self-play MCTS in extensive-form games

function SEARCH()
    while within computational budget do
        s_0 ~ initial state
        SIMULATE(s_0)
    end while
    return π_tree
end function

function ROLLOUT(s)
    a ~ π_rollout(s)
    s' ~ G(s, a)
    return SIMULATE(s')
end function

function SIMULATE(s)
    if ISTERMINAL(s) then
        return r ~ R(s)
    end if
    i = PLAYER(s)
    if OUT-OF-TREE(i) then
        return ROLLOUT(s)
    end if
    o^i = I^i(s)
    if o^i ∉ T^i then
        EXPANDTREE(T^i, o^i)
        a ~ π_rollout(s)
        OUT-OF-TREE(i) ← true
    else
        a = SELECT(T^i(o^i))
    end if
    s' ~ G(s, a)
    r ← SIMULATE(s')
    UPDATE(T^i(o^i), a, r^i)
    return r
end function
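To make the control flow of Algorithm 1 concrete, here is a hedged Python sketch of the SEARCH loop and the SIMULATE recursion, using the UCB selection of equation (1) as the tree policy and reusing the MCTSNode class and game interface sketched above. The helper names and the uniform rollout policy are our own assumptions; this is an illustration under those assumptions, not the authors' implementation.

```python
import math
import random

def ucb_select(node, actions, c):
    """Equation (1): argmax_a Q(s, a) + c * sqrt(log N(s) / N(s, a)).
    Untried actions get infinite priority so each is sampled at least once."""
    def score(a):
        if node.n_a[a] == 0:
            return float("inf")
        return node.q[a] + c * math.sqrt(math.log(node.n) / node.n_a[a])
    return max(actions, key=score)

def random_rollout(game, state):
    """Default rollout policy: uniform random action (an assumption of this sketch)."""
    return random.choice(game.actions(state))

def simulate(game, state, trees, out_of_tree, c, rollout_policy=random_rollout):
    """One episode of Algorithm 1. `trees[i]` maps player i's information
    states to MCTSNode statistics, i.e. a separate search tree per player."""
    if game.is_terminal(state):
        return game.payoffs(state)                           # reward simulator R
    i = game.current_player(state)
    if out_of_tree[i]:                                       # rollout phase for player i
        a = rollout_policy(game, state)
        return simulate(game, game.sample_successor(state, a),
                        trees, out_of_tree, c, rollout_policy)
    o = game.info_state(state, i)                            # information function I^i(s)
    if o not in trees[i]:
        trees[i][o] = MCTSNode()                             # expand the tree where we left it
        a = rollout_policy(game, state)
        out_of_tree[i] = True
    else:
        a = ucb_select(trees[i][o], game.actions(state), c)  # tree policy (SELECT)
    rewards = simulate(game, game.sample_successor(state, a),     # transition simulator G
                       trees, out_of_tree, c, rollout_policy)
    trees[i][o].update(a, rewards[i])                        # back up player i's payoff (UPDATE)
    return rewards

def search(game, num_episodes, c=2.0):
    """Outer SEARCH loop: repeated self-play simulations from the initial state."""
    trees = {i: {} for i in game.players()}
    for _ in range(num_episodes):
        out_of_tree = {i: False for i in game.players()}
        simulate(game, game.initial_state(), trees, out_of_tree, c)
    return trees
```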
Information State UCT (IS-UCT)

IS-UCT uses UCB to select from and update the tree policy in algorithm 1. It can be seen as a multi-player version of Partially Observable UCT (PO-UCT) (Silver and Veness 2010), which searches trees over histories of information states instead of histories of observations and actions of a POMDP. In fact, assuming fixed opposing players' policies, a player is effectively acting in a POMDP. However, in self-play planning, players' policies might never become stationary and keep changing indefinitely.

Indeed, (Ponsen, de Jong, and Lanctot 2011) empirically show the inability of UCT to converge to a Nash equilibrium in Kuhn poker. However, it seems to unlearn dominated actions and therefore might yield decent but potentially exploitable performance in practice.

Smooth IS-UCT

To address the lack of convergence guarantees of IS-UCT in partially observable self-play environments, we introduce an alternative tree policy algorithm. It has similarities to generalised weakened fictitious play (Leslie and Collins 2006), for which convergence results have been established for some types of games, e.g. two-player zero-sum games.

Definition 4. A Smooth UCB process is defined by the following sequence of policies:

    π^i_t = (1 − η_t) π̄^i_{t−1} + η_t UCB^i_t,
    π̄^i_t = π̄^i_{t−1} + (1/t) (a^i_t − π̄^i_{t−1})    (2)

for some time-t-adapted sequence η_t ∈ [c, 1], c > 0. π_t is the behavioural policy profile that is used for planning at simulation t. Players' actions, a^i_t, i ∈ N, are sampled from this policy. π̄_t is the average policy profile at time t.

UCB plays a greedy best response towards its optimistic action value estimates. Smooth UCB plays a smooth best response instead. If UCB were an ε-best response to the opponents' average policy profile, π̄^{-i}_t, with ε → 0 as t → ∞, then the Smooth UCB process would be a generalised weakened fictitious play. For example, this would be the case if the action value estimates were generated from unbiased samples of the average policy. However, this is clearly not what happens in self-play, where at each time players are trying to exploit each other rather than playing their average policy. Intuitively, mixing the current UCB policy with the average policy might help with this policy evaluation problem.

Algorithm 2 instantiates general self-play MCTS with a Smooth UCB tree policy. For a constant η_t = 1, we obtain IS-UCT as a special case. We propose two schedules to set η_t in the experiments section. Note that for a cheaply determined η_t the update and selection steps of Smooth UCB do not have any overhead compared to UCB.

Algorithm 2 Smooth IS-UCT

SEARCH(), SIMULATE(s) and ROLLOUT(s) as in algorithm 1

function SELECT(T^i(o^i))
    u ~ U[0, 1]
    if u < η_t then
        return argmax_a [ Q(o^i, a) + c √( log N(o^i) / N(o^i, a) ) ]
    else
        ∀ a ∈ A(o^i): p(a) ← N(o^i, a) / N(o^i)
        return a ~ p
    end if
end function

function UPDATE(T^i(o^i), a, r^i)
    N(o^i) ← N(o^i) + 1
    N(o^i, a) ← N(o^i, a) + 1
    Q(o^i, a) ← Q(o^i, a) + ( r^i − Q(o^i, a) ) / N(o^i, a)
end function
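A Python sketch of the Smooth UCB SELECT step of Algorithm 2: with probability η_t it acts like UCB, otherwise it samples from the average policy implied by the visit counts, p(a) = N(o, a)/N(o). The η schedule shown approximates mixing schedule (3) from the experiments section by using the same count-based average policy; the function names and the zero-visit fallbacks are assumptions of this sketch. With η fixed at 1 it reduces to the plain UCB selection of IS-UCT, matching the special case noted above.

```python
import math
import random

def smooth_ucb_select(node, actions, c, eta):
    """Smooth UCB tree policy: UCB with probability eta, average policy otherwise."""
    if random.random() < eta:
        # UCB branch: greedy in the exploration-enhanced action values (equation (1))
        def score(a):
            if node.n_a[a] == 0:
                return float("inf")
            return node.q[a] + c * math.sqrt(math.log(node.n) / node.n_a[a])
        return max(actions, key=score)
    # average-policy branch: sample a in proportion to N(o, a)
    weights = [node.n_a[a] for a in actions]
    if sum(weights) == 0:
        return random.choice(actions)      # unvisited node: fall back to uniform (assumption)
    return random.choices(actions, weights=weights, k=1)[0]

def mixing_schedule_3(node, actions):
    """Schedule (3): eta_t(o) = max_a of the (count-based) average policy at o."""
    if node.n == 0:
        return 1.0                         # behave like plain UCB before any visits (assumption)
    return max(node.n_a[a] for a in actions) / node.n
```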
We are currently working on developing the theory of this algorithm. Our experiments on Kuhn poker provide first empirical evidence of the algorithm's ability to find approximate Nash equilibria.

Experiments

We evaluate IS-UCT and Smooth IS-UCT in Kuhn and Texas Hold'em poker games.

Kuhn Poker

Kuhn poker (Kuhn 1950) is a small extensive-form game with imperfect information. It features typical elements of poker, e.g. balancing betting frequencies for different types of holdings. However, it does not include community cards and multi-street play. As there exist closed-form game-theoretic solutions, it is well suited for a first overview of the performance and calibration of different methods.

(Ponsen, de Jong, and Lanctot 2011) tested UCT against MCCFR in Kuhn poker. We conducted a similar experiment with Smooth IS-UCT. We calibrated the exploration parameter for both IS-UCT and Smooth IS-UCT, and set it to 2 and 1.75 respectively. We tested two choices of Smooth UCB's mixing parameter sequence η_t:

    η^i_t(o^i) = max_a π̄^i_{t−1}(o^i, a)    (3)

    η^i_t(o^i) = 0.9 · 10000 / ( 10000 + N^i_{t−1}(o^i) )    (4)

Both worked well in standard Kuhn poker. However, only schedule (3) yielded stable performance across several variants of Kuhn poker with varying initial pot size. All reported results for Smooth IS-UCT on Kuhn poker were generated with schedule (3). The average policies' squared (SQ-ERR) and dominated (DOM-ERR) errors were measured after every 10^4 episodes. The squared error of a policy is its mean-squared error measured against the closest Nash equilibrium. The dominated error of a policy is the sum of the probabilities with which a dominated action is taken at an information state. The results, shown in figures 1 and 2, were averaged over 10 identically repeated runs of the experiment.

In standard Kuhn poker, Smooth IS-UCT performs better both in terms of squared and dominated error. We repeated the experiment for variants of Kuhn poker with higher initial pot sizes. This is supposed to emulate the varying optimal action probabilities that in poker depend on the bet size relative to the pot. Using mixing schedule (3), Smooth IS-UCT was able to converge close to a Nash equilibrium in each variant. For larger pot sizes, Smooth IS-UCT seems to suffer a larger dominated error than IS-UCT.

To conclude, Smooth IS-UCT might focus its self-play search on regions of the policy space that are close to a Nash equilibrium. However, in large games, the number of visits to a node might be too low to provide a stochastic average policy with small dominated error.
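Both error measures can be computed directly from an average policy. The sketch below assumes policies are dictionaries mapping information states to action-probability dictionaries, and that the caller supplies the closest Nash equilibrium and the dominated actions per information state; these inputs and names are illustrative assumptions, not artefacts of the paper.

```python
def squared_error(policy, nash_policy):
    """SQ-ERR: mean-squared error of `policy` against a closest Nash equilibrium,
    averaged over all information-state/action probabilities."""
    diffs = [(policy.get(o, {}).get(a, 0.0) - p_star) ** 2
             for o, probs in nash_policy.items()
             for a, p_star in probs.items()]
    return sum(diffs) / len(diffs)

def dominated_error(policy, dominated_actions):
    """DOM-ERR: sum over information states of the probability with which a
    dominated action is taken. `dominated_actions` maps each information state
    to its set of dominated actions."""
    return sum(policy.get(o, {}).get(a, 0.0)
               for o, actions in dominated_actions.items()
               for a in actions)
```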

Figure 1: Learning curve of IS-UCT and Smooth IS-UCT in Kuhn poker. The x-axis denotes the number of simulated episodes. The y-axis denotes the quality of a policy in terms of its dominated and squared error measured against a closest Nash equilibrium.

Figure 2: Learning curve of IS-UCT and Smooth IS-UCT in a Kuhn poker variant with an initial pot size of 8.

Limit Texas Hold'em

We consider limit Texas Hold'em with two and three players. The two-player variant's game tree contains about 10^18 nodes. In order to reduce the game tree to a tractable size, various abstraction techniques have been proposed in the literature (Billings et al. 2003; Gilpin and Sandholm 2006; Johanson et al. 2013). The most common approach is to group information states according to strategic similarity of the corresponding cards and leave the players' action sequences unabstracted.

A player's information state can be described as a tuple o^i_t = (ā_t, c_t, p^i_t), where ā_t is the sequence of all players' actions except chance, p^i_t are the two private cards held by player i and c_t is the sequence of community cards publicly revealed by chance by time t. In Texas Hold'em only the private cards constitute hidden information, i.e. each player observes everything except his opponents' private cards.

Let g_A(c, p^i) ∈ ℕ be a function that maps tuples of public community cards and private cards to an abstraction bucket. The function f_A(ā, c, p^i) = (ā, g_A(c, p^i)) defines an abstraction that maps information states O = ∪_{i∈N} O^i to alternative information states Õ = ∪_{i∈N} Õ^i of an abstracted game. An abstracted information state is a tuple (ā, n), where ā is the unabstracted sequence of all players' actions except chance and n ∈ ℕ identifies the abstraction bucket.

In this work, we use a metric called expected hand strength squared (Zinkevich et al. 2007). At a final betting round and given the corresponding set of community cards, the hand strength of a holding is defined as its winning percentage against all possible other holdings. On any betting round, E[HS²] denotes the expected value of the squared hand strength on the final betting round. We have discretised the resulting E[HS²] values with an equidistant grid over their range, [0, 1]. Table 1 shows the grid sizes that we used in our experiments. We used an imperfect recall abstraction that does not let a player remember his E[HS²] values of previous betting rounds, i.e. g_A(c, p^i) is set to the thresholded E[HS²] value of p^i given c.

    Game size    Preflop    Flop    Turn    River
    2p S         169        100     100     100
    2p L         169        1000    500     200
    3p           169        1000    100     10

Table 1: Abstraction bucket discretisation grids used in our two- and three-player Texas Hold'em experiments.

In limit Texas Hold'em, there is an upper bound on the possible terminal pot size based on the betting that has occurred so far in the episode. This is because betting is capped at each betting round. To avoid unnecessary exploration, we update the exploration parameter for the current episode at the beginning of each betting round and set it to

    c_t = pot size + k · remaining betting potential,    (5)

where k ∈ [0, 1] is a constant parameter and the remaining betting potential is the maximum possible amount that players can add to the pot in the remainder of the episode.

Each two-player policy was trained for about 7 billion episodes, which took about 30 hours on a modern computer without using parallelization. Each three-player agent was trained for about 12 billion episodes, requiring about 48 hours of training time. After training, the policies were frozen to return the greedy, i.e. highest value, action at each information state. Performance is measured in milli-big-blinds per hand, mb/h.
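As a small illustration of the card abstraction and the exploration schedule of equation (5), the following Python sketch maps an E[HS²] value onto an equidistant bucket grid and recomputes the exploration parameter at the start of a betting round; the function names and argument conventions are our own assumptions.

```python
def ehs2_bucket(ehs2: float, num_buckets: int) -> int:
    """Map an E[HS^2] value in [0, 1] to a bucket on an equidistant grid,
    as used for the abstraction g_A(c, p^i) with the grid sizes of Table 1."""
    return min(int(ehs2 * num_buckets), num_buckets - 1)

def exploration_parameter(pot_size: float, remaining_betting_potential: float,
                          k: float = 0.5) -> float:
    """Equation (5): c_t = pot size + k * remaining betting potential,
    recomputed at the beginning of each betting round."""
    return pot_size + k * remaining_betting_potential
```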

Two-player  We evaluated the performance of the trained policies against benchmark opponents in two-player limit Texas Hold'em. Both opponents, Sparbot (Billings et al. 2003) and FellOmen2, play approximate Nash equilibrium strategies. Sparbot is available in the software Poker Academy Pro. FellOmen2 is a publicly available poker bot that was developed by Ian Fellows. It plays an approximate Nash equilibrium that was learned from fictitious play. It placed second, tying three-way, in the AAAI-08 Computer Poker Competition.

All two-player MCTS policies were trained using exploration schedule (5), with parameter k given in the table. The Smooth IS-UCT agents used the same mixing schedule (3) as in Kuhn poker. The abstraction sizes are shown in table 1.

The results in table 2 show an overall positive performance of UCT-based methods. However, some policies' performance is not stable against both opponents. This suggests that these policies might not be a good approximation of a Nash equilibrium. Note that we froze the policies after training and that these frozen policies are deterministic. This generally yields exploitable policies, which in turn exploit some types of policies. This might explain some of the contrary results, i.e. very positive performance against Sparbot while losing a large amount against FellOmen2. A coarse abstraction might emphasize these effects of freezing policies, as larger amounts of information states would be set to a particular action.

Due to the large standard errors relative to the differences in performance, comparing individual results is difficult. IS-UCT with a coarse abstraction and low exploration parameter, IS-UCT S, k=0.5, trained the policy that achieves the most robust performance against both opponents. On the other hand, Smooth IS-UCT trained two policies, with both fine and coarse abstractions, that achieve robust performance. More significant test results and an evaluation of performance for different training times are required to better compare Smooth IS-UCT against IS-UCT.

                              Sparbot      FellOmen2
    Coarse abstraction:
    IS-UCT S, k=1             72 ± 21      -47 ± 20
    IS-UCT S, k=0.5           54 ± 21       26 ± 21
    Smooth IS-UCT S, k=1      51 ± 20        7 ± 21
    Smooth IS-UCT S, k=0.5    56 ± 20      -41 ± 20
    Fine abstraction:
    IS-UCT L, k=0.5           60 ± 20      -17 ± 21
    Smooth IS-UCT L, k=0.5    18 ± 19       32 ± 21

Table 2: Two-player limit Texas Hold'em winnings in mb/h and their 68% confidence intervals. Each match-up was run for at least 60000 hands.

Three-player  In three-player limit Texas Hold'em, we tested the trained policies against benchmark and dummy opponents. Poki (Billings 2006) is a bot available in Poker Academy Pro. Its architecture is based on a mixture of technologies, including expert knowledge and online Monte-Carlo simulations. It won the six-player limit Texas Hold'em event of the AAAI-08 Computer Poker Competition. Due to a lack of publicly available benchmark bots capable of playing three-player limit Texas Hold'em, we also included match-ups against dummy agents that blindly play according to a fixed probability distribution. We chose two dummy agents that were suggested by (Risk and Szafron 2010). The Always-Raise agent, AR, always plays the bet/raise action. The Probe agent randomly chooses between the bet/raise and check/call action with equal probability.

All three-player MCTS policies were trained with exploration schedule (5) with k = 0.5. Smooth IS-UCT used mixing schedule (3), with η_t set to zero for nodes that have been visited less than 10000 times in order to avoid disproportionate exploration at rarely visited nodes.

The results, shown in table 3, suggest that self-play MCTS can train strong three-player policies. IS-UCT outperforms Smooth IS-UCT in the match-up against two instances of Poki. It performs slightly worse than Smooth IS-UCT in the other two match-ups. Both MCTS policies achieve strong results against the combination of AR and Probe despite using imperfect recall. This is notable because (Risk and Szafron 2010) reported anomalous, strongly negative results against this combination of dummy agents when using imperfect recall with CFR.

    Match-up               IS-UCT       Smooth IS-UCT
    2x Poki                163 ± 19     118 ± 19
    Poki, IS-UCT           -             81 ± 24
    Poki, Smooth IS-UCT     73 ± 24     -
    AR, Probe              760 ± 77     910 ± 78

Table 3: Three-player limit Texas Hold'em winnings in mb/h and their 68% confidence intervals. Each match-up was run for at least 30000 hands. Player positions were permuted to balance positional advantages.

Conclusion

This work evaluates computer poker policies that have been trained from self-play MCTS. The introduced variant of UCT, Smooth IS-UCT, is shown to converge to a Nash equilibrium in a small poker game. Furthermore, it achieves robust performance in large limit Texas Hold'em games. This suggests that the policies it learns are close to a Nash equilibrium and with increased training time might also converge in large games. However, the results were not conclusive in demonstrating a higher performance of Smooth IS-UCT than plain IS-UCT, as both UCT-based methods achieved strong performance.

The results against FellOmen2 suggest that self-play MCTS can be competitive with the state of the art of 2008, a year in which algorithmic agents beat expert human players for the first time in a public competition. However, testing performance against benchmark opponents is only a partial assessment of the quality of a policy. It does not explicitly measure the potential exploitability and is therefore insufficient to assess a policy's distance to a Nash equilibrium.

We think that the following directions of research could improve the performance and understanding of self-play MCTS in computer poker. A study of the theoretical properties of Smooth IS-UCT might unveil a class of algorithms that are theoretically sound in partially observable environments and converge to Nash equilibria.

Obtaining results from longer training periods and comparing the learning rate against state of the art methods, e.g. MCCFR, should help understanding the role of self-play MCTS in imperfect information games. A more sophisticated abstraction technique, e.g. discretisation through clustering or distribution-aware methods (Johanson et al. 2013), should improve the performance in Texas Hold'em. Using local or online search to plan for the currently played scenario could scale to multi-player games with more than three players. Due to their fast initial learning rate, this might be a particularly suitable application of MCTS-based methods.

Acknowledgments

This research was supported by the UK PhD Centre in Financial Computing and Google DeepMind.

References

Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, 322-331. IEEE.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.

Auger, D. 2011. Multiple tree for partially observable Monte-Carlo tree search. In Applications of Evolutionary Computation. Springer. 53-62.

Billings, D.; Burch, N.; Davidson, A.; Holte, R.; Schaeffer, J.; Schauenberg, T.; and Szafron, D. 2003. Approximating game-theoretic optimal strategies for full-scale poker. In IJCAI, 661-668.

Billings, D. 2006. Algorithms and assessment in computer poker. Ph.D. Dissertation, University of Alberta.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In 5th International Conference on Computers and Games.

Cowling, P. I.; Powley, E. J.; and Whitehouse, D. 2012. Information set Monte Carlo tree search. Computational Intelligence and AI in Games, IEEE Transactions on 4(2):120-143.

Ganzfried, S., and Sandholm, T. 2008. Computing an approximate jam/fold equilibrium for 3-player no-limit Texas Hold'em tournaments. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, 919-925.

Gelly, S.; Kocsis, L.; Schoenauer, M.; Sebag, M.; Silver, D.; Szepesvári, C.; and Teytaud, O. 2012. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM 55(3):106-113.

Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Proceedings of the National Conference on Artificial Intelligence, volume 21, 1007.

Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494-512.

Johanson, M.; Burch, N.; Valenzano, R.; and Bowling, M. 2013. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 271-278.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006. Springer. 282-293.

Kuhn, H. W. 1950. A simplified two-person poker. Contributions to the Theory of Games 1:97-103.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. H. 2009. Monte Carlo sampling for regret minimization in extensive games. In NIPS, 1078-1086.

Leslie, D. S., and Collins, E. J. 2006. Generalised weakened fictitious play. Games and Economic Behavior 56(2):285-298.

Littman, M. L. 1996. Algorithms for sequential decision making. Ph.D. Dissertation, Brown University.

Ponsen, M.; de Jong, S.; and Lanctot, M. 2011. Computing approximate Nash equilibria and robust best-responses using sampling. Journal of Artificial Intelligence Research 42(1):575-605.

Risk, N. A., and Szafron, D. 2010. Using counterfactual regret minimization to create competitive multiplayer poker agents. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, 159-166.

Rubin, J., and Watson, I. 2011. Computer poker: A review. Artificial Intelligence 175(5):958-987.

Sandholm, T. 2010. The state of solving large incomplete-information games, and application to poker. AI Magazine 31(4):13-32.

Selten, R. 1975. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory 4(1):25-55.

Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2164-2172.

Silver, D.; Sutton, R. S.; and Müller, M. 2012. Temporal-difference search in computer Go. Machine Learning 87(2):183-219.

Tesauro, G. 1992. Practical issues in temporal difference learning. In Reinforcement Learning. Springer. 33-53.

Waugh, K.; Schnizlein, D.; Bowling, M.; and Szafron, D. 2009. Abstraction pathologies in extensive games. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 781-788.

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2007. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, 1729-1736.
