Computer Poker and Imperfect Information: Papers from the AAAI-14 Workshop

Self-Play Monte-Carlo Tree Search in Computer Poker

Johannes Heinrich and David Silver
University College London, Gower Street, London, UK
{j.heinrich, d.silver}@cs.ucl.ac.uk

Abstract

Self-play reinforcement learning has proved to be successful in many two-player games. However, research carrying over its theoretical guarantees and practical success to games of imperfect information has been lacking. In this paper, we evaluate self-play Monte-Carlo Tree Search (MCTS) in limit Texas Hold'em and Kuhn poker. We introduce a variant of the established UCB algorithm and provide first empirical results demonstrating its ability to find approximate Nash equilibria.

Introduction

Reinforcement learning has traditionally focused on stationary single-agent environments. Its applicability to fully observable multi-agent Markov games has been explored by (Littman 1996). Backgammon and computer Go are two examples of fully observable two-player games where reinforcement learning methods have achieved outstanding performance (Tesauro 1992; Gelly et al. 2012). Computer poker provides a diversity of stochastic imperfect information games of different sizes and has proved to be a fruitful research domain for game theory and artificial intelligence (Sandholm 2010; Rubin and Watson 2011). It is therefore an ideal research subject for self-play reinforcement learning in partially observable multi-agent domains.

Game-theoretic approaches have played a dominant role in furthering algorithmic performance in computer poker. The game is usually formulated as an extensive-form game with imperfect information, and most attention has been directed towards the two-player Texas Hold'em variant. Abstracting the game (Billings et al. 2003; Gilpin and Sandholm 2006), by combining similar nodes, reduces the size of the game tree and has allowed various approaches to compute approximate Nash equilibria. Common game-theoretic methods in computer poker are based on linear programming (Billings et al. 2003; Gilpin and Sandholm 2006), non-smooth convex optimization (Hoda et al. 2010), fictitious play (Ganzfried and Sandholm 2008) or counterfactual regret minimization (CFR) (Zinkevich et al. 2007). Except for Monte-Carlo counterfactual regret minimization (MCCFR) (Lanctot et al. 2009), these are generally full-width methods and are therefore particularly exposed to the curse of dimensionality. This might prevent them from scaling to larger games, e.g. multi-player poker. In addition, some methods are undefined for games with more than two players, e.g. the excessive gap technique (Hoda et al. 2010), while other approaches lose their theoretical guarantees, e.g. CFR. In spite of this, CFR has achieved strong performance in three-player limit Texas Hold'em (Risk and Szafron 2010). The MCTS methods discussed in this paper are well defined for any number of players but do not yet have any established theoretical convergence guarantees, even for two-player zero-sum imperfect information games.

MCTS is a simulation-based search algorithm that has been successful in high-dimensional domains, e.g. computer Go (Gelly et al. 2012). Selective sampling of trajectories of the game enables it to prioritise the most promising regions of the search space. Furthermore, it only requires a black box simulator and might therefore plan in complex environments, e.g. with approximate generative models of real world applications. Finally, it is built on principles of reinforcement learning and can therefore leverage its well-developed machinery. For example, function approximation and bootstrapping are two ideas that can dramatically improve the efficiency of learning algorithms (Tesauro 1992; Silver, Sutton, and Müller 2012).

(Ponsen, de Jong, and Lanctot 2011) compared the quality of policies found by MCTS and MCCFR in computer poker. They concluded that MCTS quickly finds a good but suboptimal policy, while MCCFR initially learns more slowly but converges to the optimal policy over time.

In this paper, we extend the evaluation of MCTS-based methods in computer poker. We introduce a variant of UCT, Smooth UCT, that combines MCTS with elements of fictitious play, and provide first empirical results of its convergence to Nash equilibria in Kuhn poker. This algorithm might address the inability of UCT to converge close to a Nash equilibrium over time, while retaining UCT's fast initial learning rate. Both UCT and Smooth UCT achieve high performance against competitive approximate Nash equilibria in limit Texas Hold'em. This demonstrates that strong policies can be learned from UCT-based self-play methods in partially observable environments.

Background

Extensive-Form Games

Extensive-form games are a rich model of multi-agent interaction. The representation is based on a game tree. Introducing partitions of the tree that constrain the players' information about the exact position in the game tree makes it possible to add versatile structure to the game, e.g. simultaneous moves and partial observability. Formally, based on the definition of (Selten 1975),

Definition 1. An extensive-form game with imperfect information consists of
• A set of players N = {1, ..., n} and a chance player c.
• A set of states S corresponding to nodes in a rooted game tree with initial state s_0. S_Z ⊂ S is the set of terminal states.
• A partition of the state space S into player sets P^i, i ∈ N ∪ {c}. Each state s ∈ P^i represents a decision node of player i, and A(s) is the set of actions available to this player in s. In particular, P^c is the set of states at which chance determines the successor state.
• For each player, a set of information states O^i and a surjective information function I^i that assigns each state s ∈ P^i to an information state o ∈ O^i. For any information state o ∈ O^i, the information set (I^i)^{-1}(o) is the set of states s ∈ P^i that are indistinguishable for player i.
• A chance function T^c(s, a) = P(a^c_t = a | s_t = s), s ∈ P^c, that determines chance events at chance nodes. Transitions at players' decision nodes are described by their policies.
• For each player, a reward function R^i that maps terminal states to payoffs.

For each player i ∈ N, the sequence of his information states and own actions in an episode forms a history h^i_t = {o^i_1, a^i_1, o^i_2, a^i_2, ..., o^i_t}. A game has perfect recall if each player's current information state o^i_t implies knowledge of his whole history h^i_t of the episode. If a player only knows some strict subset of his history h^i_t, then the game has imperfect recall.

The following definition is similar to (Waugh et al. 2009) but makes use of information functions.

Definition 2. Let Γ be an extensive-form game. An abstraction of Γ is a collection of surjective functions f^i_A : O^i → Õ^i that map the information states of Γ to some alternative information state spaces Õ^i. By composing the abstraction with the information functions I^i of Γ, we obtain alternative information functions, Ĩ^i = f^i_A ∘ I^i, that induce an abstracted extensive-form game.

Mapping multiple information states to a single alternative information state results in the affected player not being able to distinguish between the mapped states.

Each player's behavioural strategy is determined by his policy π^i(o, a) = P(a^i_t = a | o^i_t = o), which is a probability distribution over actions given an information state, and Δ^i is the set of all policies of player i. A policy profile π = (π^1, ..., π^n) is a collection of policies for all players. π^{-i} refers to all policies in π except π^i. R^i(π) is the expected reward of player i if all players follow the policy profile π. The set of best responses of player i to his opponents' policies π^{-i} is

    b^i(π^{-i}) = argmax_{π^i ∈ Δ^i} R^i(π^i, π^{-i}).

For ε > 0,

    b^i_ε(π^{-i}) = { π^i ∈ Δ^i : R^i(π^i, π^{-i}) ≥ R^i(b^i(π^{-i}), π^{-i}) − ε }

defines the set of ε-best responses to the policy profile π^{-i}.

Definition 3. A Nash equilibrium of an extensive-form game is a policy profile π such that π^i ∈ b^i(π^{-i}) for all i ∈ N. An ε-Nash equilibrium is a policy profile π such that π^i ∈ b^i_ε(π^{-i}) for all i ∈ N.
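The algorithm sketches later in this section access a game only through a small interface that mirrors the components of Definition 1 and the black box simulators introduced below. The following Python protocol is our own illustrative assumption, not part of the paper's formalism; a concrete game such as Kuhn poker would implement it, and chance nodes are folded into the successor sampling for simplicity.

```python
from typing import Any, Dict, Hashable, List, Protocol

class ExtensiveFormGame(Protocol):
    """Minimal game interface mirroring Definition 1: players, states, actions,
    transition/chance sampling, terminal payoffs, and information functions."""

    def players(self) -> List[int]: ...                        # N = {1, ..., n}
    def initial_state(self) -> Any: ...                        # s_0
    def is_terminal(self, state: Any) -> bool: ...             # s in S_Z?
    def current_player(self, state: Any) -> int: ...           # i with s in P^i
    def actions(self, state: Any) -> List[Any]: ...            # A(s)
    def sample_successor(self, state: Any, action: Any) -> Any: ...   # transition simulator G (chance included)
    def payoffs(self, state: Any) -> Dict[int, float]: ...     # reward simulator R: player -> payoff
    def info_state(self, state: Any, player: int) -> Hashable: ...    # information function I^i(s)
```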
MCTS

MCTS (Coulom 2006) is a simulation-based search algorithm. It is able to plan in high-dimensional environments by sampling episodes through Monte-Carlo simulation. These simulations are guided by an action selection mechanism that explores the most promising regions of the state space. This guided search results in efficient, asymmetric search trees. MCTS converges to optimal policies in fully observable Markovian environments.

An MCTS algorithm requires the following components. Given a state and action, a black box simulator of the game samples a successor state and reward. A learning algorithm uses simulated trajectories and outcomes to update some statistics of the visited nodes in the search tree. A tree policy is defined by an action selection mechanism that chooses actions based on a node's statistics. A rollout policy determines default behaviour for states that are out of the scope of the search tree.

For a specified amount of planning time the algorithm repeats the following. It starts each Monte-Carlo simulation at the root node and follows its tree policy until either reaching a terminal state or the boundary of the search tree. Once the simulation leaves the scope of the search tree, the rollout policy is used to play out the simulation until reaching a terminal state. In this case, we expand our tree by a state node at the point where we left the tree. This approach selectively grows the tree in areas that are frequently encountered in simulations. After reaching a terminal state, the rewards are propagated back so that each visited state node can update its statistics.

Common MCTS keeps track of the following node values. N(s) is the number of visits by a Monte-Carlo simulation to node s. N(s, a) counts the number of times action a has been chosen at node s. Q(s, a) is the estimated value of choosing action a at node s. The action value estimates are usually updated by Monte-Carlo evaluation, Q(s, a) = (1 / N(s, a)) Σ_{i=1}^{N(s,a)} R_s(i), where R_s(i) is the cumulative discounted reward achieved after visiting state s in the i-th Monte-Carlo simulation that has encountered s.
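A minimal sketch of these node statistics in Python, maintaining Q(s, a) as an incremental running mean of observed returns; the class name and structure are our own illustration, not the paper's code.

```python
from collections import defaultdict

class MCTSNode:
    """Node statistics tracked by common MCTS: N(s), N(s, a) and Q(s, a)."""

    def __init__(self):
        self.n = 0                        # N(s): number of simulations that visited this node
        self.n_a = defaultdict(int)       # N(s, a): times action a was chosen here
        self.q = defaultdict(float)       # Q(s, a): Monte-Carlo estimate of a's value here

    def update(self, action, reward):
        """Incremental Monte-Carlo evaluation: keeps Q(s, a) equal to the mean
        of the returns observed after choosing `action` at this node."""
        self.n += 1
        self.n_a[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n_a[action]
```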

(Kocsis and Szepesvári 2006) suggested using the bandit-based algorithm UCB (Auer, Cesa-Bianchi, and Fischer 2002) to select actions in the Monte-Carlo search tree. The resulting MCTS method, Upper Confidence Trees (UCT), selects greedily between action values that have been enhanced by an exploration bonus,

    π_tree(s) = argmax_a [ Q(s, a) + c √( log N(s) / N(s, a) ) ].    (1)

The exploration bonus parameter c adjusts the balance between exploration and exploitation. For suitable c the probability of choosing a suboptimal action has been shown to converge to 0 (Kocsis and Szepesvári 2006). The appropriate scale of c depends on the size of the rewards.

Other MCTS methods have been proposed in the literature. (Silver and Veness 2010) adapt MCTS to partially observable Markov decision processes (POMDPs) and prove convergence given a true initial belief state. (Auger 2011) and (Cowling, Powley, and Whitehouse 2012) study MCTS in imperfect information games and use the bandit method EXP3 (Auer et al. 1995) for action selection.

Self-Play MCTS in Extensive-Form Games

Our goal is to use MCTS-based planning to learn an optimal policy profile of an extensive-form game with imperfect information.

Planning in a multi-player game requires simulation of players' behaviour. One approach might be to assume explicit models of player behaviour and then use a planning algorithm to learn a best response against these models. However, the quality of the policy profile learned with this approach will depend on the assumed player models. In particular, to learn an optimal policy profile, we would most likely need a player model that produces close to optimal behaviour in the first place. We therefore pursue an alternative approach that does not require this type of prior knowledge. Self-play planning does not assume any explicit models of player behaviour. Instead, it samples players' actions from the current policy profile suggested by the planning process. E.g. self-play MCTS samples player behaviour from the players' tree policies.

The asymmetry of information in an extensive-form game with imperfect information generally does not allow us to span a single collective search tree. We therefore describe a general algorithm that uses a separate search tree for each player. For each player we grow a tree T^i over his information states O^i. T^i(o^i) is the node in player i's tree that represents his information state o^i. In a game with perfect recall, a player remembers his sequence of previous information states and actions h^i_t = {o^i_1, a^i_1, o^i_2, a^i_2, ..., o^i_t}. By definition of an extensive-form game, this knowledge is also represented in the information state o^i_t. Therefore, T^i is a proper tree. An imperfect recall extensive-form game yields a recombining tree.

Algorithm 1 describes MCTS adapted to the multi-player, imperfect information setting of extensive-form games. Transitions and rewards are sampled from simulators G and R. The transition simulator takes a state and action as inputs and generates a sample of a successor state, s_{t+1} ~ G(s_t, a^i_t), where the action a^i_t belongs to the player i who makes decisions at state s_t. The reward simulator generates all players' payoffs at terminal states, i.e. r_T ~ R(s_T). The information function, I^i(s), determines the acting player i's information state. OUT-OF-TREE keeps track of which player has left the scope of his search tree in the current episode.

The basic idea of searching trees over information states is similar to (Auger 2011; Cowling, Powley, and Whitehouse 2012). However, algorithm 1 defines a more general class of MCTS-based algorithms that can be specified by their action selection and node updating functions. These functions are responsible for sampling from and updating the tree policy. In this work, we focus on UCT-based methods.

Algorithm 1 Self-play MCTS in extensive-form games

function SEARCH()
    while within computational budget do
        s_0 ~ initial state
        SIMULATE(s_0)
    end while
    return π_tree
end function

function ROLLOUT(s)
    a ~ π_rollout(s)
    s' ~ G(s, a)
    return SIMULATE(s')
end function

function SIMULATE(s)
    if ISTERMINAL(s) then
        return r ~ R(s)
    end if
    i = PLAYER(s)
    if OUT-OF-TREE(i) then
        return ROLLOUT(s)
    end if
    o^i = I^i(s)
    if o^i ∉ T^i then
        EXPANDTREE(T^i, o^i)
        a ~ π_rollout(s)
        OUT-OF-TREE(i) ← true
    else
        a = SELECT(T^i(o^i))
    end if
    s' ~ G(s, a)
    r ← SIMULATE(s')
    UPDATE(T^i(o^i), a, r^i)
    return r
end function
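To make the control flow of Algorithm 1 concrete, here is a hedged Python sketch of the SEARCH loop and the SIMULATE recursion, using the UCB selection of equation (1) as the tree policy and reusing the MCTSNode class and game interface sketched above. The helper names and the uniform rollout policy are our own assumptions; this is an illustration under those assumptions, not the authors' implementation.

```python
import math
import random

def ucb_select(node, actions, c):
    """Equation (1): argmax_a Q(s, a) + c * sqrt(log N(s) / N(s, a)).
    Untried actions get infinite priority so each is sampled at least once."""
    def score(a):
        if node.n_a[a] == 0:
            return float("inf")
        return node.q[a] + c * math.sqrt(math.log(node.n) / node.n_a[a])
    return max(actions, key=score)

def random_rollout(game, state):
    """Default rollout policy: uniform random action (an assumption of this sketch)."""
    return random.choice(game.actions(state))

def simulate(game, state, trees, out_of_tree, c, rollout_policy=random_rollout):
    """One episode of Algorithm 1. `trees[i]` maps player i's information
    states to MCTSNode statistics, i.e. a separate search tree per player."""
    if game.is_terminal(state):
        return game.payoffs(state)                           # reward simulator R
    i = game.current_player(state)
    if out_of_tree[i]:                                       # rollout phase for player i
        a = rollout_policy(game, state)
        return simulate(game, game.sample_successor(state, a),
                        trees, out_of_tree, c, rollout_policy)
    o = game.info_state(state, i)                            # information function I^i(s)
    if o not in trees[i]:
        trees[i][o] = MCTSNode()                             # expand the tree where we left it
        a = rollout_policy(game, state)
        out_of_tree[i] = True
    else:
        a = ucb_select(trees[i][o], game.actions(state), c)  # tree policy (SELECT)
    rewards = simulate(game, game.sample_successor(state, a),     # transition simulator G
                       trees, out_of_tree, c, rollout_policy)
    trees[i][o].update(a, rewards[i])                        # back up player i's payoff (UPDATE)
    return rewards

def search(game, num_episodes, c=2.0):
    """Outer SEARCH loop: repeated self-play simulations from the initial state."""
    trees = {i: {} for i in game.players()}
    for _ in range(num_episodes):
        out_of_tree = {i: False for i in game.players()}
        simulate(game, game.initial_state(), trees, out_of_tree, c)
    return trees
```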
Information State UCT (IS-UCT)

IS-UCT uses UCB to select from and update the tree policy in algorithm 1. It can be seen as a multi-player version of Partially Observable UCT (PO-UCT) (Silver and Veness 2010), which searches trees over histories of information states instead of histories of observations and actions of a POMDP. In fact, assuming fixed opposing players' policies, a player is effectively acting in a POMDP. However, in self-play planning, players' policies might never become stationary and keep changing indefinitely.

Indeed, (Ponsen, de Jong, and Lanctot 2011) empirically show the inability of UCT to converge to a Nash equilibrium in Kuhn poker. However, it seems to unlearn dominated actions and therefore might yield decent but potentially exploitable performance in practice.

Smooth IS-UCT

To address the lack of convergence guarantees of IS-UCT in partially observable self-play environments, we introduce an alternative tree policy algorithm. It has similarities to generalised weakened fictitious play (Leslie and Collins 2006), for which convergence results have been established for some types of games, e.g. two-player zero-sum games.

Definition 4. A Smooth UCB process is defined by the following sequence of policies:

    π^i_t = (1 − η_t) π̄^i_{t−1} + η_t UCB^i_t,
    π̄^i_t = π̄^i_{t−1} + (1/t) (a^i_t − π̄^i_{t−1})    (2)

for some time-t-adapted sequence η_t ∈ [c, 1], c > 0. π_t is the behavioural policy profile that is used for planning at simulation t. Players' actions, a^i_t, i ∈ N, are sampled from this policy. π̄_t is the average policy profile at time t.

UCB plays a greedy best response towards its optimistic action value estimates. Smooth UCB plays a smooth best response instead. If UCB were an ε-best response to the opponents' average policy profile, π̄^{-i}_t, with ε → 0 as t → ∞, then the Smooth UCB process would be a generalised weakened fictitious play. For example, this would be the case if the action value estimates were generated from unbiased samples of the average policy. However, this is clearly not what happens in self-play, where at each time players are trying to exploit each other rather than playing their average policy. Intuitively, mixing the current UCB policy with the average policy might help with this policy evaluation problem.

Algorithm 2 instantiates general self-play MCTS with a Smooth UCB tree policy. For a constant η_t = 1, we obtain IS-UCT as a special case. We propose two schedules to set η_t in the experiments section. Note that for a cheaply determined η_t the update and selection steps of Smooth UCB do not have any overhead compared to UCB.

Algorithm 2 Smooth IS-UCT

SEARCH(), SIMULATE(s) and ROLLOUT(s) as in algorithm 1

function SELECT(T^i(o^i))
    u ~ U[0, 1]
    if u < η_t then
        return argmax_a [ Q(o^i, a) + c √( log N(o^i) / N(o^i, a) ) ]
    else
        ∀ a ∈ A(o^i): p(a) ← N(o^i, a) / N(o^i)
        return a ~ p
    end if
end function

function UPDATE(T^i(o^i), a, r^i)
    N(o^i) ← N(o^i) + 1
    N(o^i, a) ← N(o^i, a) + 1
    Q(o^i, a) ← Q(o^i, a) + ( r^i − Q(o^i, a) ) / N(o^i, a)
end function
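A Python sketch of the Smooth UCB SELECT step of Algorithm 2: with probability η_t it acts like UCB, otherwise it samples from the average policy implied by the visit counts, p(a) = N(o, a)/N(o). The η schedule shown approximates mixing schedule (3) from the experiments section by using the same count-based average policy; the function names and the zero-visit fallbacks are assumptions of this sketch. With η fixed at 1 it reduces to the plain UCB selection of IS-UCT, matching the special case noted above.

```python
import math
import random

def smooth_ucb_select(node, actions, c, eta):
    """Smooth UCB tree policy: UCB with probability eta, average policy otherwise."""
    if random.random() < eta:
        # UCB branch: greedy in the exploration-enhanced action values (equation (1))
        def score(a):
            if node.n_a[a] == 0:
                return float("inf")
            return node.q[a] + c * math.sqrt(math.log(node.n) / node.n_a[a])
        return max(actions, key=score)
    # average-policy branch: sample a in proportion to N(o, a)
    weights = [node.n_a[a] for a in actions]
    if sum(weights) == 0:
        return random.choice(actions)      # unvisited node: fall back to uniform (assumption)
    return random.choices(actions, weights=weights, k=1)[0]

def mixing_schedule_3(node, actions):
    """Schedule (3): eta_t(o) = max_a of the (count-based) average policy at o."""
    if node.n == 0:
        return 1.0                         # behave like plain UCB before any visits (assumption)
    return max(node.n_a[a] for a in actions) / node.n
```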
We are currently working on developing the theory of this algorithm. Our experiments on Kuhn poker provide first empirical evidence of the algorithm's ability to find approximate Nash equilibria.

Experiments

We evaluate IS-UCT and Smooth IS-UCT in Kuhn and Texas Hold'em poker games.

Kuhn Poker

Kuhn poker (Kuhn 1950) is a small extensive-form game with imperfect information. It features typical elements of poker, e.g. balancing betting frequencies for different types of holdings. However, it does not include community cards and multi-street play. As there exist closed-form game-theoretic solutions, it is well suited for a first overview of the performance and calibration of different methods.

(Ponsen, de Jong, and Lanctot 2011) tested UCT against MCCFR in Kuhn poker. We conducted a similar experiment with Smooth IS-UCT. We calibrated the exploration parameter for both IS-UCT and Smooth IS-UCT, and set it to 2 and 1.75 respectively. We tested two choices of Smooth UCB's mixing parameter sequence η_t:

    η^i_t(o^i) = max_a π̄^i_{t−1}(o^i, a)    (3)

    η^i_t(o^i) = 0.9 · 10000 / ( 10000 + N^i_{t−1}(o^i) )    (4)

Both worked well in standard Kuhn poker. However, only schedule (3) yielded stable performance across several variants of Kuhn poker with varying initial pot size. All reported results for Smooth IS-UCT on Kuhn poker were generated with schedule (3). The average policies' squared (SQ-ERR) and dominated (DOM-ERR) errors were measured after every 10^4 episodes. The squared error of a policy is its mean-squared error measured against the closest Nash equilibrium. The dominated error of a policy is the sum of the probabilities with which a dominated action is taken at an information state. The results, shown in figures 1 and 2, were averaged over 10 identically repeated runs of the experiment.

In standard Kuhn poker, Smooth IS-UCT performs better both in terms of squared and dominated error. We repeated the experiment for variants of Kuhn poker with higher initial pot sizes. This is supposed to emulate the varying optimal action probabilities that in poker depend on the bet size relative to the pot. Using mixing schedule (3), Smooth IS-UCT was able to converge close to a Nash equilibrium in each variant. For larger pot sizes, Smooth IS-UCT seems to suffer a larger dominated error than IS-UCT.

To conclude, Smooth IS-UCT might focus its self-play search on regions of the policy space that are close to a Nash equilibrium. However, in large games, the number of visits to a node might be too low to provide a stochastic average policy with small dominated error.
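Both error measures can be computed directly from an average policy. The sketch below assumes policies are dictionaries mapping information states to action-probability dictionaries, and that the caller supplies the closest Nash equilibrium and the dominated actions per information state; these inputs and names are illustrative assumptions, not artefacts of the paper.

```python
def squared_error(policy, nash_policy):
    """SQ-ERR: mean-squared error of `policy` against a closest Nash equilibrium,
    averaged over all information-state/action probabilities."""
    diffs = [(policy.get(o, {}).get(a, 0.0) - p_star) ** 2
             for o, probs in nash_policy.items()
             for a, p_star in probs.items()]
    return sum(diffs) / len(diffs)

def dominated_error(policy, dominated_actions):
    """DOM-ERR: sum over information states of the probability with which a
    dominated action is taken. `dominated_actions` maps each information state
    to its set of dominated actions."""
    return sum(policy.get(o, {}).get(a, 0.0)
               for o, actions in dominated_actions.items()
               for a in actions)
```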

Figure 1: Learning curve of IS-UCT and Smooth IS-UCT in Kuhn poker. The x-axis denotes the number of simulated episodes. The y-axis denotes the quality of a policy in terms of its dominated and squared error measured against a closest Nash equilibrium.

Figure 2: Learning curve of IS-UCT and Smooth IS-UCT in a Kuhn poker variant with an initial pot size of 8.

Limit Texas Hold'em

We consider limit Texas Hold'em with two and three players. The two-player variant's game tree contains about 10^18 nodes. In order to reduce the game tree to a tractable size, various abstraction techniques have been proposed in the literature (Billings et al. 2003; Gilpin and Sandholm 2006; Johanson et al. 2013). The most common approach is to group information states according to strategic similarity of the corresponding cards and leave the players' action sequences unabstracted.

A player's information state can be described as a tuple o^i_t = (ā_t, c_t, p^i_t), where ā_t is the sequence of all players' actions except chance, p^i_t are the two private cards held by player i and c_t is the sequence of community cards publicly revealed by chance by time t. In Texas Hold'em only the private cards constitute hidden information, i.e. each player observes everything except his opponents' private cards.

Let g_A(c, p^i) ∈ ℕ be a function that maps tuples of public community cards and private cards to an abstraction bucket. The function f_A(ā, c, p^i) = (ā, g_A(c, p^i)) defines an abstraction that maps information states O = ∪_{i∈N} O^i to alternative information states Õ = ∪_{i∈N} Õ^i of an abstracted game. An abstracted information state is a tuple (ā, n), where ā is the unabstracted sequence of all players' actions except chance and n ∈ ℕ identifies the abstraction bucket.

In this work, we use a metric called expected hand strength squared (Zinkevich et al. 2007). At a final betting round and given the corresponding set of community cards, the hand strength of a holding is defined as its winning percentage against all possible other holdings. On any betting round, E[HS²] denotes the expected value of the squared hand strength on the final betting round. We have discretised the resulting E[HS²] values with an equidistant grid over their range, [0, 1]. Table 1 shows the grid sizes that we used in our experiments. We used an imperfect recall abstraction that does not let a player remember his E[HS²] values of previous betting rounds, i.e. g_A(c, p^i) is set to the thresholded E[HS²] value of p^i given c.

    Game size    Preflop    Flop    Turn    River
    2p S         169        100     100     100
    2p L         169        1000    500     200
    3p           169        1000    100     10

Table 1: Abstraction bucket discretisation grids used in our two- and three-player Texas Hold'em experiments.

In limit Texas Hold'em, there is an upper bound on the possible terminal pot size based on the betting that has occurred so far in the episode. This is because betting is capped at each betting round. To avoid unnecessary exploration, we update the exploration parameter for the current episode at the beginning of each betting round and set it to

    c_t = pot size + k · remaining betting potential,    (5)

where k ∈ [0, 1] is a constant parameter and the remaining betting potential is the maximum possible amount that players can add to the pot in the remainder of the episode.

Each two-player policy was trained for about 7 billion episodes, which took about 30 hours on a modern computer without using parallelization. Each three-player agent was trained for about 12 billion episodes, requiring about 48 hours of training time. After training, the policies were frozen to return the greedy, i.e. highest value, action at each information state. Performance is measured in milli-big-blinds per hand, mb/h.
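As a small illustration of the card abstraction and the exploration schedule of equation (5), the following Python sketch maps an E[HS²] value onto an equidistant bucket grid and recomputes the exploration parameter at the start of a betting round; the function names and argument conventions are our own assumptions.

```python
def ehs2_bucket(ehs2: float, num_buckets: int) -> int:
    """Map an E[HS^2] value in [0, 1] to a bucket on an equidistant grid,
    as used for the abstraction g_A(c, p^i) with the grid sizes of Table 1."""
    return min(int(ehs2 * num_buckets), num_buckets - 1)

def exploration_parameter(pot_size: float, remaining_betting_potential: float,
                          k: float = 0.5) -> float:
    """Equation (5): c_t = pot size + k * remaining betting potential,
    recomputed at the beginning of each betting round."""
    return pot_size + k * remaining_betting_potential
```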

Two-player  We evaluated the performance of the trained policies against benchmark opponents in two-player limit Texas Hold'em. Both opponents, Sparbot (Billings et al. 2003) and FellOmen2, play approximate Nash equilibrium strategies. Sparbot is available in the software Poker Academy Pro. FellOmen2 is a publicly available poker bot that was developed by Ian Fellows. It plays an approximate Nash equilibrium that was learned from fictitious play. It placed second, tying three-way, in the AAAI-08 Computer Poker Competition.

All two-player MCTS policies were trained using exploration schedule (5), with parameter k given in the table. The Smooth IS-UCT agents used the same mixing schedule (3) as in Kuhn poker. The abstraction sizes are shown in table 1.

The results in table 2 show an overall positive performance of UCT-based methods. However, some policies' performance is not stable against both opponents. This suggests that these policies might not be a good approximation of a Nash equilibrium. Note that we froze the policies after training and that these frozen policies are deterministic. This generally yields exploitable policies, which in turn exploit some types of policies. This might explain some of the contrary results, i.e. very positive performance against Sparbot while losing a large amount against FellOmen2. A coarse abstraction might emphasize these effects of freezing policies, as larger amounts of information states would be set to a particular action.

Due to the large standard errors relative to the differences in performance, comparing individual results is difficult. IS-UCT with a coarse abstraction and low exploration parameter, IS-UCT S, k=0.5, trained the policy that achieves the most robust performance against both opponents. On the other hand, Smooth IS-UCT trained two policies, with both fine and coarse abstractions, that achieve robust performance. More significant test results and an evaluation of performance for different training times are required to better compare Smooth IS-UCT against IS-UCT.

                              Sparbot      FellOmen2
    Coarse abstraction:
    IS-UCT S, k=1             72 ± 21      -47 ± 20
    IS-UCT S, k=0.5           54 ± 21       26 ± 21
    Smooth IS-UCT S, k=1      51 ± 20        7 ± 21
    Smooth IS-UCT S, k=0.5    56 ± 20      -41 ± 20
    Fine abstraction:
    IS-UCT L, k=0.5           60 ± 20      -17 ± 21
    Smooth IS-UCT L, k=0.5    18 ± 19       32 ± 21

Table 2: Two-player limit Texas Hold'em winnings in mb/h and their 68% confidence intervals. Each match-up was run for at least 60000 hands.

Three-player  In three-player limit Texas Hold'em, we tested the trained policies against benchmark and dummy opponents. Poki (Billings 2006) is a bot available in Poker Academy Pro. Its architecture is based on a mixture of technologies, including expert knowledge and online Monte-Carlo simulations. It won the six-player limit Texas Hold'em event of the AAAI-08 Computer Poker Competition. Due to a lack of publicly available benchmark bots capable of playing three-player limit Texas Hold'em, we also included match-ups against dummy agents that blindly play according to a fixed probability distribution. We chose two dummy agents that were suggested by (Risk and Szafron 2010). The Always-Raise agent, AR, always plays the bet/raise action. The Probe agent randomly chooses between the bet/raise and check/call action with equal probability.

All three-player MCTS policies were trained with exploration schedule (5) with k = 0.5. Smooth IS-UCT used mixing schedule (3), with η_t set to zero for nodes that have been visited less than 10000 times in order to avoid disproportionate exploration at rarely visited nodes.

The results, shown in table 3, suggest that self-play MCTS can train strong three-player policies. IS-UCT outperforms Smooth IS-UCT in the match-up against two instances of Poki. It performs slightly worse than Smooth IS-UCT in the other two match-ups. Both MCTS policies achieve strong results against the combination of AR and Probe despite using imperfect recall. This is notable because (Risk and Szafron 2010) reported anomalous, strongly negative results against this combination of dummy agents when using imperfect recall with CFR.

    Match-up               IS-UCT       Smooth IS-UCT
    2x Poki                163 ± 19     118 ± 19
    Poki, IS-UCT           -             81 ± 24
    Poki, Smooth IS-UCT     73 ± 24     -
    AR, Probe              760 ± 77     910 ± 78

Table 3: Three-player limit Texas Hold'em winnings in mb/h and their 68% confidence intervals. Each match-up was run for at least 30000 hands. Player positions were permuted to balance positional advantages.

Conclusion

This work evaluates computer poker policies that have been trained from self-play MCTS. The introduced variant of UCT, Smooth IS-UCT, is shown to converge to a Nash equilibrium in a small poker game. Furthermore, it achieves robust performance in large limit Texas Hold'em games. This suggests that the policies it learns are close to a Nash equilibrium and with increased training time might also converge in large games. However, the results were not conclusive in demonstrating a higher performance of Smooth IS-UCT than plain IS-UCT, as both UCT-based methods achieved strong performance.

The results against FellOmen2 suggest that self-play MCTS can be competitive with the state of the art of 2008, a year in which algorithmic agents beat expert human players for the first time in a public competition. However, testing performance against benchmark opponents is only a partial assessment of the quality of a policy. It does not explicitly measure the potential exploitability and is therefore insufficient to assess a policy's distance to a Nash equilibrium.

We think that the following directions of research could improve the performance and understanding of self-play MCTS in computer poker. A study of the theoretical properties of Smooth IS-UCT might unveil a class of algorithms that are theoretically sound in partially observable environments and converge to Nash equilibria.

Obtaining results from longer training periods and comparing the learning rate against state of the art methods, e.g. MCCFR, should help understanding the role of self-play MCTS in imperfect information games. A more sophisticated abstraction technique, e.g. discretisation through clustering or distribution-aware methods (Johanson et al. 2013), should improve the performance in Texas Hold'em. Using local or online search to plan for the currently played scenario could scale to multi-player games with more than three players. Due to their fast initial learning rate, this might be a particularly suitable application of MCTS-based methods.

Acknowledgments

This research was supported by the UK PhD Centre in Financial Computing and Google DeepMind.

References

Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, 322-331. IEEE.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.

Auger, D. 2011. Multiple tree for partially observable Monte-Carlo tree search. In Applications of Evolutionary Computation. Springer. 53-62.

Billings, D.; Burch, N.; Davidson, A.; Holte, R.; Schaeffer, J.; Schauenberg, T.; and Szafron, D. 2003. Approximating game-theoretic optimal strategies for full-scale poker. In IJCAI, 661-668.

Billings, D. 2006. Algorithms and assessment in computer poker. Ph.D. Dissertation, University of Alberta.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In 5th International Conference on Computers and Games.

Cowling, P. I.; Powley, E. J.; and Whitehouse, D. 2012. Information set Monte Carlo tree search. Computational Intelligence and AI in Games, IEEE Transactions on 4(2):120-143.

Ganzfried, S., and Sandholm, T. 2008. Computing an approximate jam/fold equilibrium for 3-player no-limit Texas Hold'em tournaments. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, 919-925.

Gelly, S.; Kocsis, L.; Schoenauer, M.; Sebag, M.; Silver, D.; Szepesvári, C.; and Teytaud, O. 2012. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM 55(3):106-113.

Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Proceedings of the National Conference on Artificial Intelligence, volume 21, 1007.

Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494-512.

Johanson, M.; Burch, N.; Valenzano, R.; and Bowling, M. 2013. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 271-278.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006. Springer. 282-293.

Kuhn, H. W. 1950. A simplified two-person poker. Contributions to the Theory of Games 1:97-103.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. H. 2009. Monte Carlo sampling for regret minimization in extensive games. In NIPS, 1078-1086.

Leslie, D. S., and Collins, E. J. 2006. Generalised weakened fictitious play. Games and Economic Behavior 56(2):285-298.

Littman, M. L. 1996. Algorithms for sequential decision making. Ph.D. Dissertation, Brown University.

Ponsen, M.; de Jong, S.; and Lanctot, M. 2011. Computing approximate Nash equilibria and robust best-responses using sampling. Journal of Artificial Intelligence Research 42(1):575-605.

Risk, N. A., and Szafron, D. 2010. Using counterfactual regret minimization to create competitive multiplayer poker agents. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, 159-166.

Rubin, J., and Watson, I. 2011. Computer poker: A review. Artificial Intelligence 175(5):958-987.

Sandholm, T. 2010. The state of solving large incomplete-information games, and application to poker. AI Magazine 31(4):13-32.

Selten, R. 1975. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory 4(1):25-55.

Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2164-2172.

Silver, D.; Sutton, R. S.; and Müller, M. 2012. Temporal-difference search in computer Go. Machine Learning 87(2):183-219.

Tesauro, G. 1992. Practical issues in temporal difference learning. In Reinforcement Learning. Springer. 33-53.

Waugh, K.; Schnizlein, D.; Bowling, M.; and Szafron, D. 2009. Abstraction pathologies in extensive games. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 781-788.

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2007. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, 1729-1736.
