Monte Carlo *-Minimax Search
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Marc Lanctot, Department of Knowledge Engineering, Maastricht University, Netherlands
Abdallah Saffidine, LAMSADE, Université Paris-Dauphine, France
Joel Veness and Christopher Archibald, Department of Computing Science, University of Alberta, Canada
Mark H. M. Winands, Department of Knowledge Engineering, Maastricht University, Netherlands

Abstract

This paper introduces Monte Carlo *-Minimax Search (MCMS), a Monte Carlo search algorithm for turn-based, stochastic, two-player, zero-sum games of perfect information. The algorithm is designed for the class of densely stochastic games; that is, games where one would rarely expect to sample the same successor state multiple times at any particular chance node. Our approach combines sparse sampling techniques from MDP planning with classic pruning techniques developed for adversarial expectimax planning. We compare and contrast our algorithm to the traditional *-Minimax approaches, as well as MCTS enhanced with Double Progressive Widening, on four games: Pig, EinStein Würfelt Nicht!, Can't Stop, and Ra. Our results show that MCMS can be competitive with enhanced MCTS variants in some domains, while consistently outperforming the equivalent classic approaches given the same amount of thinking time.

1 Introduction

Monte Carlo sampling has recently become a popular technique for online planning in large sequential games. For example, UCT and, more generally, Monte Carlo Tree Search (MCTS) [Kocsis and Szepesvári, 2006; Coulom, 2007b] have led to an increase in the performance of Computer Go players [Lee et al., 2009], and numerous extensions and applications have since followed [Browne et al., 2012]. Initially, MCTS was applied to games lacking strong Minimax players, but it has recently been shown to compete against strong Minimax players in such games [Winands et al., 2010; Ramanujan and Selman, 2011]. One class of games that has proven more resistant is stochastic games. Unlike classic games such as Chess and Go, stochastic game trees include chance nodes in addition to decision nodes. How MCTS should account for this added uncertainty remains unclear. Moreover, many of the search enhancements from the classic αβ literature cannot be easily adapted to MCTS. The classic algorithms for stochastic games, EXPECTIMAX and *-Minimax (Star1 and Star2), perform look-ahead searches to a limited depth. However, the running time of these algorithms scales exponentially in the branching factor at chance nodes as the search horizon is increased. Hence, their performance in large games often depends heavily on the quality of the heuristic evaluation function, as only shallow searches are possible.

One way to handle the uncertainty at chance nodes would be forward pruning [Smith and Nau, 1993], but the performance gain until now has been small [Schadd et al., 2009]. Another way is to simply sample a single outcome when encountering a chance node. This is common practice in MCTS when applied to stochastic games. However, the general performance of this method is unknown. Large stochastic domains still pose a significant challenge. For instance, MCTS is outperformed by *-Minimax in the game of Carcassonne [Heyden, 2009]. Unfortunately, the literature on the application of Monte Carlo search methods to stochastic games is relatively small.

In this paper, we investigate the use of Monte Carlo sampling in *-Minimax search. We introduce a new algorithm, Monte Carlo *-Minimax Search (MCMS), which samples a subset of chance node outcomes in EXPECTIMAX and *-Minimax in stochastic games. In particular, we describe a sampling technique for chance nodes based on sparse sampling [Kearns et al., 1999] and show that MCMS approaches the optimal decision as the number of samples grows. We evaluate the practical performance of MCMS in four domains: Pig, EinStein Würfelt Nicht!, Can't Stop, and Ra. In Pig, we show that the estimates returned by MCMS have lower bias and lower regret than the estimates returned by the classic *-Minimax algorithms. Finally, we show that the addition of sampling to *-Minimax can increase its performance from inferior to competitive against state-of-the-art MCTS, and in the case of Ra, can even perform better than MCTS.

2 Background

A finite, two-player zero-sum game of perfect information can be described as a tuple $(\mathcal{S}, \mathcal{T}, \mathcal{A}, \mathcal{P}, u_1, s_1)$, which we now define. The state space $\mathcal{S}$ is a finite, non-empty set of states, with $\mathcal{T} \subseteq \mathcal{S}$ denoting the finite, non-empty set of terminal states. The action space $\mathcal{A}$ is a finite, non-empty set of actions. The transition probability function $\mathcal{P}$ assigns to each state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability measure over $\mathcal{S}$ that we denote by $\mathcal{P}(\cdot \mid s, a)$. The utility function $u_1 : \mathcal{T} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$ gives the utility of player 1, with $v_{\min}$ and $v_{\max}$ denoting the minimum and maximum possible utility, respectively. Since the game is zero-sum, the utility of player 2 in any state $s \in \mathcal{T}$ is given by $u_2(s) := -u_1(s)$. The player index function $\tau : \mathcal{S} \setminus \mathcal{T} \to \{1, 2\}$ returns the player to act in a given non-terminal state $s$.

Each game starts in the initial state $s_1$ with $\tau(s_1) := 1$, and proceeds as follows. For each time step $t \in \mathbb{N}$, player $\tau(s_t)$ selects an action $a_t \in \mathcal{A}$ in state $s_t$, with the next state $s_{t+1}$ generated according to $\mathcal{P}(\cdot \mid s_t, a_t)$. Player $\tau(s_{t+1})$ then chooses a next action and the cycle continues until some terminal state $s_T \in \mathcal{T}$ is reached. At this point player 1 and player 2 receive a utility of $u_1(s_T)$ and $u_2(s_T)$ respectively.
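To make the formal model concrete, the following is a minimal Python sketch of how such a game description might be represented in code. The interface name and method names are illustrative assumptions, not part of the paper; any game matching the tuple definition (finite states and actions, a transition distribution, a terminal utility for player 1, and a player-to-move function) fits this shape.

```python
from typing import Dict, Hashable, List, Protocol

State = Hashable
Action = Hashable

class StochasticGame(Protocol):
    """Illustrative interface for the tuple (S, T, A, P, u1, s1)."""

    v_min: float  # minimum possible utility for player 1
    v_max: float  # maximum possible utility for player 1

    def initial_state(self) -> State: ...           # s1
    def is_terminal(self, s: State) -> bool: ...    # is s in T?
    def utility(self, s: State) -> float: ...       # u1(s), defined on terminal states
    def player_to_move(self, s: State) -> int: ...  # tau(s) in {1, 2}
    def legal_actions(self, s: State) -> List[Action]: ...

    def outcomes(self, s: State, a: Action) -> Dict[State, float]:
        """P(. | s, a): successor states mapped to their probabilities."""
        ...
```

A dice game such as Pig would fit this directly: outcomes(s, a) would enumerate the die rolls reachable from (s, a) together with their probabilities.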
2.1 Classic Game Tree Search

We now describe the two main search paradigms for adversarial stochastic game tree search. We begin by first describing classic stochastic search techniques, which differ from modern approaches in that they do not use Monte Carlo sampling. This requires recursively defining the minimax value of a state $s \in \mathcal{S}$, which is given by

$$
V(s) =
\begin{cases}
\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 1, \\
\min_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 2, \\
u_1(s) & \text{otherwise.}
\end{cases}
$$

Note that here we always treat player 1 as the player maximizing $u_1(s)$ (Max), and player 2 as the player minimizing $u_1(s)$ (Min). In most large games, computing the minimax value for a given game state is intractable. Because of this, an often used approximation is to instead compute the depth $d$ minimax value. This requires limiting the recursion to some fixed depth $d \in \mathbb{N}$ and applying a heuristic evaluation function when this depth limit is reached. Thus, given a heuristic evaluation function $h : \mathcal{S} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$ defined with respect to player 1 that satisfies the requirement $h(s) = u_1(s)$ when $s \in \mathcal{T}$, the depth $d$ minimax value is defined recursively by

$$
V_d(s) =
\begin{cases}
\max_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T},\ \text{and } \tau(s) = 1, \\
\min_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T},\ \text{and } \tau(s) = 2, \\
h(s) & \text{otherwise,}
\end{cases}
$$

where $V_d(s, a) = \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s')$ is the expected depth $d-1$ value over the successors of the state-action pair $(s, a)$.
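As a concrete illustration, here is a minimal Python sketch that computes $V_d(s)$ directly from the recursive definition above. It reuses the hypothetical StochasticGame interface sketched earlier; the heuristic shown is only a placeholder, since a real evaluation function is domain-specific.

```python
def depth_d_value(game: "StochasticGame", s, d: int) -> float:
    """Depth-d minimax value V_d(s): recurse to depth d, then fall back to h."""
    if d == 0 or game.is_terminal(s):
        return heuristic(game, s)  # h(s); must equal u1(s) on terminal states

    values = []
    for a in game.legal_actions(s):
        # V_d(s, a): expectation of V_{d-1} over the chance outcomes of (s, a).
        v = sum(p * depth_d_value(game, s2, d - 1)
                for s2, p in game.outcomes(s, a).items())
        values.append(v)

    # Player 1 maximizes u1, player 2 minimizes it.
    return max(values) if game.player_to_move(s) == 1 else min(values)


def heuristic(game: "StochasticGame", s) -> float:
    """Placeholder evaluation function h(s); a real h is domain-specific."""
    return game.utility(s) if game.is_terminal(s) else 0.0
```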
A direct computation of $\arg\max_{a \in \mathcal{A}(s)} V_d(s, a)$ or $\arg\min_{a \in \mathcal{A}(s)} V_d(s, a)$ is equivalent to running the well-known EXPECTIMAX algorithm [Michie, 1966]. The base EXPECTIMAX algorithm can be enhanced by a technique similar to αβ pruning [Knuth and Moore, 1975] for deterministic game tree search. This involves correctly propagating the $[\alpha, \beta]$ bounds and performing an additional pruning step at each chance node. This pruning step is based on the observation that if the minimax value has already been computed for a subset of successors $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, the depth $d$ minimax value of the state-action pair $(s, a)$ must lie within

$$
L_d(s, a) \le V_d(s, a) \le U_d(s, a),
$$

where

$$
L_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\min},
$$

$$
U_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\max}.
$$

These bounds form the basis of the pruning mechanisms in the *-Minimax [Ballard, 1983] family of algorithms. In the Star1 algorithm, each $s'$ from the equations above represents the state reached after a particular outcome is applied at a chance node following $(s, a)$. In practice, Star1 maintains lower and upper bounds on $V_{d-1}(s')$ for each child $s'$ at chance nodes, using this information to stop the search when it finds a proof that any future search is pointless. A worked example of how these cuts occur in *-Minimax can be found in [Lanctot et al., 2013].

    Star1(s, a, d, α, β)
      if d = 0 or s ∈ T then return h(s)
      else
        O ← genOutcomeSet(s, a)
        for o ∈ O do
          α' ← childAlpha(o, α)
          β' ← childBeta(o, β)
          s' ← actionChanceEvent(s, a, o)
          v ← alphabeta1(s', d − 1, α', β')
          o_l ← v; o_u ← v
          if v ≥ β' then return pess(O)
          if v ≤ α' then return opti(O)
        return V_d(s, a)

Algorithm 1: Star1

The algorithm is summarized in Algorithm 1. The alphabeta1 procedure recursively calls Star1.
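To make the pseudocode concrete, below is a rough Python sketch of the chance-node loop in the spirit of Algorithm 1, again built on the hypothetical StochasticGame interface and heuristic from the earlier sketches. The derivations of the child windows (childAlpha, childBeta) and of the pess/opti return values are not spelled out above; the versions here follow the usual Star1 bookkeeping (exact values for searched outcomes, v_min/v_max for the rest) and should be read as an illustrative assumption rather than the authors' implementation. The decision-node routine alphabeta1 is likewise only sketched, since the text only states that it recursively calls Star1.

```python
def star1_value(game: "StochasticGame", s, a, d: int, alpha: float, beta: float) -> float:
    """Chance-node value of (s, a), pruned with running bounds (illustrative Star1 sketch)."""
    outcomes = list(game.outcomes(s, a).items())   # [(s', P(s' | s, a)), ...]
    # Per-outcome bounds: start at [v_min, v_max], tighten to the exact value once searched.
    lower = [game.v_min] * len(outcomes)
    upper = [game.v_max] * len(outcomes)

    def pess() -> float:  # current lower bound L_d(s, a)
        return sum(p * lo for (_, p), lo in zip(outcomes, lower))

    def opti() -> float:  # current upper bound U_d(s, a)
        return sum(p * up for (_, p), up in zip(outcomes, upper))

    for i, (s2, p) in enumerate(outcomes):
        # Child window chosen so that a child value falling outside it proves a cutoff.
        child_alpha = max(game.v_min, (alpha - (opti() - p * game.v_max)) / p)
        child_beta = min(game.v_max, (beta - (pess() - p * game.v_min)) / p)
        v = alphabeta1(game, s2, d - 1, child_alpha, child_beta)
        lower[i] = upper[i] = v
        if v >= child_beta:
            return pess()   # already proves V_d(s, a) >= beta: fail high
        if v <= child_alpha:
            return opti()   # already proves V_d(s, a) <= alpha: fail low
    return pess()           # all outcomes searched exactly, so pess() equals V_d(s, a)


def alphabeta1(game: "StochasticGame", s, d: int, alpha: float, beta: float) -> float:
    """Decision-node search that evaluates each chance node with star1_value."""
    if d == 0 or game.is_terminal(s):
        return heuristic(game, s)
    maximizing = game.player_to_move(s) == 1
    best = game.v_min if maximizing else game.v_max
    for a in game.legal_actions(s):
        v = star1_value(game, s, a, d, alpha, beta)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if alpha >= beta:
            break
    return best
```

At the root, a move choice under this sketch would be the arg max (for Max, arg min for Min) of star1_value(game, s, a, d, v_min, v_max) over the legal actions.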