Monte Carlo *-Minimax Search

Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Marc Lanctot, Department of Knowledge Engineering, Maastricht University, Netherlands
Abdallah Saffidine, LAMSADE, Université Paris-Dauphine, France
Joel Veness and Christopher Archibald, Department of Computing Science, University of Alberta, Canada
Mark H. M. Winands, Department of Knowledge Engineering, Maastricht University, Netherlands

Abstract

This paper introduces Monte Carlo *-Minimax Search (MCMS), a Monte Carlo search algorithm for turn-based, stochastic, two-player, zero-sum games of perfect information. The algorithm is designed for the class of densely stochastic games; that is, games where one would rarely expect to sample the same successor state multiple times at any particular chance node. Our approach combines sparse sampling techniques from MDP planning with classic pruning techniques developed for adversarial expectimax planning. We compare and contrast our algorithm to the traditional *-Minimax approaches, as well as MCTS enhanced with Double Progressive Widening, on four games: Pig, EinStein Würfelt Nicht!, Can't Stop, and Ra. Our results show that MCMS can be competitive with enhanced MCTS variants in some domains, while consistently outperforming the equivalent classic approaches given the same amount of thinking time.

1 Introduction

Monte Carlo sampling has recently become a popular technique for online planning in large sequential games. For example, UCT and, more generally, Monte Carlo Tree Search (MCTS) [Kocsis and Szepesvári, 2006; Coulom, 2007b] have led to an increase in the performance of Computer Go players [Lee et al., 2009], and numerous extensions and applications have since followed [Browne et al., 2012]. Initially, MCTS was applied to games lacking strong Minimax players, but it has recently been shown to compete against strong Minimax players in such games [Winands et al., 2010; Ramanujan and Selman, 2011]. One class of games that has proven more resistant is stochastic games. Unlike classic games such as Chess and Go, stochastic game trees include chance nodes in addition to decision nodes. How MCTS should account for this added uncertainty remains unclear. Moreover, many of the search enhancements from the classic αβ literature cannot be easily adapted to MCTS.

The classic algorithms for stochastic games, EXPECTIMAX and *-Minimax (Star1 and Star2), perform look-ahead searches to a limited depth. However, the running time of these algorithms scales exponentially in the branching factor at chance nodes as the search horizon is increased. Hence, their performance in large games often depends heavily on the quality of the heuristic evaluation function, as only shallow searches are possible.

One way to handle the uncertainty at chance nodes would be forward pruning [Smith and Nau, 1993], but the performance gain until now has been small [Schadd et al., 2009]. Another way is to simply sample a single outcome when encountering a chance node. This is common practice in MCTS when applied to stochastic games. However, the general performance of this method is unknown. Large stochastic domains still pose a significant challenge. For instance, MCTS is outperformed by *-Minimax in the game of Carcassonne [Heyden, 2009]. Unfortunately, the literature on the application of Monte Carlo search methods to stochastic games is relatively small.

In this paper, we investigate the use of Monte Carlo sampling in *-Minimax search. We introduce a new algorithm, Monte Carlo *-Minimax Search (MCMS), which samples a subset of chance node outcomes in EXPECTIMAX and *-Minimax in stochastic games. In particular, we describe a sampling technique for chance nodes based on sparse sampling [Kearns et al., 1999] and show that MCMS approaches the optimal decision as the number of samples grows. We evaluate the practical performance of MCMS in four domains: Pig, EinStein Würfelt Nicht!, Can't Stop, and Ra. In Pig, we show that the estimates returned by MCMS have lower bias and lower regret than the estimates returned by the classic *-Minimax algorithms. Finally, we show that the addition of sampling to *-Minimax can increase its performance from inferior to competitive against state-of-the-art MCTS, and in the case of Ra, can even perform better than MCTS.
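The core of MCMS is the replacement of full expectations at chance nodes with sample averages. As a rough illustration (our own sketch with hypothetical helper names, not the authors' implementation), a sparse-sampling estimate of a chance node's value averages the values of c successors drawn with replacement from P(· | s, a):

    import random

    def sample_chance_node_value(successors, probs, value_fn, c, rng=None):
        """Sparse-sampling estimate of a chance node's expected value.

        successors: successor states reachable from the chance node
        probs:      their probabilities under P(. | s, a)
        value_fn:   evaluates a successor (e.g., a depth-limited search)
        c:          sample count; the estimate improves as c grows
        """
        rng = rng or random.Random()
        # Draw c successors with replacement according to P(. | s, a),
        # then average their values instead of summing over all outcomes.
        sampled = rng.choices(successors, weights=probs, k=c)
        return sum(value_fn(s) for s in sampled) / c

The appeal of sparse sampling [Kearns et al., 1999] is that the number of samples required for a near-accurate estimate does not grow with the number of states.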
2 Background

A finite, two-player zero-sum game of perfect information can be described as a tuple $(\mathcal{S}, \mathcal{T}, \mathcal{A}, \mathcal{P}, u_1, s_1)$, which we now define. The state space $\mathcal{S}$ is a finite, non-empty set of states, with $\mathcal{T} \subseteq \mathcal{S}$ denoting the finite, non-empty set of terminal states. The action space $\mathcal{A}$ is a finite, non-empty set of actions. The transition probability function $\mathcal{P}$ assigns to each state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability measure over $\mathcal{S}$ that we denote by $\mathcal{P}(\cdot \mid s, a)$. The utility function $u_1 : \mathcal{T} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$ gives the utility of player 1, with $v_{\min}$ and $v_{\max}$ denoting the minimum and maximum possible utility, respectively. Since the game is zero-sum, the utility of player 2 in any state $s \in \mathcal{T}$ is given by $u_2(s) := -u_1(s)$. The player index function $\tau : \mathcal{S} \setminus \mathcal{T} \to \{1, 2\}$ returns the player to act in a given non-terminal state $s$.

Each game starts in the initial state $s_1$ with $\tau(s_1) := 1$, and proceeds as follows. For each time step $t \in \mathbb{N}$, player $\tau(s_t)$ selects an action $a_t \in \mathcal{A}$ in state $s_t$, with the next state $s_{t+1}$ generated according to $\mathcal{P}(\cdot \mid s_t, a_t)$. Player $\tau(s_{t+1})$ then chooses a next action and the cycle continues until some terminal state $s_T \in \mathcal{T}$ is reached. At this point player 1 and player 2 receive a utility of $u_1(s_T)$ and $u_2(s_T)$ respectively.

2.1 Classic Game Tree Search

We now describe the two main search paradigms for adversarial stochastic game tree search. We begin by describing classic stochastic search techniques, which differ from modern approaches in that they do not use Monte Carlo sampling. This requires recursively defining the minimax value of a state $s \in \mathcal{S}$, which is given by

$$
V(s) = \begin{cases}
\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 1 \\
\min_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 2 \\
u_1(s) & \text{otherwise.}
\end{cases}
$$

Note that here we always treat player 1 as the player maximizing $u_1(s)$ (Max), and player 2 as the player minimizing $u_1(s)$ (Min). In most large games, computing the minimax value for a given game state is intractable. Because of this, an often used approximation is to instead compute the depth $d$ minimax value. This requires limiting the recursion to some fixed depth $d \in \mathbb{N}$ and applying a heuristic evaluation function when this depth limit is reached. Thus given a heuristic evaluation function $h : \mathcal{S} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$ defined with respect to player 1 that satisfies the requirement $h(s) = u_1(s)$ when $s \in \mathcal{T}$, the depth $d$ minimax value is defined recursively by

$$
V_d(s) = \begin{cases}
\max_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 1 \\
\min_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 2 \\
h(s) & \text{otherwise,}
\end{cases}
$$

where $V_d(s, a) = \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s')$ denotes the depth $d$ value of the state-action pair $(s, a)$.

A direct computation of $\arg\max_{a \in \mathcal{A}(s)} V_d(s, a)$ or $\arg\min_{a \in \mathcal{A}(s)} V_d(s, a)$ is equivalent to running the well-known EXPECTIMAX algorithm [Michie, 1966].
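As a concrete reference point, the following minimal Python sketch computes $V_d(s)$ by direct recursion, i.e., plain EXPECTIMAX without pruning. The Game interface is our own abstraction for illustration (not from the paper):

    from typing import Dict, List

    class Game:
        """Minimal interface for a finite stochastic zero-sum game (illustrative)."""
        def is_terminal(self, s) -> bool: ...      # s in T ?
        def heuristic(self, s) -> float: ...       # h(s), with h(s) = u1(s) on T
        def player(self, s) -> int: ...            # tau(s) in {1, 2}
        def actions(self, s) -> List: ...          # available actions A(s)
        def transitions(self, s, a) -> Dict: ...   # {s': P(s' | s, a)}

    def expectimax(g: Game, s, d: int) -> float:
        """Depth-d minimax value V_d(s)."""
        if d == 0 or g.is_terminal(s):
            return g.heuristic(s)
        # V_d(s, a) = sum over s' of P(s' | s, a) * V_{d-1}(s')
        values = (
            sum(p * expectimax(g, sp, d - 1)
                for sp, p in g.transitions(s, a).items())
            for a in g.actions(s)
        )
        return max(values) if g.player(s) == 1 else min(values)

The exponential dependence on the chance-node branching factor is visible here: every call expands every successor of every action.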
The base EXPECTIMAX algorithm can be enhanced by a technique similar to αβ pruning [Knuth and Moore, 1975] for deterministic game tree search. This involves correctly propagating the $[\alpha, \beta]$ bounds and performing an additional pruning step at each chance node. This pruning step is based on the observation that if the minimax value has already been computed for a subset of successors $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, the depth $d$ minimax value of state-action pair $(s, a)$ must lie within

$$
L_d(s, a) \le V_d(s, a) \le U_d(s, a),
$$

where

$$
L_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\min},
$$

$$
U_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\max}.
$$

These bounds form the basis of the pruning mechanisms in the *-Minimax [Ballard, 1983] family of algorithms. In the Star1 algorithm, each $s'$ from the equations above represents the state reached after a particular outcome is applied at a chance node following $(s, a)$. In practice, Star1 maintains lower and upper bounds on $V_{d-1}(s')$ for each child $s'$ at chance nodes, using this information to stop the search when it finds a proof that any future search is pointless. A worked example of how these cuts occur in *-Minimax can be found in [Lanctot et al., 2013].

The algorithm is summarized in Algorithm 1; the alphabeta1 procedure recursively calls Star1.

    Star1(s, a, d, α, β)
        if d = 0 or s ∈ T then return h(s)
        else
            O ← genOutcomeSet(s, a)
            for o ∈ O do
                α′ ← childAlpha(o, α)
                β′ ← childBeta(o, β)
                s′ ← actionChanceEvent(s, a, o)
                v ← alphabeta1(s′, d − 1, α′, β′)
                o_l ← v; o_u ← v
                if v ≥ β′ then return pess(O)
                if v ≤ α′ then return opti(O)
            return V_d(s, a)

    Algorithm 1: Star1
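For concreteness, here is one plausible reading of Algorithm 1 as runnable Python, reusing the Game interface from the earlier sketch. The Outcome record, the inlined childAlpha/childBeta window computations, and the constants V_MIN and V_MAX (standing for $v_{\min}$ and $v_{\max}$) are our assumptions; pess and opti implement the $L_d$ and $U_d$ bounds above, with $v_{\min}$/$v_{\max}$ standing in for unvisited successors. This is an illustrative sketch, not the authors' implementation:

    V_MIN, V_MAX = -1.0, 1.0  # assumed utility range [v_min, v_max]

    class Outcome:
        """A chance outcome: successor state, probability, bounds on V_{d-1}(s')."""
        def __init__(self, state, prob):
            self.state, self.prob = state, prob
            self.lower, self.upper = V_MIN, V_MAX

    def pess(O):  # L_d(s, a): known child values, v_min for the rest
        return sum(o.prob * o.lower for o in O)

    def opti(O):  # U_d(s, a): known child values, v_max for the rest
        return sum(o.prob * o.upper for o in O)

    def star1(g, s, a, d, alpha, beta):
        """Value of chance node (s, a) to depth d, pruning outside [alpha, beta]."""
        if d == 0 or g.is_terminal(s):
            return g.heuristic(s)  # h(s) = u1(s) on terminal states
        O = [Outcome(sp, p) for sp, p in g.transitions(s, a).items()]
        for o in O:
            # Tightest child window such that V_d(s, a) can still fall inside
            # [alpha, beta]; this inlines childAlpha and childBeta.
            a_child = max(V_MIN, (alpha - (opti(O) - o.prob * o.upper)) / o.prob)
            b_child = min(V_MAX, (beta - (pess(O) - o.prob * o.lower)) / o.prob)
            v = alphabeta1(g, o.state, d - 1, a_child, b_child)
            o.lower = o.upper = v  # this child's value is now exact
            if v >= b_child:
                return pess(O)  # proven: V_d(s, a) >= beta (fail-high)
            if v <= a_child:
                return opti(O)  # proven: V_d(s, a) <= alpha (fail-low)
        return pess(O)  # all children exact, so pess(O) = opti(O) = V_d(s, a)

    def alphabeta1(g, s, d, alpha, beta):
        """Decision node: standard alpha-beta over actions, star1 at chance nodes."""
        if d == 0 or g.is_terminal(s):
            return g.heuristic(s)
        maximizing = g.player(s) == 1
        best = V_MIN if maximizing else V_MAX
        for a in g.actions(s):
            v = star1(g, s, a, d, alpha, beta)
            if maximizing:
                best, alpha = max(best, v), max(alpha, v)
            else:
                best, beta = min(best, v), min(beta, v)
            if alpha >= beta:
                break  # standard alpha-beta cutoff
        return best

Note how a collapse of the child window (v ≥ β′ or v ≤ α′) lets the chance node return a proven bound, pess(O) or opti(O), instead of an exact value; the tighter the incoming $[\alpha, \beta]$ window, the earlier these cuts fire.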
