Sparse Sampling for Adversarial Games
Marc Lanctot¹, Abdallah Saffidine², Joel Veness¹, Chris Archibald¹
¹ University of Alberta, Edmonton, Alberta, Canada
² LAMSADE, Université Paris-Dauphine, Paris, France

Abstract. This paper introduces Monte Carlo *-Minimax Search (MCMS), a Monte-Carlo search algorithm for finite, turn-based, stochastic, two-player, zero-sum games of perfect information. Through a combination of sparse sampling and classical pruning techniques, MCMS allows deep plans to be constructed. Unlike other popular tree search techniques, MCMS is suitable for densely stochastic games, i.e., games where one would never expect to sample the same state twice. We give a basis for the theoretical properties of the algorithm and evaluate its performance in three games: Pig (Pig Out), EinStein Würfelt Nicht!, and Can't Stop.

1 Introduction

Monte-Carlo Tree Search (MCTS) has recently become one of the dominant paradigms for online planning in large sequential games. Since its initial application to Computer Go [Gelly et al., 2006, Coulom, 2007a], numerous extensions have been proposed, allowing this general approach [Chaslot et al., 2008a] to be successfully adapted to a variety of challenging problem settings, including real-time strategy games, imperfect information games, and General Game Playing [Finnsson and Björnsson, 2008, Szita et al., 2010, Winands et al., 2010, Auger, 2011, Ciancarini and Favini, 2010]. At first, the method was applied to games lacking strong Minimax players, but it has recently been shown to be competitive against strong Minimax players in such games [Ramanujan and Selman, 2011, Winands and Björnsson, 2010].

One class of games that has proven more resistant is stochastic games. Unlike classical games such as Chess and Go, stochastic game trees include chance nodes in addition to decision nodes. How MCTS should account for this added uncertainty remains unclear. The classical algorithms for stochastic games, Expectimax (exp) and *-Minimax (Star1 and Star2), perform look-ahead searches to a limited depth. However, both algorithms scale exponentially in the branching factor at chance nodes as the search horizon is increased. Hence, their performance in large games often depends heavily on the quality of the heuristic evaluation function, as only shallow searches are possible.

One way to handle the uncertainty at chance nodes is to simply sample a single outcome whenever a chance node is encountered. This is common practice in MCTS when applied to stochastic games; however, the general performance of this method is questionable, and large stochastic domains still pose a significant challenge. For example, MCTS is outperformed by *-Minimax in the game of Carcassonne [Heyden, 2009]. Unfortunately, the literature on the application of Monte Carlo search methods to stochastic games is relatively small, possibly due to the lack of a principled and practical algorithm for these domains.

In this paper, we introduce a new algorithm, called Monte Carlo *-Minimax Search (MCMS), which can increase the performance of search algorithms in stochastic games by sampling a subset of chance event outcomes. We describe a sampling technique for chance nodes based on sparse sampling [Kearns et al., 1999] and present a theorem showing that MCMS approaches the optimal decision as the number of samples grows. We evaluate the practical performance of MCMS in three domains: Pig (Pig Out), EinStein Würfelt Nicht! (EWN), and Can't Stop. Finally, we show in Pig (Pig Out) that the estimates returned by MCMS have lower mean-squared error and lower regret than the estimates returned by MCTS.
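For concreteness, the core idea can be sketched in a few lines of Python: at a chance node, the expectation over outcomes is estimated from a fixed number of sampled successors rather than by enumerating the full outcome distribution, in the spirit of [Kearns et al., 1999]. The names used here (sample_successor, evaluate, num_samples) are illustrative assumptions for this sketch, not identifiers taken from MCMS.

def sparse_chance_value(state, action, sample_successor, evaluate, num_samples):
    # Estimate E[V(s') | s, a] from num_samples sampled outcomes instead of
    # summing over every outcome of the distribution P(. | s, a).
    total = 0.0
    for _ in range(num_samples):
        successor = sample_successor(state, action)  # draws s' ~ P(. | s, a)
        total += evaluate(successor)                 # e.g., a recursive search value
    return total / num_samples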
2 Background

A finite, two-player, zero-sum game of perfect information can be described as a tuple (S, T, A, P, u_1, s_1), which we now define. The state space S is a finite, non-empty set of states, with T ⊆ S denoting the finite, non-empty set of terminal states. The action space A is a finite, non-empty set of actions. The transition probability function P assigns to each state-action pair (s, a) ∈ S × A a probability measure over S that we denote by P(· | s, a). The utility function u_1 : T → [v_min, v_max] ⊂ ℝ gives the utility of player 1, with v_min and v_max denoting the minimum and maximum possible utility, respectively. Since the game is zero-sum, the utility of player 2 in any state s ∈ T is given by u_2(s) := −u_1(s). The player index function τ : S → {1, 2} returns the player to act in a given state s if s ∈ S \ T; otherwise it returns 1 in the case where s ∈ T.

Each game starts in the initial state s_1 with τ(s_1) := 1 and proceeds as follows, for each time step t ∈ ℕ: first, player τ(s_t) selects an action a_t ∈ A in state s_t, with the next state s_{t+1} generated according to P(· | s_t, a_t). Player τ(s_{t+1}) then chooses a next action, and the cycle continues until some terminal state s_T ∈ T is reached. At this point, player 1 and player 2 receive utilities of u_1(s_T) and u_2(s_T), respectively.

2.1 Classical Game Tree Search

We now describe the two main search paradigms for adversarial stochastic game tree search, beginning with the classical techniques, which differ from modern approaches in that they do not use Monte-Carlo sampling. The minimax value of a state s ∈ S is defined by

\[
V(s) :=
\begin{cases}
\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, V(s') & \text{if } s \notin T \text{ and } \tau(s) = 1, \\
\min_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, V(s') & \text{if } s \notin T \text{ and } \tau(s) = 2, \\
u_1(s) & \text{otherwise.}
\end{cases}
\]

Note that we always treat player 1 as the player maximizing u_1(s) (Max), and player 2 as the player minimizing u_1(s) (Min). In most large games, computing the minimax value for a given game state is intractable. Because of this, an often-used approximation is to instead compute the finite horizon minimax value. This requires limiting the recursion to some fixed depth d ∈ ℕ and applying a heuristic evaluation function h : S → [v_min, v_max] when this depth limit is reached. Thus, given a heuristic evaluation function h_1 : S → ℝ defined with respect to player 1 that satisfies the requirement h_1(s) = u_1(s) when s ∈ T, the finite horizon minimax value is defined recursively by

\[
V_d(s) :=
\begin{cases}
\max_{a \in A} V_d(s, a) & \text{if } d > 0,\ s \notin T, \text{ and } \tau(s) = 1, \\
\min_{a \in A} V_d(s, a) & \text{if } d > 0,\ s \notin T, \text{ and } \tau(s) = 2, \\
h_1(s) & \text{otherwise,}
\end{cases}
\]

where

\[
V_d(s, a) := \sum_{s' \in S} P(s' \mid s, a)\, V_{d-1}(s'). \tag{1}
\]

For sufficiently large d, V_d(s) coincides with V(s). The quality of the approximation depends on both the heuristic evaluation function and the search depth parameter d. A direct computation of argmax_{a ∈ A(s)} V_d(s, a) or argmin_{a ∈ A(s)} V_d(s, a) is equivalent to running the well-known EXPECTIMINIMAX algorithm [Russell and Norvig, 2010].
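This recursion maps directly onto code. The following is a minimal Python sketch of depth-limited EXPECTIMINIMAX; the game interface (is_terminal, player, actions, outcomes, heuristic) consists of illustrative names assumed for this sketch rather than anything prescribed above, and the chance-node helper computes the full expectation of Equation (1).

def expectiminimax(state, depth, game):
    # Depth-limited minimax value V_d(s), expanding every chance outcome.
    if depth == 0 or game.is_terminal(state):
        return game.heuristic(state)  # h_1(s), equal to u_1(s) at terminal states
    values = [chance_value(state, a, depth, game) for a in game.actions(state)]
    return max(values) if game.player(state) == 1 else min(values)

def chance_value(state, action, depth, game):
    # V_d(s, a): the exact expectation of Equation (1) over all chance outcomes.
    return sum(p * expectiminimax(succ, depth - 1, game)
               for succ, p in game.outcomes(state, action))

Note that the cost of chance_value grows with the full branching factor at chance nodes, which is precisely the scaling problem described in the introduction.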
The base EXPECTIMINIMAX algorithm can be enhanced by a technique similar to alpha-beta pruning for deterministic game tree search. This involves correctly propagating the [α, β] bounds and performing an additional pruning step at each chance node. This pruning step is based on the simple observation that if the minimax value has already been computed for a subset of successors S̃ ⊂ S, the value of the state-action pair (s, a) must lie within L_d(s, a) ≤ V_d(s, a) ≤ U_d(s, a), where

\[
L_d(s, a) := \sum_{s' \in \tilde{S}} P(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in S \setminus \tilde{S}} P(s' \mid s, a)\, v_{\min},
\]
\[
U_d(s, a) := \sum_{s' \in \tilde{S}} P(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in S \setminus \tilde{S}} P(s' \mid s, a)\, v_{\max}.
\]

These bounds form the basis of the pruning mechanisms in the *-Minimax [Ballard, 1983] family of algorithms. In the Star1 algorithm, each s' in the equations above represents the state reached after a particular outcome is applied at a chance node following (s, a). In practice, Star1 maintains lower and upper bounds on V_{d-1}(s') for each child s' at chance nodes, using this information to stop the search as soon as it can prove that any further search is pointless.

To better understand when cutoffs occur in *-Minimax, we now present an example adapted from Ballard's original paper. Consider Figure 1. The algorithm recurses down from state s with a window of [α, β] = [4, 5] and encounters a chance node. Before any of the children have been searched, the bounds on the values returned are (v_min, v_max) = (−10, +10). The subtree of a child, say s', is searched and returns V_{d-1}(s') = 2. Since this value is now known, the upper and lower bounds for that outcome both become 2. Assuming a uniform distribution over the three chance events, the lower bound on the minimax value of the chance node becomes (2 − 10 − 10)/3 = −6 and the upper bound becomes (2 + 10 + 10)/3 ≈ 7.3. If ever the lower bound on the value of the chance node exceeds β, or the upper bound on the value of the chance node drops below α, the subtree is pruned.

Fig. 1: An example of the STAR1 algorithm.

Algorithm 1 Star1
 1: Star1(s, a, d, α, β, c)
 2:   if d = 0 or s ∈ T then return (h_1(s), null)
 3:   else if ¬c then return alphabeta1(s, d, α, β)
 4:   else
 5:     o ← genOutcomeSet(s, a, v_min, v_max)
 6:     N ← |o|
 7:     for i ∈ {0, ..., N − 1}
 8:       α′ ← computeChildAlpha(o, α, i); β′ ← computeChildBeta(o, β, i)
 9:       s′ ← applyActionAndChanceOutcome(s, a, i)
10:       (v, a′) ← Star1(s′, null, d − 1, max(v_min, α′), min(v_max, β′), false)
11:       o_i^l ← v; o_i^u ← v
12:       if v ≥ β′ then return (lowerBound(o), null)
13:       if v ≤ α′ then return (upperBound(o), null)
14:     return exactValue(o)
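As a rough illustration of how these bounds produce cutoffs, the Python sketch below evaluates a single chance node with Star1-style pruning, mirroring the structure of Algorithm 1. Everything about the interface is an assumption made for this sketch: game.outcomes(state, action) is assumed to yield (successor, probability) pairs, game.v_min and game.v_max to bound all utilities, and search(successor, depth, alpha, beta) to return a depth-limited minimax value. The sketch also returns plain scalar values rather than the (value, move) pairs of Algorithm 1.

def star1_chance_value(state, action, depth, alpha, beta, game, search):
    outcomes = list(game.outcomes(state, action))
    lower = [game.v_min] * len(outcomes)  # lower bound on each child's value
    upper = [game.v_max] * len(outcomes)  # upper bound on each child's value

    for i, (succ, p) in enumerate(outcomes):
        # Child window: the tightest [alpha', beta'] for which this child's value
        # can still move the node's expectation inside [alpha, beta], given the
        # bounds already known for the other children.
        others_upper = sum(q * upper[j] for j, (_, q) in enumerate(outcomes) if j != i)
        others_lower = sum(q * lower[j] for j, (_, q) in enumerate(outcomes) if j != i)
        child_alpha = (alpha - others_upper) / p
        child_beta = (beta - others_lower) / p

        v = search(succ, depth - 1,
                   max(game.v_min, child_alpha), min(game.v_max, child_beta))
        lower[i] = upper[i] = v  # this outcome's value is now known exactly

        node_lower = sum(q * lower[j] for j, (_, q) in enumerate(outcomes))
        node_upper = sum(q * upper[j] for j, (_, q) in enumerate(outcomes))
        if node_lower >= beta:   # value of the chance node is provably >= beta
            return node_lower
        if node_upper <= alpha:  # value of the chance node is provably <= alpha
            return node_upper

    return sum(q * lower[j] for j, (_, q) in enumerate(outcomes))  # exact expectation

The per-outcome window computed inside the loop plays the role of computeChildAlpha and computeChildBeta in Algorithm 1, and the two bound checks correspond to its lowerBound(o) and upperBound(o) cutoff returns.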