Monte-Carlo Simulation Balancing
David Silver [email protected]
Department of Computing Science, University of Alberta, Edmonton, AB

Gerald Tesauro [email protected]
IBM Watson Research Center, 19 Skyline Drive, Hawthorne, NY

Abstract

In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5 × 5 and 6 × 6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and similar overall performance to a sophisticated Go engine.

1. Introduction

Monte-Carlo search algorithms use the average outcome of many simulations to evaluate candidate actions. They have achieved human master level in a variety of stochastic two-player games, including Backgammon (Tesauro & Galperin, 1996), Scrabble (Sheppard, 2002) and heads-up Poker (Billings et al., 1999).

Monte-Carlo tree search evaluates each state in a search-tree by Monte-Carlo simulation. It has proven surprisingly successful in deterministic two-player games, achieving master-level play at 9 × 9 Go (Gelly & Silver, 2007; Coulom, 2007) and winning the General Game-Playing competition (Finnsson & Björnsson, 2008).

In these algorithms, many games of self-play are simulated, using a simulation policy to select actions for both players. The overall performance of Monte-Carlo search is largely determined by the simulation policy. A simulation policy with appropriate domain knowledge can dramatically outperform a uniform random simulation policy (Gelly et al., 2006). Automatically improving the simulation policy is a major goal of current research in this area (Gelly & Silver, 2007; Coulom, 2007; Chaslot et al., 2008). Two approaches have previously been taken to improving the simulation policy.

The first approach is to directly construct a strong simulation policy that performs well by itself, either by hand (Billings et al., 1999), by reinforcement learning (Tesauro & Galperin, 1996; Gelly & Silver, 2007), or by supervised learning (Coulom, 2007). Unfortunately, a stronger simulation policy can actually lead to a weaker Monte-Carlo search (Gelly & Silver, 2007), a paradox that we explore further in this paper.

The second approach is to learn a simulation policy by trial and error, adjusting parameters and testing for improvements in the performance of the Monte-Carlo player, either by hand (Gelly et al., 2006) or by hill-climbing (Chaslot et al., 2008). However, each parameter evaluation usually requires many complete games, thousands of positions, and millions of simulations to be executed. Furthermore, hill-climbing methods do not scale well with increasing dimensionality, and fare poorly with complex policy parameterisations.
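For concreteness, the following sketch (not part of the original experiments) illustrates the basic Monte-Carlo search loop described above: each candidate action is evaluated by the average outcome of many self-play simulations, with a single simulation policy selecting moves for both players. The Game interface is a hypothetical stand-in for a real implementation such as 5 × 5 Go.

```python
import random
from typing import Callable, List

class Game:
    """Hypothetical minimal game interface, a stand-in for a real implementation
    such as 5 x 5 Go. Outcomes are assumed to be scored from the root player's
    perspective."""
    def legal_moves(self) -> List[int]: ...
    def play(self, move: int) -> "Game": ...   # returns the successor state
    def is_terminal(self) -> bool: ...
    def outcome(self) -> float: ...            # terminal score z

def simulate(state: Game, policy: Callable[[Game], int]) -> float:
    """Play one simulated game to the end, with the same simulation policy
    selecting moves for both players, and return its outcome z."""
    while not state.is_terminal():
        state = state.play(policy(state))
    return state.outcome()

def monte_carlo_evaluate(state: Game, policy: Callable[[Game], int],
                         n_sims: int = 1000) -> float:
    """Estimate the value of a state as the average outcome of n_sims simulations."""
    return sum(simulate(state, policy) for _ in range(n_sims)) / n_sims

def monte_carlo_search(state: Game, policy: Callable[[Game], int],
                       n_sims: int = 1000) -> int:
    """Select the candidate action whose successor state has the highest
    Monte-Carlo evaluation."""
    return max(state.legal_moves(),
               key=lambda move: monte_carlo_evaluate(state.play(move), policy, n_sims))

def uniform_random_policy(state: Game) -> int:
    """Baseline simulation policy: choose uniformly among the legal moves."""
    return random.choice(state.legal_moves())
```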
Handcrafting an effective simulation policy is particularly problematic in Go. Many of the top Go programs utilise a small number of simple patterns and rules, based largely on the default policy used in MoGo (Gelly et al., 2006). Adding further Go knowledge without breaking MoGo's "magic formula" has proven to be surprisingly difficult.

In this paper we introduce a new paradigm for learning a simulation policy. We define an objective function, which we call imbalance, that explicitly measures the performance of a simulation policy for Monte-Carlo evaluation. We introduce two new algorithms that minimise the imbalance of a simulation policy by gradient descent. These algorithms require very little computation for each parameter update, and are able to learn expressive simulation policies with hundreds of parameters.

We evaluate our simulation balancing algorithms in the game of Go. We compare them to reinforcement learning and supervised learning algorithms for maximising strength, and to a well-known simulation policy for this domain, handcrafted by trial and error. The simulation policy learnt by our new algorithms significantly outperforms prior approaches.

2. Strength and Balance

We consider deterministic two-player games of finite length with a terminal outcome or score z ∈ ℝ. During simulation, move a is selected in state s according to a stochastic simulation policy πθ(s, a) with parameter vector θ, which is used to select moves for both players. The goal is to find the parameter vector θ* that maximises the overall playing strength of a player based on Monte-Carlo search.
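The abstract notes that the experiments use a softmax policy parameterised by weights for simple patterns. The sketch below shows one standard way to implement such a stochastic simulation policy, with πθ(s, a) proportional to exp(φ(s, a) · θ) over the legal moves; the binary pattern features, their number, and the usage example are illustrative assumptions rather than the exact representation used in the paper.

```python
import numpy as np

def softmax_policy_probs(theta: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Action probabilities pi_theta(s, a) proportional to exp(phi(s, a) . theta).

    features[a] is the feature vector phi(s, a) for legal move a,
    e.g. indicators for ~100 simple local patterns."""
    scores = features @ theta
    scores -= scores.max()                   # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def sample_move(theta: np.ndarray, features: np.ndarray,
                rng: np.random.Generator) -> int:
    """Sample a move index for the player to move; both players share theta."""
    return int(rng.choice(len(features), p=softmax_policy_probs(theta, features)))

# Illustrative usage with random binary pattern features for 10 legal moves.
rng = np.random.default_rng(0)
theta = np.zeros(100)                        # one weight per pattern
phi = rng.integers(0, 2, size=(10, 100)).astype(float)
print(softmax_policy_probs(theta, phi))      # theta = 0 gives the uniform random policy
```

With all weights set to zero the policy reduces to the uniform random simulation policy, which is a convenient starting point for any of the learning methods compared in this paper.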
Our approach is to make the Monte-Carlo evaluations in the search as accurate as possible, by minimising the mean squared error between the estimated values V(s) = (1/N) Σ_{i=1}^{N} z_i and the minimax values V*(s).

When the number of simulations N is large, the mean squared error is dominated by the bias of the simulation policy with respect to the minimax value, V*(s) − E_πθ[z|s], and the variance of the estimate (i.e. the error caused by only seeing a finite number of simulations) can be ignored. Our objective is to minimise the mean squared bias, averaged over the distribution of states ρ(s) that are evaluated during Monte-Carlo search,

    \theta^* = \arg\min_\theta \mathbb{E}_\rho\left[ \left( V^*(s) - \mathbb{E}_{\pi_\theta}\left[ z \mid s \right] \right)^2 \right]    (1)

where E_ρ denotes the expectation over the distribution of actual states ρ(s), and E_πθ denotes the expectation over simulations with policy πθ.

In real-world domains, knowledge of the true minimax values is not available. In practice, we use the values V̂*(s) computed by deep Monte-Carlo tree searches, which converge on the minimax value in the limit (Kocsis & Szepesvári, 2006), as an approximation V̂*(s) ≈ V*(s).

At every time-step t, each player's move incurs some error δ_t = V*(s_{t+1}) − V*(s_t) with respect to the minimax value V*(s_t). We will describe a policy whose individual errors are small as strong, and a policy whose expected error is small as balanced. Intuitively, a strong policy makes few mistakes, whereas a balanced policy may make many mistakes, as long as they cancel each other out on average. Formally, we define the strength J(θ) and k-step imbalance B_k(θ) of a policy πθ,

    J(\theta) = \mathbb{E}_\rho\left[ \mathbb{E}_{\pi_\theta}\left[ \delta_t^2 \mid s_t = s \right] \right]    (2)

    B_k(\theta) = \mathbb{E}_\rho\left[ \left( \mathbb{E}_{\pi_\theta}\left[ \sum_{j=0}^{k-1} \delta_{t+j} \mid s_t = s \right] \right)^2 \right]
                = \mathbb{E}_\rho\left[ \left( \mathbb{E}_{\pi_\theta}\left[ V^*(s_{t+k}) - V^*(s_t) \mid s_t = s \right] \right)^2 \right]    (3)

We consider two choices of k in this paper. The two-step imbalance B_2(θ) is specifically appropriate to two-player games: it allows errors by one player, as long as they are on average cancelled out by the other player's error on the next move. The full imbalance B_∞(θ) allows errors to be committed at any time, as long as they cancel out by the time the game is finished. It is exactly equivalent to the mean squared bias that we are aiming to optimise in Equation 1,

    B_\infty(\theta) = \mathbb{E}_\rho\left[ \left( \mathbb{E}_{\pi_\theta}\left[ V^*(s_T) - V^*(s) \mid s_t = s \right] \right)^2 \right]
                     = \mathbb{E}_\rho\left[ \left( \mathbb{E}_{\pi_\theta}\left[ z \mid s_t = s \right] - V^*(s) \right)^2 \right]    (4)

where s_T is the terminal state with outcome z. Thus, while the direct performance of a policy is largely determined by its strength, the performance of a policy in Monte-Carlo simulation is determined by its full imbalance.

If the simulation policy is optimal, E_πθ[z|s] = V*(s), then perfect balance is achieved, B_∞(θ) = 0. This suggests that optimising the strength of the simulation policy, so that individual moves become closer to optimal, may be sufficient to achieve balance. However, even small errors can rapidly accumulate over the course of long simulations if they are not well-balanced. It is more important to maintain a diverse spread of simulations, which are on average representative of strong play, than for individual moves or simulations to be low in error. Figure 1 shows a simple scenario in which the error of each player is i.i.d.

[Figure 1: two panels plotting Monte-Carlo value against time steps (0 to 100), each showing the individual simulations, their mean, and the minimax value.]

Figure 1. Monte-Carlo simulation in an artificial two-player game. 30 simulations of 100 time steps were executed from an initial state with minimax value 0. Each player selects moves imperfectly during simulation, with an error that is exponentially distributed with respect to the minimax value, with rate parameters λ1 and λ2 respectively. a) The simulation players are strong but imbalanced: λ1 = 10, λ2 = 5; b) the simulation players are weak but balanced: λ1 = 2, λ2 = 2.
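The generative process behind Figure 1 is described only in the caption; the sketch below follows one natural reading of it, in which player 1's exponentially distributed errors lower the minimax value, player 2's errors raise it, and the simulated value is the running sum of per-move errors. Under these assumptions it reproduces the qualitative point of the figure: strong but imbalanced players yield a biased Monte-Carlo estimate, while weak but balanced players average out close to the true minimax value of 0.

```python
import numpy as np

def simulate_errors(rate_p1: float, rate_p2: float, n_sims: int = 30,
                    n_steps: int = 100, seed: int = 0) -> np.ndarray:
    """Artificial two-player game in the style of Figure 1.

    Starting from minimax value 0, the value after each move is the running sum
    of per-move errors. Player 1's errors are drawn from -Exp(rate_p1) (its
    mistakes lower the value), player 2's from +Exp(rate_p2); players alternate.
    Returns an array of shape (n_sims, n_steps + 1) of simulated values over time."""
    rng = np.random.default_rng(seed)
    deltas = np.empty((n_sims, n_steps))
    deltas[:, 0::2] = -rng.exponential(1.0 / rate_p1, size=(n_sims, n_steps // 2))
    deltas[:, 1::2] = rng.exponential(1.0 / rate_p2, size=(n_sims, n_steps // 2))
    return np.concatenate([np.zeros((n_sims, 1)), np.cumsum(deltas, axis=1)], axis=1)

for label, (lam1, lam2) in [("strong but imbalanced (10, 5)", (10.0, 5.0)),
                            ("weak but balanced     (2, 2)", (2.0, 2.0))]:
    values = simulate_errors(lam1, lam2)
    z = values[:, -1]                  # outcome of each simulation
    print(f"{label}: mean Monte-Carlo value = {z.mean():+.2f} "
          f"(minimax value is 0), std = {z.std():.2f}")
```

In the first case the per-move errors are small but do not cancel, so the mean simulation outcome drifts away from the minimax value; in the second case the per-move errors are large but cancel in expectation, so the mean stays near zero despite the wider spread of individual simulations.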