Monte-Carlo Tree Search As Regularized Policy Optimization

Jean-Bastien Grill*1, Florent Altché*1, Yunhao Tang*1,2, Thomas Hubert3, Michal Valko1, Ioannis Antonoglou3, Rémi Munos1
*Equal contribution. 1 DeepMind, Paris, FR; 2 Columbia University, New York, USA; 3 DeepMind, London, UK. Correspondence to: Jean-Bastien Grill <[email protected]>.
arXiv:2007.12509v1 [cs.LG], 24 Jul 2020. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.

Abstract

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

1. Introduction

Policy gradient is at the core of many state-of-the-art deep reinforcement learning (RL) algorithms. Among many successive improvements to the original algorithm (Sutton et al., 2000), regularized policy optimization encompasses a large family of such techniques. Among them, trust region policy optimization is a prominent example (Schulman et al., 2015; 2017; Abdolmaleki et al., 2018; Song et al., 2019). These algorithmic enhancements have led to significant performance gains in various benchmark domains (Song et al., 2019).

As another successful RL framework, the AlphaZero family of algorithms (Silver et al., 2016; 2017b;a; Schrittwieser et al., 2019) has obtained groundbreaking results on challenging domains by combining classical deep learning (He et al., 2016) and RL (Williams, 1992) techniques with Monte-Carlo tree search (Kocsis and Szepesvári, 2006). To search efficiently, the MCTS action selection criterion takes inspiration from bandits (Auer, 2002). Interestingly, AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016). The recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results on the Atari benchmarks (Bellemare et al., 2013).

Our main contribution is connecting MCTS algorithms, in particular the highly successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that comes with significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to fail, e.g., when per-search simulation budgets are low (Hamrick et al., 2020).

In Section 2, we briefly present MCTS with a focus on AlphaZero and provide a short summary of model-free policy optimization. In Section 3, we show that AlphaZero (and many other MCTS algorithms) computes approximate solutions to a family of regularized policy optimization problems. With this insight, Section 4 introduces a modified version of AlphaZero which leverages the benefits of the policy optimization formalism to improve upon the original algorithm. Finally, Section 5 shows that this modified algorithm outperforms AlphaZero on Atari games and continuous control tasks.

2. Background

Consider a standard RL setting tied to a Markov decision process (MDP) with state space X and action space A. At a discrete round t ≥ 0, the agent in state x_t ∈ X takes action a_t ∈ A given a policy a_t ∼ π(·|x_t), receives reward r_t, and transitions to a next state x_{t+1} ∼ p(·|x_t, a_t). The RL problem consists in finding a policy which maximizes the discounted cumulative return E_π[Σ_{t≥0} γ^t r_t] for a discount factor γ ∈ (0, 1). To scale the method to large environments, we assume that the policy π_θ(a|x) is parameterized by a neural network θ.

2.1. AlphaZero

We focus on the AlphaZero family, comprised of AlphaGo (Silver et al., 2016), AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2019), which are among the most successful algorithms in combining model-free and model-based RL. Although they make different assumptions, all of these methods share the same underlying search algorithm, which we refer to as AlphaZero for simplicity.

From a state x, AlphaZero uses MCTS (Browne et al., 2012) to compute an improved policy π̂(·|x) at the root of the search tree from the prior distribution predicted by a policy network π_θ(·|x) (the term prior follows Silver et al. (2017a) and does not relate to concepts in Bayesian statistics); see Eq. 3 for the definition. This improved policy is then distilled back into π_θ by updating θ as θ ← θ − η ∇_θ E_x[D(π̂(·|x), π_θ(·|x))] for a certain divergence D. In turn, the distilled parameterized policy π_θ informs the next local search by predicting priors, further improving the local policy over successive iterations. Therefore, such an algorithmic procedure is a special case of generalized policy improvement (Sutton and Barto, 1998).

One of the main differences between AlphaZero and previous MCTS algorithms such as UCT (Kocsis and Szepesvári, 2006) is the introduction of a learned prior π_θ and value function v_θ. Additionally, AlphaZero's search procedure applies the following action selection heuristic,

\arg\max_a \left[ Q(x,a) + c \cdot \pi_\theta(a|x) \cdot \frac{\sqrt{\sum_b n(x,b)}}{1 + n(x,a)} \right], \qquad (1)

where c is a numerical constant (Schrittwieser et al. (2019) use a c with a slowly-varying dependency on Σ_b n(x,b), which we omit here for simplicity, as was the case in Silver et al. (2017a)), n(x,a) is the number of times that action a has been selected from state x during search, and Q(x,a) is an estimate of the Q-function for state-action pair (x,a) computed from search statistics and using v_θ for bootstrapping.

Intuitively, this selection criterion balances exploration and exploitation, by selecting the most promising actions (high Q-value Q(x,a) and prior policy π_θ(a|x)) or actions that have rarely been explored (small visit count n(x,a)). We denote by N_sim the simulation budget, i.e., the search is run with N_sim simulations. A more detailed presentation of AlphaZero is in Appendix A; for a full description of the algorithm, refer to Silver et al. (2017a).
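To make the selection rule of Eq. 1 concrete, here is a minimal sketch (ours, not from the paper) that scores the children of a search node from its stored statistics; the array layout, function name, and the default exploration constant c = 1.25 are illustrative assumptions.

```python
import numpy as np

def alphazero_select(q, prior, visit_counts, c=1.25):
    """Pick the child action maximizing the heuristic of Eq. 1.

    q            -- Q-value estimates Q(x, a) from search statistics
    prior        -- prior probabilities pi_theta(a | x)
    visit_counts -- per-action visit counts n(x, a)
    c            -- exploration constant (a fixed scalar here; MuZero's
                    slowly-varying c is omitted, as in the text above)
    """
    total_visits = visit_counts.sum()
    # Exploration bonus: large for actions with high prior and few visits.
    scores = q + c * prior * np.sqrt(total_visits) / (1.0 + visit_counts)
    return int(np.argmax(scores))
```

During each simulation, a rule of this form would be applied at every node traversed from the root until a leaf is reached.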
2.2. Policy optimization

Policy optimization aims at finding a globally optimal policy π_θ, generally using iterative updates. Each iteration updates the current policy π_θ by solving a local maximization problem of the form

\pi_{\theta'} \triangleq \arg\max_{y \in \mathcal{S}} \left[ Q_{\pi_\theta}^\top y - \mathcal{R}(y, \pi_\theta) \right], \qquad (2)

where Q_{π_θ} is an estimate of the Q-function, S is the |A|-dimensional simplex, and R : S → ℝ is a convex regularization term (Neu et al., 2017; Grill et al., 2019; Geist et al., 2019). Intuitively, Eq. 2 updates π_θ to maximize the value Q_{π_θ}^⊤ y while constraining the update with the regularization term R(y, π_θ).

Without regularization, i.e., R = 0, Eq. 2 reduces to policy iteration (Sutton and Barto, 1998). When π_θ is updated using a single gradient ascent step towards the solution of Eq. 2, instead of using the solution directly, the above formulation reduces to (regularized) policy gradient (Sutton et al., 2000; Levine, 2018).

Interestingly, the regularization term has been found to stabilize, and possibly to speed up, the convergence of π_θ. For instance, trust region policy search algorithms (TRPO, Schulman et al., 2015; MPO, Abdolmaleki et al., 2018; V-MPO, Song et al., 2019) set R to be the KL-divergence between consecutive policies, KL[y, π_θ]; maximum entropy RL (Ziebart, 2010; Fox et al., 2015; O'Donoghue et al., 2016; Haarnoja et al., 2017) sets R to be the negative entropy of y to avoid collapsing to a deterministic policy.
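For these two common regularizers, Eq. 2 admits well-known closed-form solutions. The sketch below (our illustration; the function names and the particular choice R(y, π_θ) = λ KL(y ‖ π_θ) are assumptions, not notation from this paper) computes them for a single state, which may help make the local maximization concrete.

```python
import numpy as np

def solve_kl_regularized(q, prior, lam):
    """Exact maximizer of  q^T y - lam * KL(y || prior)  over the simplex.

    The solution is proportional to prior * exp(q / lam), the standard
    MPO/TRPO-style closed form.
    """
    logits = np.log(prior) + q / lam
    logits -= logits.max()          # subtract max for numerical stability
    y = np.exp(logits)
    return y / y.sum()

def solve_entropy_regularized(q, tau):
    """Maximizer of  q^T y + tau * H(y): a softmax of q / tau,
    i.e. the KL solution with a uniform prior."""
    uniform = np.full_like(q, 1.0 / len(q))
    return solve_kl_regularized(q, uniform, tau)
```

As λ (or τ) grows, the solution stays close to the prior (or to uniform); as it shrinks, the solution concentrates on the highest Q-value, mirroring the trade-off expressed by Eq. 2.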
3. MCTS as regularized policy optimization

In Section 2, we presented AlphaZero, which relies on model-based planning. We also presented policy optimization, a framework that has achieved good performance in model-free RL. In this section, we establish our main claim, namely that AlphaZero's action selection criterion can be interpreted as approximating the solution to a regularized policy-optimization objective.

3.1. Notation

First, let us define the empirical visit distribution π̂ as

\hat{\pi}(a|x) \triangleq \frac{1 + n(x,a)}{|\mathcal{A}| + \sum_b n(x,b)} \cdot \qquad (3)

Note that in Eq. 3, we consider an extra visit per action compared to the acting policy and distillation target in the original definition (Silver et al., 2016). This extra visit is introduced for convenience in the upcoming analysis (to avoid divisions by zero) and does not change the generality of our results.

We also define the multiplier λ_N as

\lambda_N(x) \triangleq c \cdot \frac{\sqrt{\sum_b n_b}}{|\mathcal{A}| + \sum_b n_b}, \qquad (4)

where the shorthand notation n_a is used for n(x,a), and N(x) ≜ Σ_b n_b denotes the number of visits to x during search. With this notation, the action selection formula of Eq. 1 can be written as selecting the action a⋆ such that

a^\star = \arg\max_a \left[ Q(x,a) + \lambda_N(x) \cdot \frac{\pi_\theta(a|x)}{\hat{\pi}(a|x)} \right].

Remark. The factor λ_N is a decreasing function of N. Asymptotically, λ_N = Õ(1/√N). Therefore, the influence of the regularization term decreases as the number of simulations increases, which makes π̄ (the exact solution of the regularized objective, introduced in the sequel) rely increasingly more on search Q-values and less on the policy prior π_θ.
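The rewriting above is an exact algebraic identity: λ_N(x) · π_θ(a|x) / π̂(a|x) simplifies to the exploration bonus of Eq. 1, since the (|A| + Σ_b n_b) terms cancel. The short numerical check below (our own illustration; the exploration constant and random search statistics are arbitrary) verifies this on synthetic values.

```python
import numpy as np

def visit_distribution(visit_counts):
    """Empirical visit distribution pi_hat of Eq. 3 (with the extra visit)."""
    return (1.0 + visit_counts) / (len(visit_counts) + visit_counts.sum())

def multiplier(visit_counts, c):
    """Multiplier lambda_N of Eq. 4."""
    total = visit_counts.sum()
    return c * np.sqrt(total) / (len(visit_counts) + total)

# Random search statistics for a node with 5 actions.
rng = np.random.default_rng(0)
q = rng.normal(size=5)                          # Q(x, a) estimates
prior = rng.dirichlet(np.ones(5))               # pi_theta(a | x)
n = rng.integers(0, 50, size=5).astype(float)   # visit counts n(x, a)
c = 1.25                                        # illustrative constant

eq1_scores = q + c * prior * np.sqrt(n.sum()) / (1.0 + n)
rewritten  = q + multiplier(n, c) * prior / visit_distribution(n)

assert np.allclose(eq1_scores, rewritten)          # identical scores
assert np.argmax(eq1_scores) == np.argmax(rewritten)  # same selected action
```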
