Monte-Carlo tree search as regularized policy optimization

Jean-Bastien Grill *1, Florent Altché *1, Yunhao Tang *1 2, Thomas Hubert 3, Michal Valko 1, Ioannis Antonoglou 3, Rémi Munos 1

*Equal contribution. 1 DeepMind, Paris, FR. 2 Columbia University, New York, USA. 3 DeepMind, London, UK. Correspondence to: Jean-Bastien Grill.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

1. Introduction

Policy gradient is at the core of many state-of-the-art deep reinforcement learning (RL) algorithms. Among many successive improvements to the original algorithm (Sutton et al., 2000), regularized policy optimization encompasses a large family of such techniques. Among them, trust region policy optimization is a prominent example (Schulman et al., 2015; 2017; Abdolmaleki et al., 2018; Song et al., 2019). These algorithmic enhancements have led to significant performance gains in various benchmark domains (Song et al., 2019).

As another successful RL framework, the AlphaZero family of algorithms (Silver et al., 2016; 2017b;a; Schrittwieser et al., 2019) has obtained groundbreaking results on challenging domains by combining classical deep learning (He et al., 2016) and RL (Williams, 1992) techniques with Monte-Carlo tree search (Kocsis and Szepesvári, 2006). To search efficiently, the MCTS action selection criterion takes inspiration from bandits (Auer, 2002). Interestingly, AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016). The recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results on the Atari benchmarks (Bellemare et al., 2013).

Our main contribution is connecting MCTS algorithms, in particular the highly-successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that comes with significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to fail, e.g., when per-search simulation budgets are low (Hamrick et al., 2020).

In Section 2, we briefly present MCTS with a focus on AlphaZero and provide a short summary of model-free policy optimization. In Section 3, we show that AlphaZero (and many other MCTS algorithms) computes approximate solutions to a family of regularized policy optimization problems. With this insight, Section 4 introduces a modified version of AlphaZero which leverages the benefits of the policy optimization formalism to improve upon the original algorithm. Finally, Section 5 shows that this modified algorithm outperforms AlphaZero on Atari games and continuous control tasks.

2. Background

Consider a standard RL setting tied to a Markov decision process (MDP) with state space X and action space A. At a discrete round t ≥ 0, the agent in state x_t ∈ X takes action a_t ∈ A given a policy a_t ∼ π(·|x_t), receives reward r_t, and transitions to a next state x_{t+1} ∼ p(·|x_t, a_t). The RL problem consists in finding a policy which maximizes the discounted cumulative return E_π[Σ_{t≥0} γ^t r_t] for a discount factor γ ∈ (0, 1). To scale the method to large environments, we assume that the policy πθ(a|x) is parameterized by a neural network θ.
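As a concrete reading of this objective, the short sketch below estimates the discounted return E_π[Σ_{t≥0} γ^t r_t] by Monte-Carlo rollouts. It is only an illustration under a generic `env`/`policy` interface that we assume here (reset/step/callable policy); it is not part of the algorithms studied in this paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, gamma=0.99, episodes=100):
    """Monte-Carlo estimate of E_pi[sum_t gamma^t r_t].

    Assumed interface: env.reset() -> state,
    env.step(a) -> (state, reward, done), policy(state) -> action.
    """
    returns = []
    for _ in range(episodes):
        state, rewards, done = env.reset(), [], False
        while not done:
            state, reward, done = env.step(policy(state))
            rewards.append(reward)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```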

2.1. AlphaZero

We focus on the AlphaZero family, comprised of AlphaGo (Silver et al., 2016), AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2019), which are among the most successful algorithms in combining model-free and model-based RL. Although they make different assumptions, all of these methods share the same underlying search algorithm, which we refer to as AlphaZero for simplicity.

From a state x, AlphaZero uses MCTS (Browne et al., 2012) to compute an improved policy π̂(·|x) at the root of the search tree from the prior distribution predicted by a policy network πθ(·|x)¹; see Eq. 3 for the definition. This improved policy is then distilled back into πθ by updating θ as θ ← θ − η∇θ E_x[D(π̂(·|x), πθ(·|x))] for a certain divergence D. In turn, the distilled parameterized policy πθ informs the next local search by predicting priors, further improving the local policy over successive iterations. Therefore, such an algorithmic procedure is a special case of generalized policy improvement (Sutton and Barto, 1998).

One of the main differences between AlphaZero and previous MCTS algorithms such as UCT (Kocsis and Szepesvári, 2006) is the introduction of a learned prior πθ and value function vθ. Additionally, AlphaZero's search procedure applies the following action selection heuristic,

    arg max_a [ Q(x, a) + c · πθ(a|x) · √(Σ_b n(x, b)) / (1 + n(x, a)) ],   (1)

where c is a numerical constant,² n(x, a) is the number of times that action a has been selected from state x during search, and Q(x, a) is an estimate of the Q-function for state-action pair (x, a) computed from search statistics and using vθ for bootstrapping.

Intuitively, this selection criterion balances exploration and exploitation, by selecting the most promising actions (high Q-value Q(x, a) and prior policy πθ(a|x)) or actions that have rarely been explored (small visit count n(x, a)). We denote by Nsim the simulation budget, i.e., the search is run with Nsim simulations. A more detailed presentation of AlphaZero is in Appendix A; for a full description of the algorithm, refer to Silver et al. (2017a).

2.2. Policy optimization

Policy optimization aims at finding a globally optimal policy πθ, generally using iterative updates. Each iteration updates the current policy πθ by solving a local maximization problem of the form

    πθ' ≜ arg max_{y ∈ S} [ Qπθᵀ y − R(y, πθ) ],   (2)

where Qπθ is an estimate of the Q-function, S is the |A|-dimensional simplex and R : S² → R is a convex regularization term (Neu et al., 2017; Grill et al., 2019; Geist et al., 2019). Intuitively, Eq. 2 updates πθ to maximize the value Qπθᵀ y while constraining the update with the regularization term R(y, πθ).

Without regularization, i.e., R = 0, Eq. 2 reduces to policy iteration (Sutton and Barto, 1998). When πθ is updated using a single gradient ascent step towards the solution of Eq. 2, instead of using the solution directly, the above formulation reduces to (regularized) policy gradient (Sutton et al., 2000; Levine, 2018).

Interestingly, the regularization term has been found to stabilize, and possibly to speed up, the convergence of πθ. For instance, trust region policy search algorithms (TRPO, Schulman et al., 2015; MPO, Abdolmaleki et al., 2018; V-MPO, Song et al., 2019) set R to be the KL-divergence between consecutive policies, KL[y, πθ]; maximum entropy RL (Ziebart, 2010; Fox et al., 2015; O'Donoghue et al., 2016; Haarnoja et al., 2017) sets R to be the negative entropy of y to avoid collapsing to a deterministic policy.

¹ We note here that terminologies such as prior follow Silver et al. (2017a) and do not relate to concepts in Bayesian statistics.
² Schrittwieser et al. (2019) uses a c that has a slow-varying dependency on Σ_b n(x, b), which we omit here for simplicity, as was the case in Silver et al. (2017a).

3. MCTS as regularized policy optimization

In Section 2, we presented AlphaZero, which relies on model-based planning. We also presented policy optimization, a framework that has achieved good performance in model-free RL. In this section, we establish our main claim, namely that AlphaZero's action selection criterion can be interpreted as approximating the solution to a regularized policy-optimization objective.

3.1. Notation

First, let us define the empirical visit distribution π̂ as

    π̂(a|x) ≜ (1 + n(x, a)) / (|A| + Σ_b n(x, b)).   (3)

Note that in Eq. 3, we consider an extra visit per action compared to the acting policy and distillation target in the original definition (Silver et al., 2016). This extra visit is introduced for convenience in the upcoming analysis (to avoid divisions by zero) and does not change the generality of our results.

We also define the multiplier λ_N as

    λ_N(x) ≜ c · √(Σ_b n_b) / (|A| + Σ_b n_b),   (4)

where the shorthand notation n_a is used for n(x, a), and N(x) ≜ Σ_b n_b denotes the number of visits to x during search. With this notation, the action selection formula of Eq. 1 can be written as selecting the action a⋆ such that

    a⋆(x) ≜ arg max_a [ Q(x, a) + λ_N · πθ(a|x) / π̂(a|x) ].   (5)

Note that in Eq. 5 and in the rest of the paper (unless otherwise specified), we use Q to denote the search Q-values, i.e., those estimated by the search algorithm as presented in Section 2.1. For more compact notation, we use bold fonts to denote vector quantities, with the convention that (u/v)[a] = u[a]/v[a] for two vectors u and v with the same dimension. Additionally, we omit the dependency of quantities on state x when the context is clear. In particular, we use q ∈ R^|A| to denote the vector of search Q-values, such that q_a = Q(x, a). With this notation, we can rewrite the action selection formula of Eq. 5 simply as³

    a⋆ ≜ arg max [ q + λ_N · πθ / π̂ ].   (6)

³ When the context is clear, we simplify, for any x ∈ R^|A|, arg max[x] ≜ arg max_a {x[a], a ∈ A}.
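To make the correspondence between Eq. 1 and Eq. 5 concrete, the sketch below computes π̂, λ_N, and the selected action from raw visit counts. It is a minimal illustration of the formulas above under our own naming (`visit_counts`, `prior`, a default constant `c`), not the production implementation of AlphaZero.

```python
import numpy as np

def empirical_visit_distribution(visit_counts):
    """pi_hat(a|x) = (1 + n(x, a)) / (|A| + sum_b n(x, b)), as in Eq. 3."""
    n = np.asarray(visit_counts, dtype=float)
    return (1.0 + n) / (len(n) + n.sum())

def multiplier_lambda(visit_counts, c=1.25):
    """lambda_N(x) = c * sqrt(sum_b n_b) / (|A| + sum_b n_b), as in Eq. 4."""
    n = np.asarray(visit_counts, dtype=float)
    return c * np.sqrt(n.sum()) / (len(n) + n.sum())

def select_action(q_values, prior, visit_counts, c=1.25):
    """AlphaZero's selection rule, written in the form of Eq. 5.

    Identical to Eq. 1 up to the reparameterization through pi_hat:
    argmax_a [ Q(x, a) + lambda_N * prior(a) / pi_hat(a) ].
    """
    q = np.asarray(q_values, dtype=float)
    pi_hat = empirical_visit_distribution(visit_counts)
    lam = multiplier_lambda(visit_counts, c)
    return int(np.argmax(q + lam * np.asarray(prior, dtype=float) / pi_hat))
```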

3.2. A related regularized policy optimization problem

We now define π̄ as the solution to a regularized policy optimization problem; we will see in the next subsection that the visit distribution π̂ is a good approximation of π̄.

Definition 1 (π̄). Let π̄ be the solution to the following objective

    π̄ ≜ arg max_{y ∈ S} [ qᵀ y − λ_N KL[πθ, y] ],   (7)

where S is the |A|-dimensional simplex and KL is the KL-divergence.⁴

We can see from Eq. 2 and Definition 1 that π̄ is the solution to a policy optimization problem where Q is set to the search Q-values, and the regularization term R is a reversed KL-divergence weighted by the factor λ_N.

In addition, note that π̄ is a smooth version of the arg max associated to the search Q-values q. In fact, π̄ can be computed as (Appendix B.3 gives a detailed derivation of π̄)

    π̄ = λ_N · πθ / (α − q),   (8)

where α ∈ R is such that π̄ is a proper probability vector. This is slightly different from the softmax distribution obtained with KL[y, πθ], which is written as

    arg max_{y ∈ S} [ qᵀ y − λ_N KL[y, πθ] ] ∝ πθ · exp(q / λ_N).

Remark: The factor λ_N is a decreasing function of N. Asymptotically, λ_N = Õ(1/√N). Therefore, the influence of the regularization term decreases as the number of simulations increases, which makes π̄ rely increasingly more on the search Q-values q and less on the policy prior πθ. As we explain next, λ_N follows the design choice of AlphaZero, and may be justified by a similar choice made in bandits (Bubeck et al., 2012).

⁴ We apply the definition KL[x, y] ≜ Σ_a x[a] log(x[a]/y[a]).
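Eq. 8 reduces the |A|-dimensional problem of Definition 1 to a one-dimensional search for the normalizing constant α. Below is a minimal sketch of one way to perform this computation, assuming πθ > 0 and λ_N > 0 and using a simple bisection on α over the bracket [max_a(q_a + λ_N πθ(a)), max_a q_a + λ_N], on which the total mass Σ_a λ_N πθ(a)/(α − q_a) decreases monotonically; the paper's Appendix B.3 describes the procedure actually used.

```python
import numpy as np

def solve_pi_bar(q_values, prior, lam, tol=1e-12):
    """Compute pi_bar = lam * prior / (alpha - q) with sum(pi_bar) = 1 (Eq. 8).

    Assumes prior > 0 and lam > 0. alpha is found by bisection on the
    bracket [max(q + lam * prior), max(q) + lam]: the total mass
    sum_a lam * prior_a / (alpha - q_a) is >= 1 at the lower end,
    <= 1 at the upper end, and decreases monotonically in alpha.
    """
    q = np.asarray(q_values, dtype=float)
    p = np.asarray(prior, dtype=float)

    def mass(alpha):
        return np.sum(lam * p / (alpha - q))

    lo = np.max(q + lam * p)   # mass(lo) >= 1
    hi = np.max(q) + lam       # mass(hi) <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    pi_bar = lam * p / (alpha - q)
    return pi_bar / pi_bar.sum()   # remove residual bisection error
```

For example, with q = [1.0, 0.5], πθ = [0.5, 0.5] and λ_N = 1, this returns roughly [0.62, 0.38]: mass is shifted toward the higher-valued action while staying anchored to the prior.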

3.3. AlphaZero as policy optimization

We now analyze the action selection formula of AlphaZero (Eq. 1). Interestingly, we show that this formula, which was handcrafted independently of the policy optimization research,⁵ turns out to result in a distribution π̂ that closely relates to the policy optimization solution π̄.

The main formal claim of this section is that AlphaZero's search policy π̂ tracks the exact solution π̄ of the regularized policy optimization problem of Definition 1. Proposition 1 and Proposition 2 support this claim from two complementary perspectives.

First, with Proposition 1, we show that π̂ approximately follows the gradient of the concave objective for which π̄ is the optimum.

Proposition 1. For any action a ∈ A, visit count n ∈ R^A, policy prior πθ > 0 and Q-values q,

    a⋆ = arg max_a (∂/∂n_a) [ qᵀ π̂ − λ_N KL[πθ, π̂] ],   (9)

with a⋆ the action realizing Eq. 1 as defined in Eq. 5, and π̂ = (1 + n)/(|A| + Σ_b n_b) as defined in Eq. 3, seen as a function of the count vector extended to real values.

The only thing that the search algorithm eventually influences through the tree search is the visit count distribution. If we could do an infinitesimally small update, then the greedy update would be in the direction of the partial derivative in Eq. 9. However, as we are restricted to a discrete update, increasing the visit count as in Proposition 1 makes π̂ track π̄. Below, we further characterize the selected action a⋆, assuming πθ > 0.

Proposition 2. The action a⋆ realizing Eq. 1 is such that

    π̂(a⋆|x) ≤ π̄(a⋆|x).   (10)

To acquire intuition from Proposition 2, note that once a⋆ is selected, its count n_{a⋆} increases and so does the total count N. As a result, π̂(a⋆) increases (in the order of O(1/N)) and further approximates π̄(a⋆). As such, Proposition 2 shows that the action selection formula encourages the shape of π̂ to be close to that of π̄, until in the limit the two distributions coincide.

Note that Proposition 1 and Proposition 2 are a special case of a more general result that we formally prove in Appendix D.1. In this particular case, the proof relies on noticing that

    arg max_a [ q_a + c · πθ(a) · √(Σ_b n_b) / (1 + n_a) ]                (1)
      = arg max_a [ πθ(a) · ( 1/π̂(a) − 1/π̄(a) ) ].                      (11)

Then, since Σ_a π̂(a) = Σ_a π̄(a), π̂ > 0 and π̄ > 0, there exists at least one action for which 0 < π̂(a) ≤ π̄(a), i.e., 1/π̂(a) − 1/π̄(a) ≥ 0.

To state a formal result on π̂ approximating π̄, in Appendix D.3 we expand the conclusion under the assumption that π̄ is a constant. In this case we can derive a bound on the convergence rate of these two distributions as N increases over the search,

    ||π̄ − π̂||_∞ ≤ (|A| − 1) / (|A| + N),   (12)

with O(1/N) matching the lowest possible approximation error (see Appendix D.3) among discrete distributions of the form (k_i/N)_i for k_i ∈ N.

⁵ Nonetheless, this heuristic could be interpreted as loosely inspired by bandits (Rosin, 2011), but was adapted to accommodate a prior term πθ.
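As a quick worked reading of Eq. 12 (our own numbers, not results from the paper's experiments): with the Atari action set |A| = 18, a budget of N = 50 simulations gives ||π̄ − π̂||_∞ ≤ 17/68 = 0.25, whereas a larger budget of N = 800 gives 17/818 ≈ 0.021. The guarantee on how finely π̂ can match π̄ is thus an order of magnitude looser in the low-budget regime studied in Section 5.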

3.4. Generalization to common MCTS algorithms

Besides AlphaZero, UCT (Kocsis and Szepesvári, 2006) is another heuristic with a selection criterion inspired by UCB, defined as

    arg max_a [ q_a + c · √( log(Σ_b n_b) / (1 + n_a) ) ].   (13)

Contrary to AlphaZero, the standard UCT formula does not involve a prior policy. In this section, we consider a slightly modified version of UCT with a (learned) prior πθ, as defined in Eq. 14. By setting the prior πθ to the uniform distribution, we recover the original UCT formula,

    arg max_a [ q_a + c · πθ(a) · √( log(Σ_b n_b) / (1 + n_a) ) ].   (14)

Using the same reasoning as in Section 3.3, we now show that this modified UCT formula also tracks the solution to a regularized policy optimization problem, thus generalizing our result to commonly used MCTS algorithms.

First, we introduce π̄_UCT, which is tracked by the UCT visit distribution, as

    π̄_UCT ≜ arg max_{y ∈ S} [ qᵀ y − λ_N^UCT · D(y, πθ) ],   (15)

where D(x, y) ≜ 2 − 2 Σ_i √(x_i · y_i) is an f-divergence⁶ and

    λ_N^UCT(x) ≜ c · √( N · log(Σ_b n_b) ) / ( |A| + Σ_b n_b ).

Similar to AlphaZero, λ_N^UCT behaves⁷ as Õ(1/√N), and therefore the regularization gets weaker as N increases. We can also derive tracking properties between π̄_UCT and the UCT empirical visit distribution π̂_UCT, as we did for AlphaZero in the previous section, with Proposition 3; as in the previous section, this is a special case of the general result with any f-divergence in Appendix D.1.

Proposition 3. We have that

    arg max_a [ q_a + c · πθ(a) · √( log(Σ_b n_b) / (1 + n_a) ) ]
      = arg max_a [ √πθ(a) · ( 1/√π̂(a) − 1/√π̄_UCT(a) ) ],

and

    a⋆_UCT = arg max_a (∂/∂n_a) [ qᵀ π̂ − λ_N^UCT · D[πθ, π̂] ].   (16)

To sum up, similar to the previous section, we show that UCT's search policy π̂_UCT tracks the exact solution π̄_UCT of the regularized policy optimization problem of Eq. 15.

⁶ In particular, D(x, y) ≥ 0, D(x, y) = 0 implies x = y, and D(x, y) is jointly convex in x and y (Csiszár, 1964; Liese and Vajda, 2006).
⁷ We ignore logarithmic terms.

4. Algorithmic benefits

In Section 3, we introduced a distribution π̄ as the solution to a regularized policy optimization problem. We then showed that AlphaZero, along with general MCTS algorithms, selects actions such that the empirical visit distribution π̂ actively approximates π̄. Building on this insight, below we argue that π̄ is preferable to π̂, and we propose three complementary algorithmic changes to AlphaZero.

4.1. Advantages of using π̄ over π̂

MCTS algorithms produce Q-values as a by-product of the search procedure. However, MCTS does not directly use the search Q-values to compute the policy, but instead uses the visit distribution π̂ (search Q-values implicitly influence π̂ by guiding the search). We postulate that this degrades performance, especially at low simulation budgets Nsim, for several reasons:

1. When a promising new (high-value) leaf is discovered, many additional simulations might be needed before this information is reflected in π̂; since π̄ is directly computed from Q-values, this information is updated instantly.

2. By definition (Eq. 3), π̂ is the ratio of two integers and has limited expressiveness when Nsim is low, which might lead to a sub-optimal policy; π̄ does not have this constraint.

3. The prior πθ is trained against the target π̂, but the latter is only improved for actions that have been sampled at least once during search. Due to the deterministic action selection (Eq. 1), this may be problematic for certain actions that would require a large simulation budget to be sampled even once.

Figure 1. Comparison of the score (median over 3 seeds) of MuZero (red: using π̂) and ALL (blue: using π̄) after 100k learner steps, as a function of Nsim, on Cheetah Run of the Control Suite.

The above downsides cause MCTS to be highly sensitive to the simulation budget Nsim. When Nsim is high relative to the branching factor of the tree search, i.e., the number of actions, MCTS algorithms such as AlphaZero perform well. However, this performance drops significantly when Nsim is low, as shown by Hamrick et al. (2020); see also, e.g., Figure 3.D of Schrittwieser et al. (2019).

We illustrate the effect of simulation budgets in Figure 1, where the x-axis shows the budget Nsim and the y-axis shows the episodic performance of algorithms applying π̂ vs. π̄; see the details of these algorithms in the following sections. We see that π̂ is highly sensitive to the simulation budget while π̄ performs consistently well across all budget values.

4.2. Proposed improvements to AlphaZero

We have pointed out potential issues due to π̂. We now detail how to use π̄ as a replacement to resolve such issues.⁸ Appendix B.3 shows how to compute π̄ in practice.

ACT: acting with π̄. AlphaZero acts in the real environment by sampling actions according to a ∼ π̂(·|x_root). Instead, we propose to sample actions according to a ∼ π̄(·|x_root). We label this variant as ACT.

SEARCH: searching with π̄. During search, we propose to stochastically sample actions according to π̄ instead of the deterministic action selection rule of Eq. 1. At each node x in the tree, π̄(·) is computed from the Q-values and total visit counts at the node, based on Definition 1. We label this variant as SEARCH.

LEARN: learning with π̄. AlphaZero computes a locally improved policy with tree search and distills such an improved policy into πθ. We propose to use π̄ as the target policy in place of π̂ to train our prior policy. As a result, the parameters are updated as

    θ ← θ − η∇θ E_{x_root}[ KL[ π̄(·|x_root), πθ(·|x_root) ] ],   (17)

where x_root is sampled from a prioritized replay buffer as in AlphaZero. We label this variant as LEARN.

ALL: combining them all. We refer to the combination of these three independent variants as ALL. Appendix B provides additional implementation details.

Remark. Note that AlphaZero entangles search and learning, which is not desirable. For example, when the action selection formula changes, this impacts not only intermediate search results but also the root visit distribution π̂(·|x_root), which is also the learning target for πθ. However, the LEARN variant partially disentangles these components. Indeed, the new learning target is π̄(·|x_root), which is computed from search Q-values, rendering it less sensitive to, e.g., the action selection formula.

⁸ Recall that we have identified three issues. Each algorithmic variant below helps in addressing issues 1 and 2. Furthermore, the LEARN variant helps address issue 3.
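As an illustration of where π̄ replaces π̂ in these variants, the sketch below builds the LEARN target of Eq. 17 from root search statistics and evaluates the corresponding KL loss. It assumes a π̄ solver such as the bisection sketch of Section 3.2 and at least one completed simulation (so that λ_N > 0); names like `root_counts` and `solve_pi_bar` are our own, not MuZero's API.

```python
import numpy as np

def pi_bar_target(root_q, root_prior, root_counts, solve_pi_bar, c=1.25):
    """LEARN variant: distillation target pi_bar(.|x_root) built from
    root search Q-values, prior and visit counts (Definition 1 / Eq. 8)."""
    n = np.asarray(root_counts, dtype=float)
    lam = c * np.sqrt(n.sum()) / (len(n) + n.sum())   # multiplier of Eq. 4
    return solve_pi_bar(root_q, root_prior, lam)

def learn_loss(pi_bar, pi_theta, eps=1e-12):
    """KL[pi_bar, pi_theta], whose gradient in theta drives the update of Eq. 17;
    in practice it is minimized by SGD over roots sampled from the replay buffer."""
    pi_bar = np.asarray(pi_bar, dtype=float)
    pi_theta = np.asarray(pi_theta, dtype=float)
    return float(np.sum(pi_bar * (np.log(pi_bar + eps) - np.log(pi_theta + eps))))

# The ACT variant simply samples the environment action from the same target:
# a = np.random.choice(len(pi_bar), p=pi_bar), instead of sampling from pi_hat.
```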
4.3. Connections between AlphaZero and model-free policy optimization

Next, we make explicit the link between the proposed algorithmic variants and existing policy optimization algorithms. First, we provide two complementary interpretations.

LEARN as policy optimization. For this interpretation, we treat SEARCH as a blackbox, i.e., a subroutine that takes a root node x and returns statistics such as search Q-values. Recall that policy optimization (Eq. 2) maximizes the objective ≈ Qπθᵀ y with the local policy y. There are many model-free methods for the estimation of Qπθ, ranging from Monte-Carlo estimates of cumulative returns, Qπθ ≈ Σ_{t≥0} γ^t r_t (Schulman et al., 2015; 2017), to using predictions from a Q-value critic Qπθ ≈ qθ trained with off-policy samples (Abdolmaleki et al., 2018; Song et al., 2019). When solving for π̄ in the update (Eq. 17), we can interpret LEARN as a policy optimization algorithm using tree search to estimate Qπθ. Indeed, LEARN could be interpreted as building a Q-function critic qθ with a tree-structured inductive bias.⁹ However, this inductive bias is not built into a network architecture (Silver et al., 2017c; Farquhar et al., 2017; Oh et al., 2017; Guez et al., 2018), but constructed online by an algorithm, i.e., MCTS. Next, LEARN computes the locally optimal policy π̄ of the regularized policy optimization objective and distills π̄ into πθ. This is exactly the approach taken by MPO (Abdolmaleki et al., 2018).

Figure 2. Comparison of median scores of MuZero (red) and ALL (blue) at Nsim = 5 (dotted line) and Nsim = 50 (solid line) simulations per step on Ms Pacman (Atari). Averaged across 8 seeds.

SEARCH as policy optimization. We now unpack the algorithmic procedure of the tree search, and show that it can also be interpreted as policy optimization.

During the forward simulation phase of SEARCH, the action at each node x is selected by sampling a ∼ π̄(·|x). As a result, the full imaginary trajectory is generated consistently according to policy π̄. During backward updates, each encountered node x receives a backup value from its child node, which is an exact estimate of Qπ̄(x, a). Finally, the local policy π̄(·|x) is updated by solving the constrained optimization problem of Definition 1, leading to an improved policy over the previous π̄(·|x). Overall, with Nsim simulated trajectories, SEARCH optimizes the root policy π̄(·|x_root) and the root search Q-values by carrying out Nsim sequences of MPO-style updates across the entire tree.¹⁰ A highly related approach is to update local policies via policy gradients (Anthony et al., 2019).

By combining the above two interpretations, we see that the ALL variant is very similar to a full policy optimization algorithm. Specifically, on a high level, ALL carries out MPO updates with search Q-values. These search Q-values are themselves also obtained via MPO-style updates within the tree search. This paves the way to our major revelation, stated next.

Observation 1. ALL can be interpreted as regularized policy optimization. Further, since π̂ approximates π̄, AlphaZero and other MCTS algorithms can be interpreted as approximate regularized policy optimization.
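To make the in-tree step described above concrete, here is a minimal sketch of the SEARCH variant's child selection: the next child is sampled from π̄ computed at the node instead of applying the deterministic rule of Eq. 1. The node fields and the `solve_pi_bar` helper are illustrative assumptions of ours (see the bisection sketch of Section 3.2), not MuZero's internal data structures.

```python
import numpy as np

def search_select_child(node, solve_pi_bar, c=1.25):
    """SEARCH variant: sample the next action a ~ pi_bar(.|node).

    `node` is assumed to expose per-action search Q-values, the prior
    pi_theta, and visit counts; lambda_N follows Eq. 4 and requires at
    least one previous visit so that lambda_N > 0.
    """
    q = np.asarray(node.q_values, dtype=float)
    prior = np.asarray(node.prior, dtype=float)
    n = np.asarray(node.visit_counts, dtype=float)
    lam = c * np.sqrt(n.sum()) / (len(n) + n.sum())
    pi_bar = solve_pi_bar(q, prior, lam)
    return int(np.random.choice(len(pi_bar), p=pi_bar))
```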

5. Experiments

In this section, we aim to address several questions: (1) How sensitive are state-of-the-art hybrid algorithms such as AlphaZero to low simulation budgets, and can the ALL variant provide a more robust alternative? (2) Which changes among ACT, SEARCH, and LEARN are most critical to this variant's performance? (3) How does the performance of the ALL variant compare with AlphaZero in environments with large branching factors?

Baseline algorithm. Throughout the experiments, we take MuZero (Schrittwieser et al., 2019) as the baseline algorithm. As a variant of AlphaZero, MuZero applies tree search in learned models instead of real environments, which makes it applicable to a wider range of problems. Since MuZero shares the same search procedure as AlphaGo, AlphaGo Zero, and AlphaZero, we expect the performance gains to be transferable to these algorithms. Note that the results below were obtained with a scaled-down version of MuZero, which is described in Appendix B.1.

Hyper-parameters. The hyper-parameters of the algorithms are tuned to achieve the maximum possible performance for baseline MuZero on the Ms Pacman level of the Atari suite (Bellemare et al., 2013), and are identical in all experiments with the exception of the number of simulations per step Nsim.¹¹ In particular, no further tuning was required for the LEARN, SEARCH, ACT, and ALL variants, as was expected from the theoretical considerations of Section 3.

⁹ During search, because child nodes have fewer simulations than the root, the Q-function estimate at the root slightly under-estimates the acting policy Q-function.
¹⁰ Note that there are several differences from typical model-free implementations of policy optimization: most notably, unlike a fully-parameterized policy, the tree search policy is tabular at each node. This also entails that the MPO-style distillation is exact.
¹¹ The number of actors is scaled linearly with Nsim to maintain the same total number of generated frames per second.

Figure 3. Comparison of median score (solid lines) over 6 seeds of MuZero and ALL on four Atari games ((a) Alien, (b) Asteroids, (c) Gravitar, (d) Seaquest) with Nsim = 50. The shaded areas correspond to the range of the best and worst seed. ALL (blue) performs consistently better than MuZero (red).

5.1. Search with low simulation budgets

Since AlphaZero solely relies on π̂ for training targets, it may misbehave when the simulation budget Nsim is low. On the other hand, our new algorithmic variants might perform better in this regime. To confirm these hypotheses, we compare the performance of MuZero and the ALL variant on the Ms Pacman level of the Atari suite at different levels of simulation budgets.

Figure 4. Ablation study at 5 (panel a) and 50 (panel b) simulations per step on Ms Pacman (Atari); average across 8 seeds.

Result. In Figure 2, we compare the episodic return of ALL vs. MuZero averaged over 8 seeds, with simulation budgets of Nsim = 5 and Nsim = 50 for an action set of size |A| ≤ 18; thus, we consider that Nsim = 5 and Nsim = 50 respectively correspond to a low and a high simulation budget relative to the number of actions. We make several observations: (1) At a relatively high simulation budget, Nsim = 50, the same as Schrittwieser et al. (2019), both MuZero and ALL exhibit reasonably close levels of performance, though ALL obtains marginally better performance than MuZero; (2) At a low simulation budget, Nsim = 5, though both algorithms suffer in performance relative to high budgets, ALL significantly outperforms MuZero both in terms of learning speed and asymptotic performance; (3) Figure 6 in Appendix C.1 shows that this behavior is consistently observed at intermediate simulation budgets, with the two algorithms starting to reach comparable levels of performance when Nsim ≥ 24 simulations. These observations confirm the intuitions from Section 3. (4) We provide results on a subset of Atari games in Figure 3, which show that the performance gains due to π̄ over π̂ are also observed in levels other than Ms Pacman; see Appendix C.2 for results on additional levels. This subset of levels is selected based on the experiment setup in Figure S1 of Schrittwieser et al. (2019). Importantly, note that the performance gains of ALL are consistently significant across the selected levels, even at the higher simulation budget of Nsim = 50.

5.2. Ablation study

To better understand which component of ALL contributes the most to the performance gains, Figure 4 presents the results of an ablation study where we compare the individual components LEARN, SEARCH, and ACT.

Result. The comparison is shown in Figure 4; we make several observations: (1) At Nsim = 5 (Figure 4a), the main improvement comes from using the policy optimization solution π̄ as the learning target (LEARN variant); using π̄ during search or acting leads to an additional marginal improvement; (2) Interestingly, we observe a different behavior at Nsim = 50 (Figure 4b). In this case, using π̄ for learning or acting does not lead to a noticeable improvement. However, the superior performance of ALL is mostly due to sampling according to π̄ during search (SEARCH).

The improved performance when using π̄ as the learning target (LEARN) illustrates the theoretical considerations of Section 3: at low simulation budgets, the discretization noise in π̂ makes it a worse training target than π̄, but this advantage vanishes when the number of simulations per step increases. As predicted by the theoretical results of Section 3, learning and acting using π̄ and π̂ become equivalent when the simulation budget increases.

On the other hand, we see a slight but significant improvement when sampling the next node according to π̄ during search (SEARCH), regardless of the simulation budget. This could be explained by the fact that even at high simulation budgets, the SEARCH modification also affects deeper nodes, which have fewer simulations.

Figure 5. Comparison of the median score over 3 seeds of MuZero (red) and ALL (blue) at 4 (dotted) and 50 (solid line) simulations per step on Cheetah Run (Control Suite).
5.3. Search with large action spaces: continuous control

The previous results confirm the intuitions presented in Sections 3 and 4; namely, the ALL variation greatly improves performance at low simulation budgets, and obtains marginally higher performance at high simulation budgets. Since simulation budgets are relative to the number of actions, these improvements are critical in tasks with a high number of actions, where MuZero might require a prohibitively high simulation budget; prior work (Dulac-Arnold et al., 2015; Metz et al., 2017; Van de Wiele et al., 2020) has already identified continuous control tasks as an interesting testbed.

Benchmarks. We select high-dimensional environments from the DeepMind Control Suite (Tassa et al., 2018). The observations are images and the action space is A = [−1, 1]^m with m dimensions. We apply an action discretization method similar to that of Tang and Agrawal (2019). In short, for a continuous action space with m dimensions, each dimension is discretized into K evenly spaced atomic actions. With proper parameterization of the policy network (see, e.g., Appendix B.2), we can reduce the effective branching factor to Km ≪ K^m, though this still results in a much larger action space than the Atari benchmarks. In Appendix C.2, we provide additional descriptions of the tasks.

Result. In Figure 5, we compare MuZero with the ALL variant on the Cheetah Run environment of the DeepMind Control Suite (Tassa et al., 2018). We evaluate the performance at low (Nsim = 4), medium (Nsim = 12) and "high" (Nsim = 50) simulation budgets, for an effective action space of size 30 (m = 6, K = 5). The horizontal line corresponds to the performance of model-free D4PG also trained on pixel observations (Barth-Maron et al., 2018), as reported in Tassa et al. (2018). Appendix C.2 provides experimental results on additional tasks. We again observe that ALL outperforms the original MuZero at low simulation budgets and still achieves faster convergence to the same asymptotic performance with more simulations. Figure 1 compares the asymptotic performance of MuZero and ALL as a function of the simulation budget at 100k learner steps.

6. Conclusion

In this paper, we showed that the action selection formula used in MCTS algorithms, most notably AlphaZero, approximates the solution to a regularized policy optimization problem formulated with search Q-values. From this theoretical insight, we proposed variations of the original AlphaZero algorithm by explicitly using the exact policy optimization solution instead of the approximation. We showed experimentally that these variants achieve much higher performance at low simulation budgets, while also providing statistically significant improvements when this budget increases.

Our analysis of the behavior of model-based algorithms (i.e., MCTS) has made explicit connections to model-free algorithms. We hope that this sheds light on new ways of combining both paradigms and opens doors to future ideas and improvements.

Acknowledgements

The authors would like to thank Alaa Saade, Bernardo Avila Pires, Bilal Piot, Corentin Tallec, Daniel Guo, David Silver, Eugene Tarassov, Florian Strub, Jessica Hamrick, Julian Schrittwieser, Katrina McKinney, Mohammad Gheshlaghi Azar, Nathalie Beauguerlange, Pierre Ménard, Shantanu Thakoor, Théophane Weber, Thomas Mesnard, Toby Pohlen, and the rest of the DeepMind team.

References

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.

Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3-20.

Anthony, T., Nishihara, R., Moritz, P., Salimans, T., and Schulman, J. (2019). Policy gradient search: Online planning and expert iteration without search trees. arXiv preprint arXiv:1904.03646.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. (2018). Distributional policy gradients. In International Conference on Learning Representations.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43.

Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122.

Csiszár, I. (1964). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85-108.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). OpenAI Baselines.

Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. (2015). Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. (2017). TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417.

Fox, R., Pakman, A., and Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.

Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275.

Google (2020). Cloud TPU - Google Cloud. https://cloud.google.com/tpu/.

Grill, J.-B., Domingues, O. D., Ménard, P., Munos, R., and Valko, M. (2019). Planning in entropy-regularized Markov decision processes and games. In Neural Information Processing Systems.

Guez, A., Weber, T., Antonoglou, I., Simonyan, K., Vinyals, O., Wierstra, D., Munos, R., and Silver, D. (2018). Learning to search with MCTSnets. arXiv preprint arXiv:1802.04697.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352-1361. JMLR.org.

Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Pfaff, T., Weber, T., Buesing, L., and Battaglia, P. W. (2020). Combining Q-learning and search with amortized value estimates. In International Conference on Learning Representations.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. (2018). Distributed prioritized experience replay. In International Conference on Learning Representations.

Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282-293. Springer.

Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.

Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412.

Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. (2017). Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035.

Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.

Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118-6128.

Rosin, C. D. (2011). Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203-230.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., et al. (2019). Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676):354-359.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2017c). The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, pages 3191-3199. JMLR.org.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., et al. (2019). V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063.

Tang, Y. and Agrawal, S. (2019). Discretizing continuous action space for on-policy optimization. arXiv preprint arXiv:1901.10500.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). DeepMind control suite. arXiv preprint arXiv:1801.00690.

Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. (2020). Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.

Ziebart, B. D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, USA.