Convex Regularization in Monte-Carlo Tree Search

Tuan Dam ¹   Carlo D'Eramo ¹   Jan Peters ¹ ²   Joni Pajarinen ¹ ³

Abstract

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off the exploitation of good actions and the exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by introducing the use of convex regularization in Monte-Carlo Tree Search (MCTS) to drive exploration efficiently and to improve policy updates. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the first regret analysis of regularized MCTS and showing that it guarantees an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and, more importantly, on the Tsallis entropy of the policy, for which we prove superior theoretical guarantees. We empirically verify the consequence of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated in AlphaGo, and we empirically show the superiority of convex regularization, w.r.t. representative baselines, on well-known RL problems across several Atari games.

1. Introduction

Monte-Carlo Tree Search (MCTS) is a well-known algorithm to solve decision-making problems through the combination of Monte-Carlo planning and an incremental tree structure (Coulom, 2006). MCTS provides a principled approach for trading off between exploration and exploitation in sequential decision making. Moreover, recent advances have shown how to enable MCTS in continuous and large problems (Silver et al., 2016; Yee et al., 2016). Most remarkably, AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017a;b) couple MCTS with neural networks trained using Reinforcement Learning (RL) (Sutton & Barto, 1998) methods, e.g., Deep Q-Learning (Mnih et al., 2015), to speed up learning of large scale problems. In particular, a neural network is used to compute value function estimates of states as a replacement of time-consuming Monte-Carlo rollouts, and another neural network is used to estimate policies as a probability prior for the therein introduced PUCT action selection strategy, a variant of the well-known UCT sampling strategy commonly used in MCTS for exploration (Kocsis et al., 2006). Despite AlphaGo and AlphaZero achieving state-of-the-art performance in games with a high branching factor like Go (Silver et al., 2016) and Chess (Silver et al., 2017a), both methods suffer from poor sample-efficiency, mostly due to the polynomial convergence rate of PUCT (Xiao et al., 2019). This problem, combined with the high computational time to evaluate the deep neural networks, significantly hinders the applicability of both methodologies.

In this paper, we provide a theory of the use of convex regularization in MCTS, which proved to be an efficient solution for driving exploration and stabilizing learning in RL (Schulman et al., 2015; 2017a; Haarnoja et al., 2018; Buesing et al., 2020).
In particular, we show how a regularized objective function in MCTS can be seen as an instance of the Legendre-Fenchel transform, similar to previous findings on the use of duality in RL (Mensch & Blondel, 2018; Geist et al., 2019; Nachum & Dai, 2020a) and game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007). Establishing our theoretical framework, we derive the first regret analysis of regularized MCTS, and prove that a generic strongly convex regularizer guarantees an exponential convergence rate to the solution of the regularized objective function, which improves on the polynomial rate of PUCT.

¹ Department of Computer Science, Technische Universität Darmstadt, Germany. ² Robot Learning Group, Max Planck Institute for Intelligent Systems, Tübingen, Germany. ³ Computing Sciences, Aalto University, Finland. Correspondence to: Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen.

These results provide a theoretical ground for the use of arbitrary entropy-based regularizers in MCTS, until now limited to maximum entropy (Xiao et al., 2019). Among them, we specifically study the relative entropy of policy updates, drawing on similarities with trust-region and proximal methods in RL (Schulman et al., 2015; 2017b), and the Tsallis entropy, used for enforcing the learning of sparse policies (Lee et al., 2018). Moreover, we provide an empirical analysis of the toy problem introduced in Xiao et al. (2019) to evince the practical consequences of our theoretical results for each regularizer. Finally, we empirically evaluate the proposed operators in AlphaGo, on several Atari games, confirming the benefit of convex regularization in MCTS, and in particular the superiority of Tsallis entropy w.r.t. other regularizers.

2. Preliminaries

2.1. Markov Decision Processes

We consider the classical definition of a finite-horizon Markov Decision Process (MDP) as a 5-tuple M = ⟨S, A, R, P, γ⟩, where S is the state space, A is the finite discrete action space, R : S × A × S → R is the reward function, P : S × A → S is the transition kernel, and γ ∈ [0, 1) is the discount factor. A policy π ∈ Π : S × A → R is a probability distribution over the event of executing an action a in a state s. A policy π induces a value function corresponding to the expected cumulative discounted reward collected by the agent when executing action a in state s, and following the policy π thereafter: Q^π(s, a) ≜ E[ Σ_{k=0}^∞ γ^k r_{i+k+1} | s_i = s, a_i = a, π ], where r_{i+1} is the reward obtained after the i-th transition. An MDP is solved by finding the optimal policy π*, i.e. the policy that maximizes the expected cumulative discounted reward. The optimal policy is the one satisfying the optimal Bellman equation (Bellman, 1954)

Q*(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ] ds',

and is the fixed point of the optimal Bellman operator

T*Q(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q(s', a') ] ds'.

Additionally, we define the Bellman operator under the policy π as

T_π Q(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ ∫_A π(a'|s') Q(s', a') da' ] ds',

the optimal value function V*(s) ≜ max_{a∈A} Q*(s, a), and the value function under the policy π as V^π(s) ≜ max_{a∈A} Q^π(s, a).

2.2. Monte-Carlo Tree Search and Upper Confidence bounds for Trees

Monte-Carlo Tree Search (MCTS) is a planning strategy based on a combination of Monte-Carlo sampling and tree search to solve MDPs. MCTS builds a tree where the nodes are the visited states of the MDP, and the edges are the actions executed in each state. MCTS converges to the optimal policy (Kocsis et al., 2006; Xiao et al., 2019), iterating over a loop composed of four steps:

1. Selection: starting from the root node, a tree-policy is executed to navigate the tree until a node with unvisited children, i.e. an expandable node, is reached;
2. Expansion: the reached node is expanded according to the tree policy;
3. Simulation: run a rollout, e.g. a Monte-Carlo simulation, from the visited child of the current node to the end of the episode;
4. Backup: use the collected reward to update the action-values Q(·) of the nodes visited in the trajectory from the root node to the expanded node.

The tree-policy used to select the action to execute in each node needs to balance the use of already known good actions and the visitation of unknown states. The Upper Confidence bounds for Trees (UCT) sampling strategy (Kocsis et al., 2006) extends the well-known UCB1 sampling strategy for multi-armed bandits (Auer et al., 2002) to MCTS. Considering each node corresponding to a state s ∈ S as a different bandit problem, UCT selects an action a ∈ A applying an upper bound to the action-value function

UCT(s, a) = Q(s, a) + ε √( log N(s) / N(s, a) ),     (1)

where N(s, a) is the number of executions of action a in state s, N(s) = Σ_a N(s, a), and ε is a constant parameter to tune exploration. UCT asymptotically converges to the optimal action-value function Q*, for all states and actions, with the probability of executing a suboptimal action at the root node approaching 0 with a polynomial rate O(1/t), for a simulation budget t (Kocsis et al., 2006; Xiao et al., 2019).
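To make the selection rule in Equation (1) concrete, the following minimal Python sketch shows UCT action selection at a node; the `Node`/`child` attributes and the constant `explo_const` are illustrative assumptions, not the paper's implementation.

```python
import math

def uct_select(node, explo_const=1.0):
    """Pick the child action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    `node.children` is assumed to map each action to an object holding the
    running statistics `q_value` (mean return) and `visits` N(s, a).
    Unvisited actions get priority so that every child is tried once.
    """
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # force at least one visit per action
        bonus = explo_const * math.sqrt(math.log(total_visits) / child.visits)
        score = child.q_value + bonus
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```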
3. Regularized Monte-Carlo Tree Search

The success of RL methods based on entropy regularization comes from their ability to achieve state-of-the-art performance in decision making and control problems, while enjoying theoretical guarantees and ease of implementation (Haarnoja et al., 2018; Schulman et al., 2015; Lee et al., 2018). However, the use of entropy regularization in MCTS is still mostly unexplored, although its advantageous exploration and value function estimation would be desirable to reduce the detrimental effect of the high branching factor in AlphaGo and AlphaZero. To the best of our knowledge, the MENTS algorithm (Xiao et al., 2019) is the first and only method to combine MCTS and entropy regularization.

In particular, MENTS uses a maximum entropy regularizer in AlphaGo, proving an exponential convergence rate to the solution of the respective softmax objective function and achieving state-of-the-art performance in some Atari games (Bellemare et al., 2013). In the following, motivated by the success in RL and the promising results of MENTS, we derive a unified theory of regularization in MCTS based on the Legendre-Fenchel transform (Geist et al., 2019), which generalizes the use of maximum entropy in MENTS to an arbitrary convex regularizer. Notably, our theoretical framework enables us to rigorously motivate the advantages of using maximum entropy and other entropy-based regularizers, such as relative entropy or Tsallis entropy, drawing connections with their RL counterparts TRPO (Schulman et al., 2015) and Sparse DQN (Lee et al., 2018), as MENTS does with Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

3.1. Legendre-Fenchel transform

Consider an MDP M = ⟨S, A, R, P, γ⟩, as previously defined. Let Ω : Π → R be a strongly convex function. For a policy π_s = π(·|s) and Q_s = Q(s, ·) ∈ R^A, the Legendre-Fenchel transform (or convex conjugate) of Ω is Ω* : R^A → R, defined as

Ω*(Q_s) ≜ max_{π_s ∈ Π_s} T_{π_s} Q_s − τΩ(π_s),     (2)

where the temperature τ specifies the strength of regularization. Among the several properties of the Legendre-Fenchel transform, we use the following (Mensch & Blondel, 2018; Geist et al., 2019).

Proposition 1 Let Ω be strongly convex.

• Unique maximizing argument: ∇Ω* is Lipschitz and satisfies

∇Ω*(Q_s) = argmax_{π_s ∈ Π_s} T_{π_s} Q_s − τΩ(π_s).     (3)

• Boundedness: if there are constants L_Ω and U_Ω such that for all π_s ∈ Π_s we have L_Ω ≤ Ω(π_s) ≤ U_Ω, then

max_{a∈A} Q_s(a) − τU_Ω ≤ Ω*(Q_s) ≤ max_{a∈A} Q_s(a) − τL_Ω.     (4)

• Contraction: for any Q_1, Q_2 ∈ R^{S×A},

‖Ω*(Q_1) − Ω*(Q_2)‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.     (5)

Note that if Ω(·) is strongly convex, τΩ(·) is also strongly convex; thus all the properties shown in Proposition 1 still hold¹.

¹Other works use the same formula, e.g. Equation (2) in Niculae & Blondel (2017).

Solving equation (2) leads to the solution of the optimal primal policy function ∇Ω*(·). Since Ω(·) is strongly convex, the dual function Ω*(·) is also convex. One can solve the optimization problem (2) in the dual space (Nachum & Dai, 2020b) as

Ω(π_s) = max_{Q_s ∈ R^A} T_{π_s} Q_s − τΩ*(Q_s),     (6)

and find the solution of the optimal dual value function Ω*(·). Note that the Legendre-Fenchel transform of the value conjugate function is the convex function Ω, i.e. Ω** = Ω. In the next section, we leverage this primal-dual connection based on the Legendre-Fenchel transform, as both conjugate value function and policy function, to derive the regularized MCTS backup and tree policy.

3.2. Regularized backup and tree policy

In MCTS, each node of the tree represents a state s ∈ S and contains a visitation count N(s, a). Given a trajectory, we define n(s_T) as the leaf node corresponding to the reached state s_T. Let s_0, a_0, s_1, a_1, ..., s_T be the state-action trajectory in a simulation, where n(s_T) is a leaf node of the tree. Whenever a node n(s_T) is expanded, the respective action-values (Equation 7) are initialized as Q_Ω(s_T, a) = 0 and N(s_T, a) = 0 for all a ∈ A. For all nodes in the trajectory, the visitation count is updated by N(s_t, a_t) = N(s_t, a_t) + 1, and the action-values by

Q_Ω(s_t, a_t) = r(s_t, a_t) + γρ  if t = T;   Q_Ω(s_t, a_t) = r(s_t, a_t) + γΩ*(Q_Ω(s_{t+1})/τ)  if t < T,     (7)

where Q_Ω(s_{t+1}) ∈ R^A is the vector with components Q_Ω(s_{t+1}, a), ∀a ∈ A, and ρ is an estimate returned from an evaluation function computed in s_T, e.g. a discounted cumulative reward averaged over multiple rollouts, or the value function of node n(s_{T+1}) returned by a value-function approximator, e.g. a neural network pretrained with deep Q-learning (Mnih et al., 2015), as done in (Silver et al., 2016; Xiao et al., 2019).
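As an illustration of the backup in Equation (7), here is a minimal sketch instantiated with the maximum-entropy conjugate (the log-sum-exp of Table 1); the `Node` fields (`visits`, `q_omega`) and the helper `softmax_value` are our own illustrative assumptions, not the authors' code.

```python
import math

def softmax_value(q_values, tau):
    """Maximum-entropy conjugate value: tau * log sum_a exp(Q(a)/tau)."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def backup(trajectory, rho, gamma=0.99, tau=0.1):
    """Regularized backup of Equation (7) along one simulated trajectory.

    `trajectory` is a list of (node, action, reward) triples from the root to
    the expanded leaf; `rho` is the leaf evaluation (rollout return or a
    learned value estimate).
    """
    value = rho
    for node, action, reward in reversed(trajectory):
        node.visits[action] += 1
        node.q_omega[action] = reward + gamma * value
        # value propagated to the parent: softmax (log-sum-exp) of this node's Q-values
        value = softmax_value(list(node.q_omega.values()), tau)
    return value
```

With a different regularizer, only the conjugate used to compute the propagated value changes, e.g. the sparsemax-based value of Equation (11) for Tsallis entropy.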
We revisit the E2W sampling strategy, limited to maximum entropy regularization (Xiao et al., 2019), and, through the use of the convex conjugate in Equation (7), we derive a novel sampling strategy that generalizes to any convex regularizer:

π_t(a_t|s_t) = (1 − λ_{s_t}) ∇Ω*(Q_Ω(s_t)/τ)(a_t) + λ_{s_t}/|A|,     (8)

where λ_{s_t} = ε|A| / log(Σ_a N(s_t, a) + 1), with ε > 0 as an exploration parameter, and ∇Ω* depends on the measure in use (see Table 1 for maximum, relative, and Tsallis entropy). We call this sampling strategy Extended Empirical Exponential Weight (E3W) to highlight the extension of E2W from maximum entropy to a generic convex regularizer. E3W defines the connection to the duality representation using the Legendre-Fenchel transform, which is missing in E2W.
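A possible implementation of the E3W tree policy in Equation (8), here instantiated with the maximum-entropy maximizing argument (a softmax over Q/τ); the node attributes and the small tweaks noted in the comments are illustrative assumptions, not the paper's reference code.

```python
import math
import random

def softmax_policy(q_values, tau):
    """Maximizing argument of the max-entropy regularizer: softmax of Q/tau."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def e3w_sample(node, tau=0.1, epsilon=0.1):
    """Sample an action with E3W: mix grad Omega*(Q/tau) with a uniform distribution."""
    actions = list(node.q_omega.keys())
    q_values = [node.q_omega[a] for a in actions]
    n_visits = sum(node.visits[a] for a in actions)
    # lambda_{s_t} = eps*|A| / log(sum_a N + 1); we add +2 and clamp to 1 so the
    # mixing weight is well defined at the very first visits (a practical tweak)
    lam = min(1.0, epsilon * len(actions) / math.log(n_visits + 2))
    greedy = softmax_policy(q_values, tau)
    probs = [(1 - lam) * g + lam / len(actions) for g in greedy]
    return random.choices(actions, weights=probs, k=1)[0]
```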

Moreover, while the Legendre-Fenchel transform can be used to derive a theory of several state-of-the-art algorithms in RL, such as TRPO, SAC, A3C (Geist & Scherrer, 2011), our result is the first introducing the connection with MCTS.

3.3. Convergence rate to regularized objective

We show that the regularized value V_Ω can be effectively estimated at the root state s ∈ S, with the assumption that each node in the tree has a σ²-subgaussian distribution. This result extends the analysis provided in (Xiao et al., 2019), which is limited to the use of maximum entropy.

Theorem 1 At the root node s, where N(s) is the number of visitations and V_Ω(s) is the estimated value, for ε > 0 and constants C and Ĉ, we have

P(|V_Ω(s) − V*_Ω(s)| > ε) ≤ C exp{ −N(s)ε / (Ĉσ log²(2 + N(s))) },     (9)

where V_Ω(s) = Ω*(Q_s) and V*_Ω(s) = Ω*(Q*_s).

From this theorem, we obtain that the convergence rate of choosing the best action a* at the root node, when using the E3W strategy, is exponential.

Theorem 2 Let a_t be the action returned by E3W at step t. For large enough t and constants C, Ĉ,

P(a_t ≠ a*) ≤ Ct exp{ −t / (Ĉσ (log t)³) }.     (10)

This result shows that, for every strongly convex regularizer, the convergence rate of choosing the best action at the root node is exponential, as already proven in the specific case of maximum entropy (Xiao et al., 2019).

4. Entropy-regularization backup operators

From the introduction of a unified view of generic strongly convex regularizers as backup operators in MCTS, we narrow the analysis to entropy-based regularizers. For each entropy function, Table 1 shows the Legendre-Fenchel transform and the maximizing argument, which can be respectively replaced in our backup operation (Equation 7) and sampling strategy E3W (Equation 8). Using maximum entropy retrieves the maximum entropy MCTS problem introduced in the MENTS algorithm (Xiao et al., 2019). This approach closely resembles the maximum entropy RL framework used to encourage exploration (Haarnoja et al., 2018; Schulman et al., 2017a). We introduce two novel MCTS algorithms based, respectively, on the minimization of the relative entropy of the policy update, inspired by trust-region (Schulman et al., 2015) and proximal optimization methods (Schulman et al., 2017b) in RL, and on the maximization of Tsallis entropy, which has been more recently introduced in RL as an effective solution to enforce the learning of sparse policies (Lee et al., 2018). We call these algorithms RENTS and TENTS. Contrary to maximum and relative entropy, the definition of the Legendre-Fenchel transform and maximizing argument of Tsallis entropy is non-trivial, being

Ω*(Q_t) = τ · spmax(Q_t(s, ·)/τ),     (11)

∇Ω*(Q_t) = max{ Q_t(s, a)/τ − (Σ_{a∈K} Q_t(s, a)/τ − 1)/|K|, 0 },     (12)

where spmax is defined for any function f : S × A → R as

spmax(f(s, ·)) ≜ Σ_{a∈K} ( f(s, a)²/2 − (Σ_{a∈K} f(s, a) − 1)²/(2|K|²) ) + 1/2,     (13)

and K is the set of actions satisfying 1 + i·f(s, a_i) > Σ_{j=1}^i f(s, a_j), with a_i indicating the action with the i-th largest value of f(s, a) (Lee et al., 2018).

4.1. Regret analysis

At the root node, let each child node i be assigned a random variable X_i, with mean value V_i, while the quantities related to the optimal branch are denoted by *, e.g. the mean value V*. At each timestep n, the mean value of variable X_{i_n} is V_{i_n}. The pseudo-regret (Coquelin & Munos, 2007) at the root node, at timestep n, is defined as R_n^{UCT} = nV* − Σ_{t=1}^n V_{i_t}. Similarly, we define the regret of E3W at the root node of the tree as

R_n = nV* − Σ_{t=1}^n V_{i_t} = nV* − Σ_{t=1}^n Σ_i I(i_t = i) V_i = nV* − Σ_i V_i Σ_{t=1}^n π̂_t(a_i|s),     (14)

where π̂_t(·) is the policy at time step t, and I(·) is the indicator function. The expected regret is defined as

E[R_n] = nV* − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩.     (15)

Table 1. List of entropy regularizers with Legendre-Fenchel transforms and maximizing arguments.

Entropy     Regularizer Ω(π_s)                    Legendre-Fenchel Ω*(Q_s)                     Max argument ∇Ω*(Q_s)

Maximum     Σ_a π(a|s) log π(a|s)                 τ log Σ_a e^{Q(s,a)/τ}                       e^{Q(s,a)/τ} / Σ_b e^{Q(s,b)/τ}
Relative    D_KL(π_t(a|s) ‖ π_{t−1}(a|s))         τ log Σ_a π_{t−1}(a|s) e^{Q_t(s,a)/τ}        π_{t−1}(a|s) e^{Q_t(s,a)/τ} / Σ_b π_{t−1}(b|s) e^{Q_t(s,b)/τ}
Tsallis     ½ (‖π(a|s)‖²₂ − 1)                    Equation (11)                                Equation (12)
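The maximizing arguments in Table 1 can be computed in a few lines; the sketch below follows the closed forms above (softmax, prior-weighted softmax, and the sparsemax projection of Equation (12)) and is an illustrative implementation under those formulas, not the authors' code.

```python
import numpy as np

def max_entropy_argmax(q, tau):
    """Softmax of Q/tau (maximum entropy row of Table 1)."""
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def relative_entropy_argmax(q, prior, tau):
    """Prior-weighted softmax (relative entropy row of Table 1)."""
    z = prior * np.exp((q - q.max()) / tau)
    return z / z.sum()

def tsallis_argmax(q, tau):
    """Sparsemax projection of Q/tau (Equation (12)); K is the support set."""
    v = np.sort(q / tau)[::-1]            # values sorted in decreasing order
    cumsum = np.cumsum(v)
    k = np.arange(1, len(v) + 1)
    support = 1 + k * v > cumsum          # condition defining the set K
    k_size = support.sum()
    threshold = (cumsum[k_size - 1] - 1) / k_size
    return np.maximum(q / tau - threshold, 0.0)
```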

Theorem 3 Consider an E3W policy applied to the tree. Let D_Ω*(x, y) = Ω*(x) − Ω*(y) − ∇Ω*(y)(x − y) be the Bregman divergence between x and y. The expected pseudo-regret R_n satisfies

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).     (16)

This theorem bounds the regret of E3W for a generic convex regularizer Ω; the regret bounds for each entropy regularizer can be easily derived from it. Let m = min_a ∇Ω*(a|s).

Corollary 1 Maximum entropy regret: E[R_n] ≤ τ log|A| + n|A|/τ + O(n/log n).

Corollary 2 Relative entropy regret: E[R_n] ≤ τ(log|A| − 1/m) + n|A|/τ + O(n/log n).

Corollary 3 Tsallis entropy regret: E[R_n] ≤ τ (|A|−1)/|A| + n|K|/2 + O(n/log n).

Remarks. The regret bound of UCT and its variance have already been analyzed for non-regularized MCTS with a binary tree (Coquelin & Munos, 2007). On the contrary, our regret bound analysis in Theorem 3 applies to generic regularized MCTS. From the specialized bounds in the corollaries, we observe that maximum and relative entropy share similar results, although the bounds for relative entropy are slightly smaller due to 1/m. Remarkably, the bounds for Tsallis entropy become tighter for an increasing number of actions, which translates into limited regret in problems with a high branching factor. This result establishes the advantage of Tsallis entropy in complex problems w.r.t. other entropy regularizers, as empirically confirmed in Section 5.

4.2. Error analysis

We analyse the error of the regularized value estimate at the root node n(s) w.r.t. the optimal value: ε_Ω = V_Ω(s) − V*(s).

Theorem 4 For any δ > 0 and generic convex regularizer Ω, with some constants C, Ĉ, with probability at least 1 − δ, ε_Ω satisfies

−√( Ĉσ² log(C/δ) / (2N(s)) ) − τ(U_Ω − L_Ω)/(1 − γ) ≤ ε_Ω ≤ √( Ĉσ² log(C/δ) / (2N(s)) ).     (17)

To the best of our knowledge, this theorem provides the first result on the error analysis of value estimation at the root node of convex regularization in MCTS. To give a better understanding of the effect of each entropy regularizer in Table 1, we specialize the bound in Equation (17) to each of them. From (Lee et al., 2018), we know that for maximum entropy Ω(π_t) = Σ_a π_t log π_t we have −log|A| ≤ Ω(π_t) ≤ 0; for relative entropy Ω(π_t) = KL(π_t‖π_{t−1}), if we define m = min_a π_{t−1}(a|s), then we can derive 0 ≤ Ω(π_t) ≤ −log|A| + log(1/m); and for Tsallis entropy Ω(π_t) = ½(‖π_t‖²₂ − 1), we have −(|A|−1)/(2|A|) ≤ Ω(π_t) ≤ 0. Then, defining Ψ = √( Ĉσ² log(C/δ) / (2N(s)) ),

Corollary 4 Maximum entropy error: −Ψ − τ log|A| / (1 − γ) ≤ ε_Ω ≤ Ψ.

Corollary 5 Relative entropy error: −Ψ − τ(log|A| − log(1/m)) / (1 − γ) ≤ ε_Ω ≤ Ψ.

Corollary 6 Tsallis entropy error: −Ψ − ((|A|−1)/(2|A|)) · τ/(1 − γ) ≤ ε_Ω ≤ Ψ.

These results show that when the number of actions |A| is large, TENTS enjoys the smallest error; moreover, we also see that the lower bound of RENTS is always smaller than that of MENTS.
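To make the comparison between the corollaries concrete, the following back-of-the-envelope calculation (ours, not from the paper) instantiates the regret constants of Corollaries 1 and 3 for an Atari-sized action space of |A| = 18:

```latex
\[
\underbrace{\tau \log |A|}_{\text{maximum entropy (Cor. 1)}} = \tau \log 18 \approx 2.89\,\tau,
\qquad
\underbrace{\tau \tfrac{|A|-1}{|A|}}_{\text{Tsallis entropy (Cor. 3)}} = \tau \tfrac{17}{18} \approx 0.94\,\tau .
\]
% The Tsallis constant stays below tau for any |A|, whereas the maximum-entropy
% constant grows logarithmically with the number of actions.
```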

5. Empirical evaluation

In this section, we empirically evaluate the benefit of the proposed entropy-based MCTS regularizers. First, we complement our theoretical analysis with an empirical study of the synthetic tree toy problem introduced in Xiao et al. (2019), which serves as a simple scenario to give an interpretable demonstration of the effects of our theoretical results in practice. Second, we compare to AlphaGo (Silver et al., 2016), recently introduced to enable MCTS to solve large scale problems with a high branching factor. Our implementation is a simplified version of the original algorithm, where we remove various tricks in favor of better interpretability. For the same reason, we do not compare with the most recent and state-of-the-art MuZero (Schrittwieser et al., 2019), as this is a slightly different solution, highly tuned to maximize performance, and a detailed description of its implementation is not available.

Figure 1. For each algorithm (UCT, MENTS, RENTS, TENTS) and for tree configurations (k = 16, d = 1), (k = 4, d = 2), (k = 8, d = 3), (k = 12, d = 4), (k = 16, d = 5), we show, as a function of the number of simulations, the convergence of the value estimate at the root node to the respective optimal value (top), to the UCT optimal value (middle), and the regret (bottom).

5.1. Synthetic tree

This toy problem is introduced in Xiao et al. (2019) to highlight the improvement of MENTS over UCT. It consists of a tree with branching factor k and depth d. Each edge of the tree is assigned a random value between 0 and 1. At each leaf, a Gaussian distribution is used as an evaluation function resembling the return of random rollouts. The mean of the Gaussian distribution is the sum of the values assigned to the edges connecting the root node to the leaf, while the standard deviation is σ = 0.05.² For stability, all the means are normalized between 0 and 1. As in Xiao et al. (2019), we create 5 trees, on each of which we perform 5 different runs, resulting in 25 experiments, for all the combinations of branching factor k = {2, 4, 6, 8, 10, 12, 14, 16} and depth d = {1, 2, 3, 4, 5}, computing: (i) the value estimation error at the root node w.r.t. the regularized optimal value, ε_Ω = V_Ω − V*_Ω; (ii) the value estimation error at the root node w.r.t. the unregularized optimal value, ε_UCT = V_Ω − V*_UCT; (iii) the regret R as in Equation (14). For a fair comparison, we use fixed τ = 0.1 and ε = 0.1 across all algorithms.

²The value of the standard deviation is not provided in Xiao et al. (2019). After trying different values, we observed that our results match the ones in Xiao et al. (2019) when using σ = 0.05.

Figures 1 and 2 show how UCT and each regularizer behave for different configurations of the tree. We observe that, while RENTS and MENTS converge slower for increasing tree sizes, TENTS is robust w.r.t. the size of the tree and almost always converges faster than all other methods to the respective optimal value. Notably, the optimal value of TENTS seems to be very close to the one of UCT, i.e. the optimal value of the unregularized objective, and also converges faster than the one estimated by UCT, while MENTS and RENTS are considerably further from this value. In terms of regret, UCT explores less than the regularized methods and is less prone to high regret, at the cost of slower convergence time. Nevertheless, the regret of TENTS is the smallest among the regularizers, which otherwise seem to explore too much. These results show a general superiority of TENTS in this toy problem, also confirming our theoretical findings about the advantage of TENTS in terms of approximation error (Corollary 6) and regret (Corollary 3) in problems with many actions.
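A minimal sketch of the synthetic-tree environment described above (random edge values in [0, 1], Gaussian leaf evaluations with mean equal to the sum of edge values, σ = 0.05, and normalized leaf means); the class and method names are our own illustrative choices, not the reference implementation.

```python
import random

class SyntheticTree:
    """k-ary tree of depth d with a noisy Gaussian evaluation at each leaf."""

    def __init__(self, k, d, sigma=0.05, seed=0):
        self.k, self.d, self.sigma = k, d, sigma
        rng = random.Random(seed)
        self.edge_values = {}               # one random value per (path-prefix,)
        self._build(rng, path=(), depth=0)
        # normalize leaf means to [0, 1] for stability, as in the paper
        leaf_means = [self._mean(p) for p in self._leaf_paths()]
        self._lo, self._hi = min(leaf_means), max(leaf_means)

    def _build(self, rng, path, depth):
        if depth == self.d:
            return
        for a in range(self.k):
            self.edge_values[path + (a,)] = rng.random()
            self._build(rng, path + (a,), depth + 1)

    def _leaf_paths(self, path=()):
        if len(path) == self.d:
            yield path
            return
        for a in range(self.k):
            yield from self._leaf_paths(path + (a,))

    def _mean(self, path):
        return sum(self.edge_values[path[:i + 1]] for i in range(len(path)))

    def evaluate(self, path):
        """Noisy rollout return at the leaf reached by `path` (a tuple of actions)."""
        mean = (self._mean(path) - self._lo) / (self._hi - self._lo + 1e-12)
        return random.gauss(mean, self.sigma)
```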


Figure 2. For different branching factors k (rows) and depths d (columns), the heatmaps show: the absolute error of the value estimate at the root node after the last simulation of each algorithm w.r.t. the respective optimal value (a), and w.r.t. the optimal value of UCT (b); the regret at the root node (c).

5.2. Entropy-regularized AlphaGo

The learning time of AlphaZero can be slow in problems with a high branching factor, due to the need of a large number of MCTS simulations for obtaining good estimates of the randomly initialized action-values. To overcome this problem, AlphaGo (Silver et al., 2016) initializes the action-values using the values retrieved from a pretrained network, which is kept fixed during the training.

Atari. Atari 2600 (Bellemare et al., 2013) is a popular benchmark for testing deep RL methodologies (Mnih et al., 2015; Van Hasselt et al., 2016; Bellemare et al., 2017) but is still relatively disregarded in MCTS. We use a Deep Q-Network, pretrained using the same experimental setting of Mnih et al. (2015), to initialize the action-value function of each node after expansion as Q_init(s, a) = (Q(s, a) − V(s))/τ, for MENTS and TENTS, as done in Xiao et al. (2019). For RENTS we initialize Q_init(s, a) = log P_prior(a|s) + (Q(s, a) − V(s))/τ, where P_prior is the Boltzmann distribution induced by the action-values Q(s, ·) computed from the network. Each experimental run consists of 512 MCTS simulations. The temperature τ is optimized for each algorithm and game via grid-search between 0.01 and 1. The discount factor is γ = 0.99, and for PUCT the exploration constant is c = 0.1. Table 2 shows the performance, in terms of cumulative reward, of standard AlphaGo with PUCT and of our three regularized versions, on 22 Atari games. Moreover, we also test AlphaGo using the MaxMCTS backup (Khandelwal et al., 2016) for further comparison with classic baselines. We observe that regularized MCTS dominates the other baselines; in particular, TENTS achieves the highest scores in all the 22 games, showing that sparse policies are more effective in Atari. TENTS significantly outperforms the other methods in the games with many actions, e.g. Asteroids and Phoenix, confirming the results obtained in the synthetic tree experiment and explained by Corollaries 3 and 6 on the benefit of TENTS in problems with a high branching factor.
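The node initialization described above can be sketched as follows; `dqn_q_values` stands in for the pretrained Deep Q-Network outputs, and the choices V(s) = max_a Q(s, a) and a temperature-1 Boltzmann prior are our assumptions for illustration.

```python
import numpy as np

def init_node_values(dqn_q_values, tau=0.1, use_prior=False):
    """Initialize Q_init(s, a) for a newly expanded node from pretrained DQN values.

    MENTS/TENTS: Q_init = (Q - V) / tau, with V = max_a Q(s, a) (assumed here).
    RENTS:       Q_init = log P_prior + (Q - V) / tau, with P_prior the Boltzmann
                 distribution induced by the DQN action-values.
    """
    q = np.asarray(dqn_q_values, dtype=float)
    v = q.max()
    q_init = (q - v) / tau
    if use_prior:  # RENTS
        logits = q - v
        # temperature-1 softmax; the paper does not specify the prior temperature
        log_prior = logits - np.log(np.exp(logits).sum())
        q_init = log_prior + (q - v) / tau
    return q_init
```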
Table 2. Average score in Atari over 100 seeds per game. The bottom row reports the number of games in which each method shows no statistically significant difference to the highest mean (t-test, p < 0.05).

Game            UCT         MaxMCTS     MENTS       RENTS       TENTS
Alien           1,486.80    1,461.10    1,508.60    1,547.80    1,568.60
Amidar          115.62      124.92      123.30      125.58      121.84
Asterix         4,855.00    5,484.50    5,576.00    5,743.50    5,647.00
Asteroids       873.40      899.60      1,414.70    1,486.40    1,642.10
Atlantis        35,182.00   35,720.00   36,277.00   35,314.00   35,756.00
BankHeist       475.50      458.60      622.30      636.70      631.40
BeamRider       2,616.72    2,661.30    2,822.18    2,558.94    2,804.88
Breakout        303.04      296.14      309.03      300.35      316.68
Centipede       1,782.18    1,728.69    2,012.86    2,253.42    2,258.89
DemonAttack     579.90      640.80      1,044.50    1,124.70    1,113.30
Enduro          129.28      124.20      128.79      134.88      132.05
Frostbite       1,244.00    1,332.10    2,388.20    2,369.80    2,260.60
Gopher          3,348.40    3,303.00    3,536.40    3,372.80    3,447.80
Hero            3,009.95    3,010.55    3,044.55    3,077.20    3,074.00
MsPacman        1,940.20    1,907.10    2,018.30    2,190.30    2,094.40
Phoenix         2,747.30    2,626.60    3,098.30    2,582.30    3,975.30
Qbert           7,987.25    8,033.50    8,051.25    8,254.00    8,437.75
Robotank        11.43       11.00       11.59       11.51       11.47
Seaquest        3,276.40    3,217.20    3,312.40    3,345.20    3,324.40
Solaris         895.00      923.20      1,118.20    1,115.00    1,127.60
SpaceInvaders   778.45      835.90      832.55      867.35      822.95
WizardOfWor     685.00      666.00      1,211.00    1,241.00    1,231.00
# Highest mean  6/22        7/22        17/22       16/22       22/22

6. Related Work

Entropy regularization is a common tool for controlling exploration in Reinforcement Learning (RL) and has led to several successful methods (Schulman et al., 2015; Haarnoja et al., 2018; Schulman et al., 2017a; Mnih et al., 2016). Typically, specific forms of entropy are utilized, such as maximum entropy (Haarnoja et al., 2018) or relative entropy (Schulman et al., 2015). This approach is an instance of the more generic duality framework, commonly used in convex optimization theory. Duality has been extensively studied in game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007) and more recently in RL, for instance considering mirror descent optimization (Montgomery & Levine, 2016; Mei et al., 2019), drawing the connection between MCTS and regularized policy optimization (Grill et al., 2020), or formalizing the RL objective via Legendre-Rockafellar duality (Nachum & Dai, 2020a). Recently, Geist et al. (2019) introduced regularized Markov Decision Processes, formalizing the RL objective with a generalized form of convex regularization, based on the Legendre-Fenchel transform. In this paper, we provide a novel study of convex regularization in MCTS, and derive relative entropy (KL-divergence) and Tsallis entropy regularized MCTS algorithms, i.e. RENTS and TENTS respectively. Note that the recent maximum entropy MCTS algorithm MENTS (Xiao et al., 2019) is a special case of our generalized regularized MCTS. Unlike MENTS, RENTS can take advantage of any action distribution prior; in the experiments, the prior is derived using Deep Q-learning (Mnih et al., 2015). On the other hand, TENTS allows for sparse action exploration and thus higher dimensional action spaces compared to MENTS.

Several works focus on modifying classical MCTS to improve exploration. UCB1-tuned (Auer et al., 2002) modifies the upper confidence bound of UCB1 to account for variance in order to improve exploration. Tesauro et al. (2012) propose a Bayesian version of UCT, which obtains better estimates of node values and uncertainties given limited experience. Many heuristic approaches based on specific domain knowledge have been proposed, such as adding a bonus term to value estimates (Gelly & Wang, 2006; Teytaud & Teytaud, 2010; Childs et al., 2008; Kozelek, 2009; Chaslot et al., 2008) or prior knowledge collected during policy search (Gelly & Silver, 2007; Helmbold & Parker-Wood, 2009; Lorentz, 2010; Tom, 2010; Hoock et al., 2010). Khandelwal et al. (2016) formalize and analyze different on-policy and off-policy complex backup approaches for MCTS planning based on RL techniques. Vodopivec et al. (2017) propose an approach called SARSA-UCT, which performs the backups using SARSA (Rummery, 1995). Both Khandelwal et al. (2016) and Vodopivec et al. (2017) directly borrow value backup ideas from RL to estimate the value at each tree node, but they do not provide any proof of convergence.

7. Conclusion

We introduced a theory of convex regularization in Monte-Carlo Tree Search (MCTS) based on the Legendre-Fenchel transform. We proved that a generic strongly convex regularizer has an exponential convergence rate for the selection of the optimal action at the root node. Our result gives theoretical motivations to previous results specific to maximum entropy regularization.
Furthermore, we provided the first study of the regret of MCTS when using a generic strongly convex regularizer, and an analysis of the error between the regularized value estimate at the root node and the optimal regularized value. We use these results to motivate the use of entropy regularization in MCTS, considering maximum, relative, and Tsallis entropy, and we specialized our regret and approximation error bounds to each entropy regularizer. We tested our regularized MCTS algorithm in a simple toy problem, where we give empirical evidence of the effect of our theoretical bounds for the regret and approximation error. Finally, we introduced the use of convex regularization in AlphaGo, and carried out experiments on several Atari games. Overall, our empirical results show the advantages of convex regularization, and in particular the superiority of Tsallis entropy w.r.t. other entropy regularizers.

References

Abernethy, J., Lee, C., and Tewari, A. Fighting bandits with a new kind of smoothness. arXiv preprint arXiv:1512.04152, 2015.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 449–458. JMLR.org, 2017.

Bellman, R. The theory of dynamic programming. Technical report, Rand Corporation, Santa Monica, CA, 1954.

Buesing, L., Heess, N., and Weber, T. Approximate inference in discrete distributions with monte carlo tree search and value functions. In International Conference on Artificial Intelligence and Statistics, pp. 624–634. PMLR, 2020.

Chaslot, G., Winands, M., Herik, J. V. D., Uiterwijk, J., and Bouzy, B. Progressive strategies for monte-carlo tree search. New Mathematics and Natural Computation, 4(03):343–357, 2008.

Childs, B. E., Brodeur, J. H., and Kocsis, L. Transpositions and move groups in monte carlo tree search. In 2008 IEEE Symposium On Computational Intelligence and Games. IEEE, 2008.

Coquelin, P.-A. and Munos, R. Bandit algorithms for tree search. arXiv preprint cs/0703062, 2007.

Coulom, R. Efficient selectivity and backup operators in monte-carlo tree search. In International Conference on Computers and Games, pp. 72–83. Springer, 2006.

Geist, M. and Scherrer, B. L1-penalized projected bellman residual. In Proceedings of the European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS). Springer Verlag, Heidelberg, September 2011.

Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized markov decision processes. In International Conference on Machine Learning, pp. 2160–2169, 2019.

Gelly, S. and Silver, D. Combining online and offline knowledge in uct. In Proceedings of the 24th International Conference on Machine Learning, pp. 273–280. ACM, 2007.

Gelly, S. and Wang, Y. Exploration exploitation in go: Uct for monte-carlo go. In NIPS On-line Trading of Exploration and Exploitation Workshop, 2006.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. arXiv preprint arXiv:2007.12509, 2020.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.

Helmbold, D. P. and Parker-Wood, A. All-moves-as-first heuristics in monte-carlo go. In IC-AI, pp. 605–610, 2009.

Hoock, J.-B., Lee, C.-S., Rimmel, A., Teytaud, F., Wang, M.-H., and Teytaud, O. Intelligent agents for the game of go. IEEE Computational Intelligence Magazine, 2010.

Khandelwal, P., Liebman, E., Niekum, S., and Stone, P. On the analysis of complex backup strategies in monte carlo tree search. In International Conference on Machine Learning, 2016.

Kocsis, L., Szepesvári, C., and Willemson, J. Improved monte-carlo search, 2006.

Kozelek, T. Methods of mcts and the game, 2009.

Lee, K., Choi, S., and Oh, S. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, 2018.

Lorentz, R. J. Improving monte-carlo tree search in havannah. In International Conference on Computers and Games, pp. 105–115. Springer, 2010.

Mei, J., Xiao, C., Huang, R., Schuurmans, D., and Müller, M. On principled entropy exploration in policy optimization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3130–3136. AAAI Press, 2019.

Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In International Conference on Machine Learning, pp. 3462–3471, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Montgomery, W. H. and Levine, S. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pp. 4008–4016, 2016.

Nachum, O. and Dai, B. Reinforcement learning via fenchel-rockafellar duality. CoRR, abs/2001.01866, 2020a.

Nachum, O. and Dai, B. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020b.

Niculae, V. and Blondel, M. A regularized framework for sparse and structured neural attention. arXiv preprint arXiv:1705.07704, 2017.

Pavel, L. An extension of duality to a game-theoretic framework. Automatica, 43(2):226–237, 2007.

Rummery, G. A. Problem solving with reinforcement learning. PhD thesis, University of Cambridge, 1995.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering atari, go, chess and shogi by planning with a learned model, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017a.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

Shalev-Shwartz, S. and Singer, Y. Convex repeated games and fenchel duality. Advances in Neural Information Processing Systems, 19:1265–1272, 2006.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017b.

Sutton, R. S. and Barto, A. G. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

Tesauro, G., Rajan, V., and Segal, R. Bayesian inference in monte-carlo tree search. arXiv preprint arXiv:1203.3519, 2012.

Teytaud, F. and Teytaud, O. On the huge benefit of decisive moves in monte-carlo tree search algorithms. In Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, pp. 359–364. IEEE, 2010.

Tom, D. Investigating uct and rave: Steps towards a more robust method, 2010.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Vodopivec, T., Samothrakis, S., and Šter, B. On monte carlo tree search and reinforcement learning. Journal of Artificial Intelligence Research, 60:881–936, 2017.

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Xiao, C., Huang, R., Mei, J., Schuurmans, D., and Müller, M. Maximum entropy monte-carlo planning. In Advances in Neural Information Processing Systems, pp. 9516–9524, 2019.

Yee, T., Lisý, V., Bowling, M. H., and Kambhampati, S. Monte carlo tree search in continuous action spaces with execution uncertainty. In IJCAI, pp. 690–697, 2016.

Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 467–475. PMLR, 2019.

A. Proofs

In this section, we describe how to derive the theoretical results presented in the paper. First, the exponential convergence rate of the estimated value function to the conjugate regularized value function at the root node (Theorem 1) is derived by induction with respect to the depth D of the tree. When D = 1, we derive the concentration of the average reward at the leaf node with respect to the ∞-norm (as shown in Lemma 1) based on the result of Theorem 2.19 in (Wainwright, 2019), and the induction is done over the tree by additionally exploiting the contraction property of the convex regularized value function. Second, based on Theorem 1, we prove the exponential convergence rate of choosing the best action at the root node (Theorem 2). Third, the pseudo-regret analysis of E3W is derived based on the Bregman divergence properties and the contraction properties of the Legendre-Fenchel transform (Proposition 1). Finally, the bias error of the estimated value at the root node is derived from the results of Theorem 1 and the boundedness property of the Legendre-Fenchel transform (Proposition 1). Let r̂ and r be respectively the average and the expected reward at the leaf node, and let the reward distribution at the leaf node be σ²-sub-Gaussian.

Lemma 1 For the stochastic bandit problem, E3W guarantees that, for t ≥ 4,

P( ‖r − r̂_t‖_∞ ≥ 2σ/log(2+t) ) ≤ 4|A| exp( −t/(log(2+t))³ ).

Proof 1 Let us define N_t(a) as the number of times action a has been chosen until time t, and N̂_t(a) = Σ_{s=1}^t π_s(a), where π_s(a) is the E3W policy at time step s. By choosing λ_s = |A|/log(1+s), it follows that for all a and t ≥ 4,

N̂_t(a) = Σ_{s=1}^t π_s(a) ≥ Σ_{s=1}^t 1/log(1+s) ≥ Σ_{s=1}^t [ 1/log(1+s) − (s/(s+1))/(log(1+s))² ]
       ≥ ∫_1^{1+t} [ 1/log(1+s) − (s/(s+1))/(log(1+s))² ] ds = (1+t)/log(2+t) − 1/log 2 ≥ t/(2 log(2+t)).

From Theorem 2.19 in (Wainwright, 2019), we have the following concentration inequality:

P( |N̂_t(a) − N_t(a)| > ε ) ≤ 2 exp{ −ε²/(2 Σ_{s=1}^t σ_s²) } ≤ 2 exp{ −2ε²/t },

where σ_s² ≤ 1/4 is the variance of a Bernoulli distribution with p = π_s(a) at time step s. We define the event

E_ε = { ∀a ∈ A, |N̂_t(a) − N_t(a)| ≤ ε },

and consequently

P( |N̂_t(a) − N_t(a)| ≥ ε ) ≤ 2|A| exp( −2ε²/t ).     (18)

Conditioned on the event E_ε, for ε = t/(4 log(2+t)), we have N_t(a) ≥ t/(4 log(2+t)). For any action a, by the definition of sub-Gaussianity,

P( |r(a) − r̂_t(a)| > √( 8σ² log(2/δ) log(2+t)/t ) ) ≤ P( |r(a) − r̂_t(a)| > √( 2σ² log(2/δ)/N_t(a) ) ) ≤ δ,

and by choosing δ such that log(2/δ) = t/(log(2+t))³, we have

P( |r(a) − r̂_t(a)| > √( 2σ² log(2/δ)/N_t(a) ) ) ≤ 2 exp( −t/(log(2+t))³ ).

Therefore, for t ≥ 4,

P( ‖r − r̂_t‖_∞ > 2σ/log(2+t) ) ≤ P( ‖r − r̂_t‖_∞ > 2σ/log(2+t) | E_ε ) + P(E_ε^c)
  ≤ Σ_a P( |r(a) − r̂_t(a)| > 2σ/log(2+t) ) + P(E_ε^c)
  ≤ 2|A| exp( −t/(log(2+t))³ ) + 2|A| exp( −t/(log(2+t))³ ) = 4|A| exp( −t/(log(2+t))³ ).

Lemma 2 Given two policies π^(1) = ∇Ω*(r^(1)) and π^(2) = ∇Ω*(r^(2)), there exists L such that

‖π^(1) − π^(2)‖_p ≤ L ‖r^(1) − r^(2)‖_p.

Proof 2 This comes directly from the fact that π = ∇Ω*(r) is Lipschitz continuous in the ℓ_p-norm. Note that p takes different values according to the choice of regularizer. Refer to (Niculae & Blondel, 2017) for a discussion of each norm for the maximum entropy and Tsallis entropy regularizers. Relative entropy shares the same properties as maximum entropy.

Lemma 3 Consider the E3W policy applied to a tree. At any node s of the tree with depth d, let us define N*_t(s, a) = π*(a|s)·t and N̂_t(s, a) = Σ_{k=1}^t π_k(a|s), where π_k(a|s) is the policy at time step k. There exist some C and Ĉ such that

P( |N̂_t(s, a) − N*_t(s, a)| > Ct/log t ) ≤ Ĉ|A|t exp{ −t/(log t)³ }.

Proof 3 We denote the following event:

E_{r_k} = { ‖r(s', ·) − r̂_k(s', ·)‖_∞ < 2σ/log(2+k) }.

Thus, conditioned on the event ∩_{k=1}^t E_{r_k} and for t ≥ 4, we bound |N̂_t(s, a) − N*_t(s, a)| as

|N̂_t(s, a) − N*_t(s, a)| ≤ Σ_{k=1}^t |π̂_k(a|s) − π*(a|s)| + Σ_{k=1}^t λ_k
  ≤ Σ_{k=1}^t ‖π̂_k(·|s) − π*(·|s)‖_∞ + Σ_{k=1}^t λ_k
  ≤ Σ_{k=1}^t ‖π̂_k(·|s) − π*(·|s)‖_p + Σ_{k=1}^t λ_k
  ≤ L Σ_{k=1}^t ‖Q̂_k(s', ·) − Q(s', ·)‖_p + Σ_{k=1}^t λ_k   (Lemma 2)
  ≤ L|A|^{1/p} Σ_{k=1}^t ‖Q̂_k(s', ·) − Q(s', ·)‖_∞ + Σ_{k=1}^t λ_k   (property of the p-norm)
  ≤ L|A|^{1/p} γ^d Σ_{k=1}^t ‖r̂_k(s'', ·) − r(s'', ·)‖_∞ + Σ_{k=1}^t λ_k   (contraction, Proposition 1)
  ≤ L|A|^{1/p} γ^d Σ_{k=1}^t 2σ/log(2+k) + Σ_{k=1}^t λ_k
  ≤ L|A|^{1/p} γ^d ∫_0^t 2σ/log(2+k) dk + ∫_0^t |A|/log(1+k) dk
  ≤ Ct/log t,

for some constant C depending on |A|, p, d, σ, L, and γ. Finally,

P( |N̂_t(s, a) − N*_t(s, a)| ≥ Ct/log t ) ≤ Σ_{i=1}^t P(E_{r_t}^c) = Σ_{i=1}^t 4|A| exp( −t/(log(2+t))³ )
  ≤ 4|A|t exp( −t/(log(2+t))³ ) = O( t exp( −t/(log t)³ ) ).

Lemma 4 Consider the E3W policy applied to a tree. At any node s of the tree, let us define N*_t(s, a) = π*(a|s)·t, and let N_t(s, a) be the number of times action a has been chosen until time step t. There exist some C and Ĉ such that

P( |N_t(s, a) − N*_t(s, a)| > Ct/log t ) ≤ Ĉt exp{ −t/(log t)³ }.

Proof 4 Based on the result of Lemma 3, we have

P( |N_t(s, a) − N*_t(s, a)| > (1 + C) t/log t )
  ≤ P( |N̂_t(s, a) − N*_t(s, a)| > Ct/log t ) + P( |N_t(s, a) − N̂_t(s, a)| > t/log t )
  ≤ 4|A|t exp{ −t/(log(2+t))³ } + 2|A| exp{ −t/(log(2+t))² }   (Lemma 3 and (18))
  ≤ O( t exp( −t/(log t)³ ) ).

Theorem 1 At the root node s of the tree, defining N(s) as the number of visitations and V_Ω(s) as the estimated value at node s, for ε > 0 we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ C exp{ −N(s)ε / (Ĉ(log(2 + N(s)))²) }.

Proof 5 We prove this concentration inequality by induction. When the depth of the tree is D = 1, from Proposition 1 we get

|V_Ω(s) − V*_Ω(s)| = ‖Ω*(Q_Ω(s, ·)) − Ω*(Q*_Ω(s, ·))‖_∞ ≤ γ ‖r̂ − r*‖_∞   (contraction),

where r̂ is the average reward and r* is the mean reward, so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ).

From Lemma 1, with ε = 2σγ/log(2 + N(s)), we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ) ≤ 4|A| exp{ −N(s) / (2σγ(log(2 + N(s)))²) } = C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Let us assume the concentration bound holds at depth D − 1. Let us define V_Ω(s_a) = Q_Ω(s, a), where s_a is the state reached by taking action a from state s. Then at depth D − 1,

P( |V_Ω(s_a) − V*_Ω(s_a)| > ε ) ≤ C exp{ −N(s_a) / (Ĉ(log(2 + N(s_a)))²) }.     (19)

Now at depth D, because of the contraction property, we have

|V_Ω(s) − V*_Ω(s)| ≤ γ ‖Q_Ω(s, ·) − Q*_Ω(s, ·)‖_∞ = γ |Q_Ω(s, a) − Q*_Ω(s, a)|   (for some action a),

so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖Q_Ω(s, a) − Q*_Ω(s, a)‖ > ε )
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s_a)))²) }
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s)))²) }.

From (19), we have lim_{t→∞} N(s_a) = ∞, because if there existed L with N(s_a) < L, we could find ε > 0 for which (19) is not satisfied. From Lemma 4, when N(s) is large enough we have N(s_a) → π*(a|s)N(s) (for example, N(s_a) > ½ π*(a|s)N(s)), which means we can find C and Ĉ that satisfy

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Lemma 5 At any node s of the tree, with N(s) the number of visitations, we define the event

E_s = { ∀a ∈ A, |N(s, a) − N*(s, a)| < N*(s, a)/2 },  where N*(s, a) = π*(a|s)N(s),

and let ε > 0 and V_Ω(s) be the estimated value at node s. We have

P( |V_Ω(s) − V*_Ω(s)| > ε | E_s ) ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Proof 6 The proof follows the same induction as for Theorem 1. When the depth of the tree is D = 1, from Proposition 1 we get

|V_Ω(s) − V*_Ω(s)| = ‖Ω*(Q_Ω(s, ·)) − Ω*(Q*_Ω(s, ·))‖ ≤ γ ‖r̂ − r*‖_∞   (contraction property),

where r̂ is the average reward and r* is the mean reward, so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ).

From Lemma 1, with ε = 2σγ/log(2 + N(s)) and given E_s, we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ) ≤ 4|A| exp{ −N(s) / (2σγ(log(2 + N(s)))²) } = C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Let us assume the concentration bound holds at depth D − 1. Let us define V_Ω(s_a) = Q_Ω(s, a), where s_a is the state reached by taking action a from state s. Then at depth D − 1,

P( |V_Ω(s_a) − V*_Ω(s_a)| > ε ) ≤ C exp{ −N(s_a) / (Ĉ(log(2 + N(s_a)))²) }.

Now at depth D, because of the contraction property and given E_s, we have

|V_Ω(s) − V*_Ω(s)| ≤ γ ‖Q_Ω(s, ·) − Q*_Ω(s, ·)‖_∞ = γ |Q_Ω(s, a) − Q*_Ω(s, a)|   (for some a),

so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖Q_Ω(s, a) − Q*_Ω(s, a)‖ > ε )
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s_a)))²) }
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s)))²) }
  ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }   (because of E_s).

Theorem 2 Let a_t be the action returned by algorithm E3W at iteration t. Then for t large enough, with some constants C, Ĉ,

P( a_t ≠ a* ) ≤ Ct exp{ −t / (Ĉσ(log t)³) }.

Proof 7 Let us define the event E_s as in Lemma 5. Let a* be the action with the largest value estimate at the root node state s. The probability that E3W selects a sub-optimal arm at s is

P( a_t ≠ a* ) ≤ Σ_a P( V_Ω(s_a) > V_Ω(s_{a*}) | E_s ) + P(E_s^c)
  = Σ_a P( (V_Ω(s_a) − V*_Ω(s_a)) − (V_Ω(s_{a*}) − V*_Ω(s_{a*})) ≥ V*_Ω(s_{a*}) − V*_Ω(s_a) | E_s ) + P(E_s^c).

Let us define Δ = V*_Ω(s_{a*}) − V*_Ω(s_a). Therefore, for Δ > 0, we have

P( a_t ≠ a* ) ≤ Σ_a P( (V_Ω(s_a) − V*_Ω(s_a)) − (V_Ω(s_{a*}) − V*_Ω(s_{a*})) ≥ Δ | E_s ) + P(E_s^c)
  ≤ Σ_a P( |V_Ω(s_a) − V*_Ω(s_a)| ≥ αΔ | E_s ) + P( |V_Ω(s_{a*}) − V*_Ω(s_{a*})| ≥ βΔ | E_s ) + P(E_s^c)
  ≤ Σ_a C_a exp{ −N(s)(αΔ) / (Ĉ_a(log(2 + N(s)))²) } + C_{a*} exp{ −N(s)(βΔ) / (Ĉ_{a*}(log(2 + N(s)))²) } + P(E_s^c),

where α + β = 1, α > 0, β > 0, and N(s) is the number of visitations of the root node s. Let us define 1/Ĉ = min{ (αΔ)/Ĉ_a, (βΔ)/Ĉ_{a*} } and C = |A| max{C_a, C_{a*}}; we have

P( a_t ≠ a* ) ≤ C exp{ −t / (Ĉσ(log(2 + t))²) } + P(E_s^c).

From Lemma 4, there exist C' and Ĉ' for which

P(E_s^c) ≤ C't exp{ −t / (Ĉ'(log t)³) },

so that

P( a_t ≠ a* ) ≤ O( t exp{ −t/(log t)³ } ).

Theorem 3 Consider an E3W policy applied to the tree. Let D_Ω*(x, y) = Ω*(x) − Ω*(y) − ∇Ω*(y)(x − y) be the Bregman divergence between x and y. The expected pseudo-regret R_n satisfies

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).

Proof 8 Without loss of generality, we can assume that V_i ∈ [−1, 0] for all i ∈ [1, |A|]. From the definition of the regret, we have

E[R_n] = nV* − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ ≤ V̂_1(0) − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ ≤ −τΩ(π̂) − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩.

By the definition of the tree policy, we can obtain

−Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ = −Σ_{t=1}^n ⟨(1 − λ_t)∇Ω*(V̂_t(·)), V(·)⟩ − Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩
  ≤ −Σ_{t=1}^n ⟨∇Ω*(V̂_t(·)), V(·)⟩ − Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩,

with

−Σ_{t=1}^n ⟨∇Ω*(V̂_t(·)), V(·)⟩
  = Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) − ⟨∇Ω*(V̂_t(·)), V(·)⟩ ] − Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) ]
  = Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) − Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) ]
  ≤ Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + n ‖V(·)‖_∞   (contraction property, Proposition 1)
  ≤ Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·))   (because V_i ≤ 0),

and

−Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩ ≤ O(n/log n)   (because Σ_{k=1}^n 1/log(k+1) → O(n/log n)).

So that

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).

We consider the generalized Tsallis entropy Ω(π) = S_α(π) = (1/(1−α)) (1 − Σ_i π^α(a_i|s)). According to (Abernethy et al., 2015), when α ∈ (0, 1),

D_Ω*(V̂_t(·) + V(·), V̂_t(·)) ≤ (τα)^{−1} |A|^α,
−Ω(π̂_n) ≤ (1/(1−α)) (|A|^{1−α} − 1).

Then, for the generalized Tsallis entropy, when α ∈ (0, 1), the regret is

E[R_n] ≤ (τ/(1−α)) (|A|^{1−α} − 1) + n(τα)^{−1}|A|^α + O(n/log n).

When α = 2, which is the Tsallis entropy case we consider, according to (Zimmert & Seldin, 2019), by Taylor's theorem there exists z ∈ conv(V̂_t, V̂_t + V) such that

D_Ω*(V̂_t(·) + V(·), V̂_t(·)) ≤ ½ ⟨V(·), ∇²Ω*(z)V(·)⟩ ≤ |K|/2.

So that, when α = 2, we have

E[R_n] ≤ τ (|A|−1)/|A| + n|K|/2 + O(n/log n).

When α = 1, which is the maximum entropy case in our paper, we derive

E[R_n] ≤ τ log|A| + n|A|/τ + O(n/log n).

Finally, when the convex regularizer is the relative entropy, one can simply write KL(π_t‖π_{t−1}) = −H(π_t) − E_{π_t}[log π_{t−1}]; letting m = min_a π_{t−1}(a|s), we have

E[R_n] ≤ τ(log|A| − 1/m) + n|A|/τ + O(n/log n).

Before deriving the next theorem, we state Theorem 2 in (Geist et al., 2019):

• Boundedness: for two constants L_Ω and U_Ω such that for all π ∈ Π we have L_Ω ≤ Ω(π) ≤ U_Ω, then

V*(s) − τ(U_Ω − L_Ω)/(1 − γ) ≤ V*_Ω(s) ≤ V*(s),     (20)

where τ is the temperature and γ is the discount factor.

Theorem 4 For any δ > 0, with probability at least 1 − δ, ε_Ω satisfies

−√( Ĉσ² log(C/δ) / (2N(s)) ) − τ(U_Ω − L_Ω)/(1 − γ) ≤ ε_Ω ≤ √( Ĉσ² log(C/δ) / (2N(s)) ).

Proof 9 From Theorem 1, let us define δ = C exp{ −2N(s)ε² / (Ĉσ²) }, so that ε = √( Ĉσ² log(C/δ) / (2N(s)) ); then for any δ > 0, we have

P( |V_Ω(s) − V*_Ω(s)| ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) ) ≥ 1 − δ.

Then, for any δ > 0, with probability at least 1 − δ, we have

|V_Ω(s) − V*_Ω(s)| ≤ √( Ĉσ² log(C/δ) / (2N(s)) ),

that is

−√( Ĉσ² log(C/δ) / (2N(s)) ) ≤ V_Ω(s) − V*_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ),
−√( Ĉσ² log(C/δ) / (2N(s)) ) + V*_Ω(s) ≤ V_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) + V*_Ω(s).

From Proposition 1, we have

−√( Ĉσ² log(C/δ) / (2N(s)) ) + V*(s) − τ(U_Ω − L_Ω)/(1 − γ) ≤ V_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) + V*(s).