Convex Regularization in Monte-Carlo Tree Search

Tuan Dam ¹   Carlo D'Eramo ¹   Jan Peters ¹ ²   Joni Pajarinen ¹ ³

Abstract

Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off the exploitation of good actions and the exploration of unvisited states, but their empirical success comes at the cost of poor sample-efficiency and high computation time. In this paper, we overcome these limitations by introducing the use of convex regularization in Monte-Carlo Tree Search (MCTS) to drive exploration efficiently and to improve policy updates. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the first regret analysis of regularized MCTS and showing that it guarantees an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and, more importantly, on the Tsallis entropy of the policy, for which we prove superior theoretical guarantees. We empirically verify the consequence of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated in AlphaGo, and we empirically show the superiority of convex regularization, w.r.t. representative baselines, on well-known RL problems across several Atari games.

1. Introduction

Monte-Carlo Tree Search (MCTS) is a well-known algorithm to solve decision-making problems through the combination of Monte-Carlo planning and an incremental tree structure (Coulom, 2006). MCTS provides a principled approach for trading off between exploration and exploitation in sequential decision making. Moreover, recent advances have shown how to enable MCTS in continuous and large problems (Silver et al., 2016; Yee et al., 2016). Most remarkably, AlphaGo (Silver et al., 2016) and AlphaZero (Silver et al., 2017a;b) couple MCTS with neural networks trained using Reinforcement Learning (RL) (Sutton & Barto, 1998) methods, e.g., Deep Q-Learning (Mnih et al., 2015), to speed up learning of large scale problems. In particular, a neural network is used to compute value function estimates of states as a replacement of time-consuming Monte-Carlo rollouts, and another neural network is used to estimate policies as a probability prior for the therein introduced PUCT action selection strategy, a variant of the well-known UCT sampling strategy commonly used in MCTS for exploration (Kocsis et al., 2006). Despite AlphaGo and AlphaZero achieving state-of-the-art performance in games with a high branching factor like Go (Silver et al., 2016) and Chess (Silver et al., 2017a), both methods suffer from poor sample-efficiency, mostly due to the polynomial convergence rate of PUCT (Xiao et al., 2019). This problem, combined with the high computational time to evaluate the deep neural networks, significantly hinders the applicability of both methodologies.

In this paper, we provide a theory of the use of convex regularization in MCTS, which proved to be an efficient solution for driving exploration and stabilizing learning in RL (Schulman et al., 2015; 2017a; Haarnoja et al., 2018; Buesing et al., 2020).
In particular, we show how a regularized objective function in MCTS can be seen as an instance of the Legendre-Fenchel transform, similar to previous findings on the use of duality in RL (Mensch & Blondel, 2018; Geist et al., 2019; Nachum & Dai, 2020a) and game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007). Establishing our theoretical framework, we derive the first regret analysis of regularized MCTS, and prove that a generic strongly convex regularizer guarantees an exponential convergence rate to the solution of the regularized objective function, which improves on the polynomial rate of PUCT.

¹ Department of Computer Science, Technische Universität Darmstadt, Germany. ² Robot Learning Group, Max Planck Institute for Intelligent Systems, Tübingen, Germany. ³ Computing Sciences, Aalto University, Finland. Correspondence to: Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen.

These results provide a theoretical ground for the use of arbitrary entropy-based regularizers in MCTS, until now limited to maximum entropy (Xiao et al., 2019). Among them, we specifically study the relative entropy of policy updates, drawing on similarities with trust-region and proximal methods in RL (Schulman et al., 2015; 2017b), and the Tsallis entropy, used for enforcing the learning of sparse policies (Lee et al., 2018). Moreover, we provide an empirical analysis of the toy problem introduced in Xiao et al. (2019) to evince the practical consequences of our theoretical results for each regularizer. Finally, we empirically evaluate the proposed operators in AlphaGo, on several Atari games, confirming the benefit of convex regularization in MCTS, and in particular the superiority of Tsallis entropy w.r.t. other regularizers.

2. Preliminaries

2.1. Markov Decision Processes

We consider the classical definition of a finite-horizon Markov Decision Process (MDP) as a 5-tuple M = ⟨S, A, R, P, γ⟩, where S is the state space, A is the finite discrete action space, R : S × A × S → R is the reward function, P : S × A → S is the transition kernel, and γ ∈ [0, 1) is the discount factor. A policy π ∈ Π : S × A → R is a probability distribution over the event of executing an action a in a state s. A policy π induces a value function corresponding to the expected cumulative discounted reward collected by the agent when executing action a in state s, and following the policy π thereafter: Q^π(s, a) ≜ E[ Σ_{k=0}^∞ γ^k r_{i+k+1} | s_i = s, a_i = a, π ], where r_{i+1} is the reward obtained after the i-th transition. An MDP is solved by finding the optimal policy π*, i.e. the policy that maximizes the expected cumulative discounted reward. The optimal policy is the one satisfying the optimal Bellman equation (Bellman, 1954)

Q*(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q*(s', a') ] ds',

and is the fixed point of the optimal Bellman operator

T*Q(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ max_{a'} Q(s', a') ] ds'.

Additionally, we define the Bellman operator under the policy π as

T_π Q(s, a) ≜ ∫_S P(s'|s, a) [ R(s, a, s') + γ ∫_A π(a'|s') Q(s', a') da' ] ds',

the optimal value function V*(s) ≜ max_{a∈A} Q*(s, a), and the value function under the policy π as V^π(s) ≜ max_{a∈A} Q^π(s, a).

2.2. Monte-Carlo Tree Search and Upper Confidence bounds for Trees

Monte-Carlo Tree Search (MCTS) is a planning strategy based on a combination of Monte-Carlo sampling and tree search to solve MDPs. MCTS builds a tree where the nodes are the visited states of the MDP, and the edges are the actions executed in each state. MCTS converges to the optimal policy (Kocsis et al., 2006; Xiao et al., 2019), iterating over a loop composed of four steps:

1. Selection: starting from the root node, a tree-policy is executed to navigate the tree until a node with unvisited children, i.e. an expandable node, is reached;
2. Expansion: the reached node is expanded according to the tree policy;
3. Simulation: run a rollout, e.g. a Monte-Carlo simulation, from the visited child of the current node to the end of the episode;
4. Backup: use the collected reward to update the action-values Q(·) of the nodes visited in the trajectory from the root node to the expanded node.

The tree-policy used to select the action to execute in each node needs to balance the use of already known good actions and the visitation of unknown states. The Upper Confidence bounds for Trees (UCT) sampling strategy (Kocsis et al., 2006) extends the well-known UCB1 sampling strategy for multi-armed bandits (Auer et al., 2002) to MCTS. Considering each node corresponding to a state s ∈ S as a different bandit problem, UCT selects an action a ∈ A applying an upper bound to the action-value function

UCT(s, a) = Q(s, a) + ε √( log N(s) / N(s, a) ),     (1)

where N(s, a) is the number of executions of action a in state s, N(s) = Σ_a N(s, a), and ε is a constant parameter to tune exploration. UCT asymptotically converges to the optimal action-value function Q*, for all states and actions, with the probability of executing a suboptimal action at the root node approaching 0 with a polynomial rate O(1/t), for a simulation budget t (Kocsis et al., 2006; Xiao et al., 2019).
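To make the selection rule in Equation (1) concrete, the following minimal Python sketch shows UCT action selection at a node; the `Node`/`child` attributes and the constant `explo_const` are illustrative assumptions, not the paper's implementation.

```python
import math

def uct_select(node, explo_const=1.0):
    """Pick the child action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).

    `node.children` is assumed to map each action to an object holding the
    running statistics `q_value` (mean return) and `visits` N(s, a).
    Unvisited actions get priority so that every child is tried once.
    """
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # force at least one visit per action
        bonus = explo_const * math.sqrt(math.log(total_visits) / child.visits)
        score = child.q_value + bonus
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```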
3. Regularized Monte-Carlo Tree Search

The success of RL methods based on entropy regularization comes from their ability to achieve state-of-the-art performance in decision making and control problems, while enjoying theoretical guarantees and ease of implementation (Haarnoja et al., 2018; Schulman et al., 2015; Lee et al., 2018). However, the use of entropy regularization in MCTS is still mostly unexplored, although its advantageous exploration and value function estimation would be desirable to reduce the detrimental effect of the high branching factor in AlphaGo and AlphaZero. To the best of our knowledge, the MENTS algorithm (Xiao et al., 2019) is the first and only method to combine MCTS and entropy regularization.

In particular, MENTS uses a maximum entropy regularizer in AlphaGo, proving an exponential convergence rate to the solution of the respective softmax objective function and achieving state-of-the-art performance in some Atari games (Bellemare et al., 2013). In the following, motivated by the success in RL and the promising results of MENTS, we derive a unified theory of regularization in MCTS based on the Legendre-Fenchel transform (Geist et al., 2019), which generalizes the use of maximum entropy in MENTS to an arbitrary convex regularizer. Notably, our theoretical framework enables us to rigorously motivate the advantages of using maximum entropy and other entropy-based regularizers, such as relative entropy or Tsallis entropy, drawing connections with their RL counterparts TRPO (Schulman et al., 2015) and Sparse DQN (Lee et al., 2018), as MENTS does with Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

3.1. Legendre-Fenchel transform

Consider an MDP M = ⟨S, A, R, P, γ⟩, as previously defined. Let Ω : Π → R be a strongly convex function. For a policy π_s = π(·|s) and Q_s = Q(s, ·) ∈ R^A, the Legendre-Fenchel transform (or convex conjugate) of Ω is Ω* : R^A → R, defined as

Ω*(Q_s) ≜ max_{π_s ∈ Π_s} T_{π_s} Q_s − τΩ(π_s),     (2)

where the temperature τ specifies the strength of regularization. Among the several properties of the Legendre-Fenchel transform, we use the following (Mensch & Blondel, 2018; Geist et al., 2019).

Proposition 1 Let Ω be strongly convex.

• Unique maximizing argument: ∇Ω* is Lipschitz and satisfies

∇Ω*(Q_s) = argmax_{π_s ∈ Π_s} T_{π_s} Q_s − τΩ(π_s).     (3)

• Boundedness: if there are constants L_Ω and U_Ω such that for all π_s ∈ Π_s we have L_Ω ≤ Ω(π_s) ≤ U_Ω, then

max_{a∈A} Q_s(a) − τU_Ω ≤ Ω*(Q_s) ≤ max_{a∈A} Q_s(a) − τL_Ω.     (4)

• Contraction: for any Q_1, Q_2 ∈ R^{S×A},

‖Ω*(Q_1) − Ω*(Q_2)‖_∞ ≤ γ ‖Q_1 − Q_2‖_∞.     (5)

Note that if Ω(·) is strongly convex, τΩ(·) is also strongly convex; thus all the properties shown in Proposition 1 still hold¹.

¹Other works use the same formula, e.g. Equation (2) in Niculae & Blondel (2017).

Solving equation (2) leads to the solution of the optimal primal policy function ∇Ω*(·). Since Ω(·) is strongly convex, the dual function Ω*(·) is also convex. One can solve the optimization problem (2) in the dual space (Nachum & Dai, 2020b) as

Ω(π_s) = max_{Q_s ∈ R^A} T_{π_s} Q_s − τΩ*(Q_s),     (6)

and find the solution of the optimal dual value function Ω*(·). Note that the Legendre-Fenchel transform of the value conjugate function is the convex function Ω, i.e. Ω** = Ω. In the next section, we leverage this primal-dual connection based on the Legendre-Fenchel transform, as both conjugate value function and policy function, to derive the regularized MCTS backup and tree policy.

3.2. Regularized backup and tree policy

In MCTS, each node of the tree represents a state s ∈ S and contains a visitation count N(s, a). Given a trajectory, we define n(s_T) as the leaf node corresponding to the reached state s_T. Let s_0, a_0, s_1, a_1, ..., s_T be the state-action trajectory in a simulation, where n(s_T) is a leaf node of the tree. Whenever a node n(s_T) is expanded, the respective action-values (Equation 7) are initialized as Q_Ω(s_T, a) = 0 and N(s_T, a) = 0 for all a ∈ A. For all nodes in the trajectory, the visitation count is updated by N(s_t, a_t) = N(s_t, a_t) + 1, and the action-values by

Q_Ω(s_t, a_t) = r(s_t, a_t) + γρ  if t = T;   Q_Ω(s_t, a_t) = r(s_t, a_t) + γΩ*(Q_Ω(s_{t+1})/τ)  if t < T,     (7)

where Q_Ω(s_{t+1}) ∈ R^A is the vector with components Q_Ω(s_{t+1}, a), ∀a ∈ A, and ρ is an estimate returned from an evaluation function computed in s_T, e.g. a discounted cumulative reward averaged over multiple rollouts, or the value function of node n(s_{T+1}) returned by a value-function approximator, e.g. a neural network pretrained with deep Q-learning (Mnih et al., 2015), as done in (Silver et al., 2016; Xiao et al., 2019).
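As an illustration of the backup in Equation (7), here is a minimal sketch instantiated with the maximum-entropy conjugate (the log-sum-exp of Table 1); the `Node` fields (`visits`, `q_omega`) and the helper `softmax_value` are our own illustrative assumptions, not the authors' code.

```python
import math

def softmax_value(q_values, tau):
    """Maximum-entropy conjugate value: tau * log sum_a exp(Q(a)/tau)."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def backup(trajectory, rho, gamma=0.99, tau=0.1):
    """Regularized backup of Equation (7) along one simulated trajectory.

    `trajectory` is a list of (node, action, reward) triples from the root to
    the expanded leaf; `rho` is the leaf evaluation (rollout return or a
    learned value estimate).
    """
    value = rho
    for node, action, reward in reversed(trajectory):
        node.visits[action] += 1
        node.q_omega[action] = reward + gamma * value
        # value propagated to the parent: softmax (log-sum-exp) of this node's Q-values
        value = softmax_value(list(node.q_omega.values()), tau)
    return value
```

With a different regularizer, only the conjugate used to compute the propagated value changes, e.g. the sparsemax-based value of Equation (11) for Tsallis entropy.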
We revisit the E2W sampling strategy, limited to maximum entropy regularization (Xiao et al., 2019), and, through the use of the convex conjugate in Equation (7), we derive a novel sampling strategy that generalizes to any convex regularizer:

π_t(a_t|s_t) = (1 − λ_{s_t}) ∇Ω*(Q_Ω(s_t)/τ)(a_t) + λ_{s_t}/|A|,     (8)

where λ_{s_t} = ε|A| / log(Σ_a N(s_t, a) + 1), with ε > 0 as an exploration parameter, and ∇Ω* depends on the measure in use (see Table 1 for maximum, relative, and Tsallis entropy). We call this sampling strategy Extended Empirical Exponential Weight (E3W) to highlight the extension of E2W from maximum entropy to a generic convex regularizer. E3W defines the connection to the duality representation using the Legendre-Fenchel transform, which is missing in E2W.
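A possible implementation of the E3W tree policy in Equation (8), here instantiated with the maximum-entropy maximizing argument (a softmax over Q/τ); the node attributes and the small tweaks noted in the comments are illustrative assumptions, not the paper's reference code.

```python
import math
import random

def softmax_policy(q_values, tau):
    """Maximizing argument of the max-entropy regularizer: softmax of Q/tau."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def e3w_sample(node, tau=0.1, epsilon=0.1):
    """Sample an action with E3W: mix grad Omega*(Q/tau) with a uniform distribution."""
    actions = list(node.q_omega.keys())
    q_values = [node.q_omega[a] for a in actions]
    n_visits = sum(node.visits[a] for a in actions)
    # lambda_{s_t} = eps*|A| / log(sum_a N + 1); we add +2 and clamp to 1 so the
    # mixing weight is well defined at the very first visits (a practical tweak)
    lam = min(1.0, epsilon * len(actions) / math.log(n_visits + 2))
    greedy = softmax_policy(q_values, tau)
    probs = [(1 - lam) * g + lam / len(actions) for g in greedy]
    return random.choices(actions, weights=probs, k=1)[0]
```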

Moreover, while the Legendre-Fenchel transform can be used to derive a theory of several state-of-the-art algorithms in RL, such as TRPO, SAC, A3C (Geist & Scherrer, 2011), our result is the first introducing the connection with MCTS.

3.3. Convergence rate to regularized objective

We show that the regularized value V_Ω can be effectively estimated at the root state s ∈ S, with the assumption that each node in the tree has a σ²-subgaussian distribution. This result extends the analysis provided in (Xiao et al., 2019), which is limited to the use of maximum entropy.

Theorem 1 At the root node s, where N(s) is the number of visitations and V_Ω(s) is the estimated value, for ε > 0 and constants C and Ĉ, we have

P(|V_Ω(s) − V*_Ω(s)| > ε) ≤ C exp{ −N(s)ε / (Ĉσ log²(2 + N(s))) },     (9)

where V_Ω(s) = Ω*(Q_s) and V*_Ω(s) = Ω*(Q*_s).

From this theorem, we obtain that the convergence rate of choosing the best action a* at the root node, when using the E3W strategy, is exponential.

Theorem 2 Let a_t be the action returned by E3W at step t. For large enough t and constants C, Ĉ,

P(a_t ≠ a*) ≤ Ct exp{ −t / (Ĉσ (log t)³) }.     (10)

This result shows that, for every strongly convex regularizer, the convergence rate of choosing the best action at the root node is exponential, as already proven in the specific case of maximum entropy (Xiao et al., 2019).

4. Entropy-regularization backup operators

From the introduction of a unified view of generic strongly convex regularizers as backup operators in MCTS, we narrow the analysis to entropy-based regularizers. For each entropy function, Table 1 shows the Legendre-Fenchel transform and the maximizing argument, which can be respectively replaced in our backup operation (Equation 7) and sampling strategy E3W (Equation 8). Using maximum entropy retrieves the maximum entropy MCTS problem introduced in the MENTS algorithm (Xiao et al., 2019). This approach closely resembles the maximum entropy RL framework used to encourage exploration (Haarnoja et al., 2018; Schulman et al., 2017a). We introduce two novel MCTS algorithms based, respectively, on the minimization of the relative entropy of the policy update, inspired by trust-region (Schulman et al., 2015) and proximal optimization methods (Schulman et al., 2017b) in RL, and on the maximization of Tsallis entropy, which has been more recently introduced in RL as an effective solution to enforce the learning of sparse policies (Lee et al., 2018). We call these algorithms RENTS and TENTS. Contrary to maximum and relative entropy, the definition of the Legendre-Fenchel transform and maximizing argument of Tsallis entropy is non-trivial, being

Ω*(Q_t) = τ · spmax(Q_t(s, ·)/τ),     (11)

∇Ω*(Q_t) = max{ Q_t(s, a)/τ − (Σ_{a∈K} Q_t(s, a)/τ − 1)/|K|, 0 },     (12)

where spmax is defined for any function f : S × A → R as

spmax(f(s, ·)) ≜ Σ_{a∈K} ( f(s, a)²/2 − (Σ_{a∈K} f(s, a) − 1)²/(2|K|²) ) + 1/2,     (13)

and K is the set of actions satisfying 1 + i·f(s, a_i) > Σ_{j=1}^i f(s, a_j), with a_i indicating the action with the i-th largest value of f(s, a) (Lee et al., 2018).

4.1. Regret analysis

At the root node, let each child node i be assigned a random variable X_i, with mean value V_i, while the quantities related to the optimal branch are denoted by *, e.g. the mean value V*. At each timestep n, the mean value of variable X_{i_n} is V_{i_n}. The pseudo-regret (Coquelin & Munos, 2007) at the root node, at timestep n, is defined as R_n^{UCT} = nV* − Σ_{t=1}^n V_{i_t}. Similarly, we define the regret of E3W at the root node of the tree as

R_n = nV* − Σ_{t=1}^n V_{i_t} = nV* − Σ_{t=1}^n Σ_i I(i_t = i) V_i = nV* − Σ_i V_i Σ_{t=1}^n π̂_t(a_i|s),     (14)

where π̂_t(·) is the policy at time step t, and I(·) is the indicator function. The expected regret is defined as

E[R_n] = nV* − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩.     (15)

Table 1. List of entropy regularizers with Legendre-Fenchel transforms and maximizing arguments.

Entropy     Regularizer Ω(π_s)                    Legendre-Fenchel Ω*(Q_s)                     Max argument ∇Ω*(Q_s)

Maximum     Σ_a π(a|s) log π(a|s)                 τ log Σ_a e^{Q(s,a)/τ}                       e^{Q(s,a)/τ} / Σ_b e^{Q(s,b)/τ}
Relative    D_KL(π_t(a|s) ‖ π_{t−1}(a|s))         τ log Σ_a π_{t−1}(a|s) e^{Q_t(s,a)/τ}        π_{t−1}(a|s) e^{Q_t(s,a)/τ} / Σ_b π_{t−1}(b|s) e^{Q_t(s,b)/τ}
Tsallis     ½ (‖π(a|s)‖²₂ − 1)                    Equation (11)                                Equation (12)
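The maximizing arguments in Table 1 can be computed in a few lines; the sketch below follows the closed forms above (softmax, prior-weighted softmax, and the sparsemax projection of Equation (12)) and is an illustrative implementation under those formulas, not the authors' code.

```python
import numpy as np

def max_entropy_argmax(q, tau):
    """Softmax of Q/tau (maximum entropy row of Table 1)."""
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

def relative_entropy_argmax(q, prior, tau):
    """Prior-weighted softmax (relative entropy row of Table 1)."""
    z = prior * np.exp((q - q.max()) / tau)
    return z / z.sum()

def tsallis_argmax(q, tau):
    """Sparsemax projection of Q/tau (Equation (12)); K is the support set."""
    v = np.sort(q / tau)[::-1]            # values sorted in decreasing order
    cumsum = np.cumsum(v)
    k = np.arange(1, len(v) + 1)
    support = 1 + k * v > cumsum          # condition defining the set K
    k_size = support.sum()
    threshold = (cumsum[k_size - 1] - 1) / k_size
    return np.maximum(q / tau - threshold, 0.0)
```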

Theorem 3 Consider an E3W policy applied to the tree. Let D_Ω*(x, y) = Ω*(x) − Ω*(y) − ∇Ω*(y)(x − y) be the Bregman divergence between x and y. The expected pseudo-regret R_n satisfies

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).     (16)

This theorem bounds the regret of E3W for a generic convex regularizer Ω; the regret bounds for each entropy regularizer can be easily derived from it. Let m = min_a ∇Ω*(a|s).

Corollary 1 Maximum entropy regret: E[R_n] ≤ τ log|A| + n|A|/τ + O(n/log n).

Corollary 2 Relative entropy regret: E[R_n] ≤ τ(log|A| − 1/m) + n|A|/τ + O(n/log n).

Corollary 3 Tsallis entropy regret: E[R_n] ≤ τ (|A|−1)/|A| + n|K|/2 + O(n/log n).

Remarks. The regret bound of UCT and its variance have already been analyzed for non-regularized MCTS with a binary tree (Coquelin & Munos, 2007). On the contrary, our regret bound analysis in Theorem 3 applies to generic regularized MCTS. From the specialized bounds in the corollaries, we observe that maximum and relative entropy share similar results, although the bounds for relative entropy are slightly smaller due to 1/m. Remarkably, the bounds for Tsallis entropy become tighter for an increasing number of actions, which translates into limited regret in problems with a high branching factor. This result establishes the advantage of Tsallis entropy in complex problems w.r.t. other entropy regularizers, as empirically confirmed in Section 5.

4.2. Error analysis

We analyse the error of the regularized value estimate at the root node n(s) w.r.t. the optimal value: ε_Ω = V_Ω(s) − V*(s).

Theorem 4 For any δ > 0 and generic convex regularizer Ω, with some constants C, Ĉ, with probability at least 1 − δ, ε_Ω satisfies

−√( Ĉσ² log(C/δ) / (2N(s)) ) − τ(U_Ω − L_Ω)/(1 − γ) ≤ ε_Ω ≤ √( Ĉσ² log(C/δ) / (2N(s)) ).     (17)

To the best of our knowledge, this theorem provides the first result on the error analysis of value estimation at the root node of convex regularization in MCTS. To give a better understanding of the effect of each entropy regularizer in Table 1, we specialize the bound in Equation (17) to each of them. From (Lee et al., 2018), we know that for maximum entropy Ω(π_t) = Σ_a π_t log π_t we have −log|A| ≤ Ω(π_t) ≤ 0; for relative entropy Ω(π_t) = KL(π_t‖π_{t−1}), if we define m = min_a π_{t−1}(a|s), then we can derive 0 ≤ Ω(π_t) ≤ −log|A| + log(1/m); and for Tsallis entropy Ω(π_t) = ½(‖π_t‖²₂ − 1), we have −(|A|−1)/(2|A|) ≤ Ω(π_t) ≤ 0. Then, defining Ψ = √( Ĉσ² log(C/δ) / (2N(s)) ),

Corollary 4 Maximum entropy error: −Ψ − τ log|A| / (1 − γ) ≤ ε_Ω ≤ Ψ.

Corollary 5 Relative entropy error: −Ψ − τ(log|A| − log(1/m)) / (1 − γ) ≤ ε_Ω ≤ Ψ.

Corollary 6 Tsallis entropy error: −Ψ − ((|A|−1)/(2|A|)) · τ/(1 − γ) ≤ ε_Ω ≤ Ψ.

These results show that when the number of actions |A| is large, TENTS enjoys the smallest error; moreover, we also see that the lower bound of RENTS is always smaller than that of MENTS.
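To make the comparison between the corollaries concrete, the following back-of-the-envelope calculation (ours, not from the paper) instantiates the regret constants of Corollaries 1 and 3 for an Atari-sized action space of |A| = 18:

```latex
\[
\underbrace{\tau \log |A|}_{\text{maximum entropy (Cor. 1)}} = \tau \log 18 \approx 2.89\,\tau,
\qquad
\underbrace{\tau \tfrac{|A|-1}{|A|}}_{\text{Tsallis entropy (Cor. 3)}} = \tau \tfrac{17}{18} \approx 0.94\,\tau .
\]
% The Tsallis constant stays below tau for any |A|, whereas the maximum-entropy
% constant grows logarithmically with the number of actions.
```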

5. Empirical evaluation

In this section, we empirically evaluate the benefit of the proposed entropy-based MCTS regularizers. First, we complement our theoretical analysis with an empirical study of the synthetic tree toy problem introduced in Xiao et al. (2019), which serves as a simple scenario to give an interpretable demonstration of the effects of our theoretical results in practice. Second, we compare to AlphaGo (Silver et al., 2016), recently introduced to enable MCTS to solve large scale problems with a high branching factor. Our implementation is a simplified version of the original algorithm, where we remove various tricks in favor of better interpretability. For the same reason, we do not compare with the most recent and state-of-the-art MuZero (Schrittwieser et al., 2019), as this is a slightly different solution, highly tuned to maximize performance, and a detailed description of its implementation is not available.

Figure 1. For each algorithm (UCT, MENTS, RENTS, TENTS) and for tree configurations (k = 16, d = 1), (k = 4, d = 2), (k = 8, d = 3), (k = 12, d = 4), (k = 16, d = 5), we show, as a function of the number of simulations, the convergence of the value estimate at the root node to the respective optimal value (top), to the UCT optimal value (middle), and the regret (bottom).

5.1. Synthetic tree

This toy problem is introduced in Xiao et al. (2019) to highlight the improvement of MENTS over UCT. It consists of a tree with branching factor k and depth d. Each edge of the tree is assigned a random value between 0 and 1. At each leaf, a Gaussian distribution is used as an evaluation function resembling the return of random rollouts. The mean of the Gaussian distribution is the sum of the values assigned to the edges connecting the root node to the leaf, while the standard deviation is σ = 0.05.² For stability, all the means are normalized between 0 and 1. As in Xiao et al. (2019), we create 5 trees, on each of which we perform 5 different runs, resulting in 25 experiments, for all the combinations of branching factor k = {2, 4, 6, 8, 10, 12, 14, 16} and depth d = {1, 2, 3, 4, 5}, computing: (i) the value estimation error at the root node w.r.t. the regularized optimal value, ε_Ω = V_Ω − V*_Ω; (ii) the value estimation error at the root node w.r.t. the unregularized optimal value, ε_UCT = V_Ω − V*_UCT; (iii) the regret R as in Equation (14). For a fair comparison, we use fixed τ = 0.1 and ε = 0.1 across all algorithms.

²The value of the standard deviation is not provided in Xiao et al. (2019). After trying different values, we observed that our results match the ones in Xiao et al. (2019) when using σ = 0.05.

Figures 1 and 2 show how UCT and each regularizer behave for different configurations of the tree. We observe that, while RENTS and MENTS converge slower for increasing tree sizes, TENTS is robust w.r.t. the size of the tree and almost always converges faster than all other methods to the respective optimal value. Notably, the optimal value of TENTS seems to be very close to the one of UCT, i.e. the optimal value of the unregularized objective, and also converges faster than the one estimated by UCT, while MENTS and RENTS are considerably further from this value. In terms of regret, UCT explores less than the regularized methods and is less prone to high regret, at the cost of slower convergence time. Nevertheless, the regret of TENTS is the smallest among the regularizers, which otherwise seem to explore too much. These results show a general superiority of TENTS in this toy problem, also confirming our theoretical findings about the advantage of TENTS in terms of approximation error (Corollary 6) and regret (Corollary 3) in problems with many actions.
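A minimal sketch of the synthetic-tree environment described above (random edge values in [0, 1], Gaussian leaf evaluations with mean equal to the sum of edge values, σ = 0.05, and normalized leaf means); the class and method names are our own illustrative choices, not the reference implementation.

```python
import random

class SyntheticTree:
    """k-ary tree of depth d with a noisy Gaussian evaluation at each leaf."""

    def __init__(self, k, d, sigma=0.05, seed=0):
        self.k, self.d, self.sigma = k, d, sigma
        rng = random.Random(seed)
        self.edge_values = {}               # one random value per (path-prefix,)
        self._build(rng, path=(), depth=0)
        # normalize leaf means to [0, 1] for stability, as in the paper
        leaf_means = [self._mean(p) for p in self._leaf_paths()]
        self._lo, self._hi = min(leaf_means), max(leaf_means)

    def _build(self, rng, path, depth):
        if depth == self.d:
            return
        for a in range(self.k):
            self.edge_values[path + (a,)] = rng.random()
            self._build(rng, path + (a,), depth + 1)

    def _leaf_paths(self, path=()):
        if len(path) == self.d:
            yield path
            return
        for a in range(self.k):
            yield from self._leaf_paths(path + (a,))

    def _mean(self, path):
        return sum(self.edge_values[path[:i + 1]] for i in range(len(path)))

    def evaluate(self, path):
        """Noisy rollout return at the leaf reached by `path` (a tuple of actions)."""
        mean = (self._mean(path) - self._lo) / (self._hi - self._lo + 1e-12)
        return random.gauss(mean, self.sigma)
```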


Figure 2. For different branching factors k (rows) and depths d (columns), the heatmaps show: the absolute error of the value estimate at the root node after the last simulation of each algorithm w.r.t. the respective optimal value (a), and w.r.t. the optimal value of UCT (b); the regret at the root node (c).

5.2. Entropy-regularized AlphaGo

The learning time of AlphaZero can be slow in problems with a high branching factor, due to the need of a large number of MCTS simulations for obtaining good estimates of the randomly initialized action-values. To overcome this problem, AlphaGo (Silver et al., 2016) initializes the action-values using the values retrieved from a pretrained network, which is kept fixed during the training.

Atari. Atari 2600 (Bellemare et al., 2013) is a popular benchmark for testing deep RL methodologies (Mnih et al., 2015; Van Hasselt et al., 2016; Bellemare et al., 2017) but is still relatively disregarded in MCTS. We use a Deep Q-Network, pretrained using the same experimental setting of Mnih et al. (2015), to initialize the action-value function of each node after expansion as Q_init(s, a) = (Q(s, a) − V(s))/τ, for MENTS and TENTS, as done in Xiao et al. (2019). For RENTS we initialize Q_init(s, a) = log P_prior(a|s) + (Q(s, a) − V(s))/τ, where P_prior is the Boltzmann distribution induced by the action-values Q(s, ·) computed from the network. Each experimental run consists of 512 MCTS simulations. The temperature τ is optimized for each algorithm and game via grid-search between 0.01 and 1. The discount factor is γ = 0.99, and for PUCT the exploration constant is c = 0.1. Table 2 shows the performance, in terms of cumulative reward, of standard AlphaGo with PUCT and of our three regularized versions, on 22 Atari games. Moreover, we also test AlphaGo using the MaxMCTS backup (Khandelwal et al., 2016) for further comparison with classic baselines. We observe that regularized MCTS dominates the other baselines; in particular, TENTS achieves the highest scores in all the 22 games, showing that sparse policies are more effective in Atari. TENTS significantly outperforms the other methods in the games with many actions, e.g. Asteroids and Phoenix, confirming the results obtained in the synthetic tree experiment and explained by Corollaries 3 and 6 on the benefit of TENTS in problems with a high branching factor.
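The node initialization described above can be sketched as follows; `dqn_q_values` stands in for the pretrained Deep Q-Network outputs, and the choices V(s) = max_a Q(s, a) and a temperature-1 Boltzmann prior are our assumptions for illustration.

```python
import numpy as np

def init_node_values(dqn_q_values, tau=0.1, use_prior=False):
    """Initialize Q_init(s, a) for a newly expanded node from pretrained DQN values.

    MENTS/TENTS: Q_init = (Q - V) / tau, with V = max_a Q(s, a) (assumed here).
    RENTS:       Q_init = log P_prior + (Q - V) / tau, with P_prior the Boltzmann
                 distribution induced by the DQN action-values.
    """
    q = np.asarray(dqn_q_values, dtype=float)
    v = q.max()
    q_init = (q - v) / tau
    if use_prior:  # RENTS
        logits = q - v
        # temperature-1 softmax; the paper does not specify the prior temperature
        log_prior = logits - np.log(np.exp(logits).sum())
        q_init = log_prior + (q - v) / tau
    return q_init
```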
Table 2. Average score in Atari over 100 seeds per game. The bottom row reports the number of games in which each method shows no statistically significant difference to the highest mean (t-test, p < 0.05).

Game            UCT         MaxMCTS     MENTS       RENTS       TENTS
Alien           1,486.80    1,461.10    1,508.60    1,547.80    1,568.60
Amidar          115.62      124.92      123.30      125.58      121.84
Asterix         4,855.00    5,484.50    5,576.00    5,743.50    5,647.00
Asteroids       873.40      899.60      1,414.70    1,486.40    1,642.10
Atlantis        35,182.00   35,720.00   36,277.00   35,314.00   35,756.00
BankHeist       475.50      458.60      622.30      636.70      631.40
BeamRider       2,616.72    2,661.30    2,822.18    2,558.94    2,804.88
Breakout        303.04      296.14      309.03      300.35      316.68
Centipede       1,782.18    1,728.69    2,012.86    2,253.42    2,258.89
DemonAttack     579.90      640.80      1,044.50    1,124.70    1,113.30
Enduro          129.28      124.20      128.79      134.88      132.05
Frostbite       1,244.00    1,332.10    2,388.20    2,369.80    2,260.60
Gopher          3,348.40    3,303.00    3,536.40    3,372.80    3,447.80
Hero            3,009.95    3,010.55    3,044.55    3,077.20    3,074.00
MsPacman        1,940.20    1,907.10    2,018.30    2,190.30    2,094.40
Phoenix         2,747.30    2,626.60    3,098.30    2,582.30    3,975.30
Qbert           7,987.25    8,033.50    8,051.25    8,254.00    8,437.75
Robotank        11.43       11.00       11.59       11.51       11.47
Seaquest        3,276.40    3,217.20    3,312.40    3,345.20    3,324.40
Solaris         895.00      923.20      1,118.20    1,115.00    1,127.60
SpaceInvaders   778.45      835.90      832.55      867.35      822.95
WizardOfWor     685.00      666.00      1,211.00    1,241.00    1,231.00
# Highest mean  6/22        7/22        17/22       16/22       22/22

6. Related Work

Entropy regularization is a common tool for controlling exploration in Reinforcement Learning (RL) and has led to several successful methods (Schulman et al., 2015; Haarnoja et al., 2018; Schulman et al., 2017a; Mnih et al., 2016). Typically, specific forms of entropy are utilized, such as maximum entropy (Haarnoja et al., 2018) or relative entropy (Schulman et al., 2015). This approach is an instance of the more generic duality framework, commonly used in convex optimization theory. Duality has been extensively studied in game theory (Shalev-Shwartz & Singer, 2006; Pavel, 2007) and more recently in RL, for instance considering mirror descent optimization (Montgomery & Levine, 2016; Mei et al., 2019), drawing the connection between MCTS and regularized policy optimization (Grill et al., 2020), or formalizing the RL objective via Legendre-Rockafellar duality (Nachum & Dai, 2020a). Recently, Geist et al. (2019) introduced regularized Markov Decision Processes, formalizing the RL objective with a generalized form of convex regularization, based on the Legendre-Fenchel transform. In this paper, we provide a novel study of convex regularization in MCTS, and derive relative entropy (KL-divergence) and Tsallis entropy regularized MCTS algorithms, i.e. RENTS and TENTS respectively. Note that the recent maximum entropy MCTS algorithm MENTS (Xiao et al., 2019) is a special case of our generalized regularized MCTS. Unlike MENTS, RENTS can take advantage of any action distribution prior; in the experiments, the prior is derived using Deep Q-learning (Mnih et al., 2015). On the other hand, TENTS allows for sparse action exploration and thus higher dimensional action spaces compared to MENTS.

Several works focus on modifying classical MCTS to improve exploration. UCB1-tuned (Auer et al., 2002) modifies the upper confidence bound of UCB1 to account for variance in order to improve exploration. Tesauro et al. (2012) propose a Bayesian version of UCT, which obtains better estimates of node values and uncertainties given limited experience. Many heuristic approaches based on specific domain knowledge have been proposed, such as adding a bonus term to value estimates (Gelly & Wang, 2006; Teytaud & Teytaud, 2010; Childs et al., 2008; Kozelek, 2009; Chaslot et al., 2008) or prior knowledge collected during policy search (Gelly & Silver, 2007; Helmbold & Parker-Wood, 2009; Lorentz, 2010; Tom, 2010; Hoock et al., 2010). Khandelwal et al. (2016) formalize and analyze different on-policy and off-policy complex backup approaches for MCTS planning based on RL techniques. Vodopivec et al. (2017) propose an approach called SARSA-UCT, which performs the backups using SARSA (Rummery, 1995). Both Khandelwal et al. (2016) and Vodopivec et al. (2017) directly borrow value backup ideas from RL to estimate the value at each tree node, but they do not provide any proof of convergence.

7. Conclusion

We introduced a theory of convex regularization in Monte-Carlo Tree Search (MCTS) based on the Legendre-Fenchel transform. We proved that a generic strongly convex regularizer has an exponential convergence rate for the selection of the optimal action at the root node. Our result gives theoretical motivations to previous results specific to maximum entropy regularization.
Furthermore, we provided the first study of the regret of MCTS when using a generic strongly convex regularizer, and an analysis of the error between the regularized value estimate at the root node and the optimal regularized value. We use these results to motivate the use of entropy regularization in MCTS, considering maximum, relative, and Tsallis entropy, and we specialized our regret and approximation error bounds to each entropy regularizer. We tested our regularized MCTS algorithm in a simple toy problem, where we give empirical evidence of the effect of our theoretical bounds for the regret and approximation error. Finally, we introduced the use of convex regularization in AlphaGo, and carried out experiments on several Atari games. Overall, our empirical results show the advantages of convex regularization, and in particular the superiority of Tsallis entropy w.r.t. other entropy regularizers.

References

Abernethy, J., Lee, C., and Tewari, A. Fighting bandits with a new kind of smoothness. arXiv preprint arXiv:1512.04152, 2015.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 449–458. JMLR.org, 2017.

Bellman, R. The theory of dynamic programming. Technical report, Rand Corporation, Santa Monica, CA, 1954.

Buesing, L., Heess, N., and Weber, T. Approximate inference in discrete distributions with monte carlo tree search and value functions. In International Conference on Artificial Intelligence and Statistics, pp. 624–634. PMLR, 2020.

Chaslot, G., Winands, M., Herik, J. V. D., Uiterwijk, J., and Bouzy, B. Progressive strategies for monte-carlo tree search. New Mathematics and Natural Computation, 4(03):343–357, 2008.

Childs, B. E., Brodeur, J. H., and Kocsis, L. Transpositions and move groups in monte carlo tree search. In 2008 IEEE Symposium On Computational Intelligence and Games. IEEE, 2008.

Coquelin, P.-A. and Munos, R. Bandit algorithms for tree search. arXiv preprint cs/0703062, 2007.

Coulom, R. Efficient selectivity and backup operators in monte-carlo tree search. In International Conference on Computers and Games, pp. 72–83. Springer, 2006.

Geist, M. and Scherrer, B. L1-penalized projected bellman residual. In Proceedings of the European Workshop on Reinforcement Learning (EWRL 2011), Lecture Notes in Computer Science (LNCS). Springer Verlag, Heidelberg, September 2011.

Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized markov decision processes. In International Conference on Machine Learning, pp. 2160–2169, 2019.

Gelly, S. and Silver, D. Combining online and offline knowledge in uct. In Proceedings of the 24th International Conference on Machine Learning, pp. 273–280. ACM, 2007.

Gelly, S. and Wang, Y. Exploration exploitation in go: Uct for monte-carlo go. In NIPS On-line Trading of Exploration and Exploitation Workshop, 2006.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-carlo tree search as regularized policy optimization. arXiv preprint arXiv:2007.12509, 2020.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.

Helmbold, D. P. and Parker-Wood, A. All-moves-as-first heuristics in monte-carlo go. In IC-AI, pp. 605–610, 2009.

Hoock, J.-B., Lee, C.-S., Rimmel, A., Teytaud, F., Wang, M.-H., and Teytaud, O. Intelligent agents for the game of go. IEEE Computational Intelligence Magazine, 2010.

Khandelwal, P., Liebman, E., Niekum, S., and Stone, P. On the analysis of complex backup strategies in monte carlo tree search. In International Conference on Machine Learning, 2016.

Kocsis, L., Szepesvári, C., and Willemson, J. Improved monte-carlo search, 2006.

Kozelek, T. Methods of mcts and the game, 2009.

Lee, K., Choi, S., and Oh, S. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, 2018.

Lorentz, R. J. Improving monte-carlo tree search in havannah. In International Conference on Computers and Games, pp. 105–115. Springer, 2010.

Mei, J., Xiao, C., Huang, R., Schuurmans, D., and Müller, M. On principled entropy exploration in policy optimization. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3130–3136. AAAI Press, 2019.

Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In International Conference on Machine Learning, pp. 3462–3471, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Montgomery, W. H. and Levine, S. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pp. 4008–4016, 2016.

Nachum, O. and Dai, B. Reinforcement learning via fenchel-rockafellar duality. CoRR, abs/2001.01866, 2020a.

Nachum, O. and Dai, B. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020b.

Niculae, V. and Blondel, M. A regularized framework for sparse and structured neural attention. arXiv preprint arXiv:1705.07704, 2017.

Pavel, L. An extension of duality to a game-theoretic framework. Automatica, 43(2):226–237, 2007.

Rummery, G. A. Problem solving with reinforcement learning. PhD thesis, University of Cambridge, 1995.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering atari, go, chess and shogi by planning with a learned model, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017a.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

Shalev-Shwartz, S. and Singer, Y. Convex repeated games and fenchel duality. Advances in Neural Information Processing Systems, 19:1265–1272, 2006.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017b.

Sutton, R. S. and Barto, A. G. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

Tesauro, G., Rajan, V., and Segal, R. Bayesian inference in monte-carlo tree search. arXiv preprint arXiv:1203.3519, 2012.

Teytaud, F. and Teytaud, O. On the huge benefit of decisive moves in monte-carlo tree search algorithms. In Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, pp. 359–364. IEEE, 2010.

Tom, D. Investigating uct and rave: Steps towards a more robust method, 2010.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Vodopivec, T., Samothrakis, S., and Šter, B. On monte carlo tree search and reinforcement learning. Journal of Artificial Intelligence Research, 60:881–936, 2017.

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Xiao, C., Huang, R., Mei, J., Schuurmans, D., and Müller, M. Maximum entropy monte-carlo planning. In Advances in Neural Information Processing Systems, pp. 9516–9524, 2019.

Yee, T., Lisý, V., Bowling, M. H., and Kambhampati, S. Monte carlo tree search in continuous action spaces with execution uncertainty. In IJCAI, pp. 690–697, 2016.

Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 467–475. PMLR, 2019.

A. Proofs

In this section, we describe how to derive the theoretical results presented in the paper. First, the exponential convergence rate of the estimated value function to the conjugate regularized value function at the root node (Theorem 1) is derived by induction with respect to the depth D of the tree. When D = 1, we derive the concentration of the average reward at the leaf node with respect to the ∞-norm (as shown in Lemma 1) based on the result of Theorem 2.19 in (Wainwright, 2019), and the induction is done over the tree by additionally exploiting the contraction property of the convex regularized value function. Second, based on Theorem 1, we prove the exponential convergence rate of choosing the best action at the root node (Theorem 2). Third, the pseudo-regret analysis of E3W is derived based on the Bregman divergence properties and the contraction properties of the Legendre-Fenchel transform (Proposition 1). Finally, the bias error of the estimated value at the root node is derived from the results of Theorem 1 and the boundedness property of the Legendre-Fenchel transform (Proposition 1). Let r̂ and r be respectively the average and the expected reward at the leaf node, and let the reward distribution at the leaf node be σ²-sub-Gaussian.

Lemma 1 For the stochastic bandit problem, E3W guarantees that, for t ≥ 4,

P( ‖r − r̂_t‖_∞ ≥ 2σ/log(2+t) ) ≤ 4|A| exp( −t/(log(2+t))³ ).

Proof 1 Let us define N_t(a) as the number of times action a has been chosen until time t, and N̂_t(a) = Σ_{s=1}^t π_s(a), where π_s(a) is the E3W policy at time step s. By choosing λ_s = |A|/log(1+s), it follows that for all a and t ≥ 4,

N̂_t(a) = Σ_{s=1}^t π_s(a) ≥ Σ_{s=1}^t 1/log(1+s) ≥ Σ_{s=1}^t [ 1/log(1+s) − (s/(s+1))/(log(1+s))² ]
       ≥ ∫_1^{1+t} [ 1/log(1+s) − (s/(s+1))/(log(1+s))² ] ds = (1+t)/log(2+t) − 1/log 2 ≥ t/(2 log(2+t)).

From Theorem 2.19 in (Wainwright, 2019), we have the following concentration inequality:

P( |N̂_t(a) − N_t(a)| > ε ) ≤ 2 exp{ −ε²/(2 Σ_{s=1}^t σ_s²) } ≤ 2 exp{ −2ε²/t },

where σ_s² ≤ 1/4 is the variance of a Bernoulli distribution with p = π_s(a) at time step s. We define the event

E_ε = { ∀a ∈ A, |N̂_t(a) − N_t(a)| ≤ ε },

and consequently

P( |N̂_t(a) − N_t(a)| ≥ ε ) ≤ 2|A| exp( −2ε²/t ).     (18)

Conditioned on the event E_ε, for ε = t/(4 log(2+t)), we have N_t(a) ≥ t/(4 log(2+t)). For any action a, by the definition of sub-Gaussianity,

P( |r(a) − r̂_t(a)| > √( 8σ² log(2/δ) log(2+t)/t ) ) ≤ P( |r(a) − r̂_t(a)| > √( 2σ² log(2/δ)/N_t(a) ) ) ≤ δ,

and by choosing δ such that log(2/δ) = t/(log(2+t))³, we have

P( |r(a) − r̂_t(a)| > √( 2σ² log(2/δ)/N_t(a) ) ) ≤ 2 exp( −t/(log(2+t))³ ).

Therefore, for t ≥ 4,

P( ‖r − r̂_t‖_∞ > 2σ/log(2+t) ) ≤ P( ‖r − r̂_t‖_∞ > 2σ/log(2+t) | E_ε ) + P(E_ε^c)
  ≤ Σ_a P( |r(a) − r̂_t(a)| > 2σ/log(2+t) ) + P(E_ε^c)
  ≤ 2|A| exp( −t/(log(2+t))³ ) + 2|A| exp( −t/(log(2+t))³ ) = 4|A| exp( −t/(log(2+t))³ ).

Lemma 2 Given two policies π^(1) = ∇Ω*(r^(1)) and π^(2) = ∇Ω*(r^(2)), there exists L such that

‖π^(1) − π^(2)‖_p ≤ L ‖r^(1) − r^(2)‖_p.

Proof 2 This comes directly from the fact that π = ∇Ω*(r) is Lipschitz continuous in the ℓ_p-norm. Note that p takes different values according to the choice of regularizer. Refer to (Niculae & Blondel, 2017) for a discussion of each norm for the maximum entropy and Tsallis entropy regularizers. Relative entropy shares the same properties as maximum entropy.

Lemma 3 Consider the E3W policy applied to a tree. At any node s of the tree with depth d, let us define N*_t(s, a) = π*(a|s)·t and N̂_t(s, a) = Σ_{k=1}^t π_k(a|s), where π_k(a|s) is the policy at time step k. There exist some C and Ĉ such that

P( |N̂_t(s, a) − N*_t(s, a)| > Ct/log t ) ≤ Ĉ|A|t exp{ −t/(log t)³ }.

Proof 3 We denote the following event:

E_{r_k} = { ‖r(s', ·) − r̂_k(s', ·)‖_∞ < 2σ/log(2+k) }.

Thus, conditioned on the event ∩_{k=1}^t E_{r_k} and for t ≥ 4, we bound |N̂_t(s, a) − N*_t(s, a)| as

|N̂_t(s, a) − N*_t(s, a)| ≤ Σ_{k=1}^t |π̂_k(a|s) − π*(a|s)| + Σ_{k=1}^t λ_k
  ≤ Σ_{k=1}^t ‖π̂_k(·|s) − π*(·|s)‖_∞ + Σ_{k=1}^t λ_k
  ≤ Σ_{k=1}^t ‖π̂_k(·|s) − π*(·|s)‖_p + Σ_{k=1}^t λ_k
  ≤ L Σ_{k=1}^t ‖Q̂_k(s', ·) − Q(s', ·)‖_p + Σ_{k=1}^t λ_k   (Lemma 2)
  ≤ L|A|^{1/p} Σ_{k=1}^t ‖Q̂_k(s', ·) − Q(s', ·)‖_∞ + Σ_{k=1}^t λ_k   (property of the p-norm)
  ≤ L|A|^{1/p} γ^d Σ_{k=1}^t ‖r̂_k(s'', ·) − r(s'', ·)‖_∞ + Σ_{k=1}^t λ_k   (contraction, Proposition 1)
  ≤ L|A|^{1/p} γ^d Σ_{k=1}^t 2σ/log(2+k) + Σ_{k=1}^t λ_k
  ≤ L|A|^{1/p} γ^d ∫_0^t 2σ/log(2+k) dk + ∫_0^t |A|/log(1+k) dk
  ≤ Ct/log t,

for some constant C depending on |A|, p, d, σ, L, and γ. Finally,

P( |N̂_t(s, a) − N*_t(s, a)| ≥ Ct/log t ) ≤ Σ_{i=1}^t P(E_{r_t}^c) = Σ_{i=1}^t 4|A| exp( −t/(log(2+t))³ )
  ≤ 4|A|t exp( −t/(log(2+t))³ ) = O( t exp( −t/(log t)³ ) ).

Lemma 4 Consider the E3W policy applied to a tree. At any node s of the tree, let us define N*_t(s, a) = π*(a|s)·t, and let N_t(s, a) be the number of times action a has been chosen until time step t. There exist some C and Ĉ such that

P( |N_t(s, a) − N*_t(s, a)| > Ct/log t ) ≤ Ĉt exp{ −t/(log t)³ }.

Proof 4 Based on the result of Lemma 3, we have

P( |N_t(s, a) − N*_t(s, a)| > (1 + C) t/log t )
  ≤ P( |N̂_t(s, a) − N*_t(s, a)| > Ct/log t ) + P( |N_t(s, a) − N̂_t(s, a)| > t/log t )
  ≤ 4|A|t exp{ −t/(log(2+t))³ } + 2|A| exp{ −t/(log(2+t))² }   (Lemma 3 and (18))
  ≤ O( t exp( −t/(log t)³ ) ).

Theorem 1 At the root node s of the tree, defining N(s) as the number of visitations and V_Ω(s) as the estimated value at node s, for ε > 0 we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ C exp{ −N(s)ε / (Ĉ(log(2 + N(s)))²) }.

Proof 5 We prove this concentration inequality by induction. When the depth of the tree is D = 1, from Proposition 1 we get

|V_Ω(s) − V*_Ω(s)| = ‖Ω*(Q_Ω(s, ·)) − Ω*(Q*_Ω(s, ·))‖_∞ ≤ γ ‖r̂ − r*‖_∞   (contraction),

where r̂ is the average reward and r* is the mean reward, so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ).

From Lemma 1, with ε = 2σγ/log(2 + N(s)), we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ) ≤ 4|A| exp{ −N(s) / (2σγ(log(2 + N(s)))²) } = C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Let us assume the concentration bound holds at depth D − 1. Let us define V_Ω(s_a) = Q_Ω(s, a), where s_a is the state reached by taking action a from state s. Then at depth D − 1,

P( |V_Ω(s_a) − V*_Ω(s_a)| > ε ) ≤ C exp{ −N(s_a) / (Ĉ(log(2 + N(s_a)))²) }.     (19)

Now at depth D, because of the contraction property, we have

|V_Ω(s) − V*_Ω(s)| ≤ γ ‖Q_Ω(s, ·) − Q*_Ω(s, ·)‖_∞ = γ |Q_Ω(s, a) − Q*_Ω(s, a)|   (for some action a),

so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖Q_Ω(s, a) − Q*_Ω(s, a)‖ > ε )
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s_a)))²) }
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s)))²) }.

From (19), we have lim_{t→∞} N(s_a) = ∞, because if there existed L with N(s_a) < L, we could find ε > 0 for which (19) is not satisfied. From Lemma 4, when N(s) is large enough we have N(s_a) → π*(a|s)N(s) (for example, N(s_a) > ½ π*(a|s)N(s)), which means we can find C and Ĉ that satisfy

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Lemma 5 At any node s of the tree, with N(s) the number of visitations, we define the event

E_s = { ∀a ∈ A, |N(s, a) − N*(s, a)| < N*(s, a)/2 },  where N*(s, a) = π*(a|s)N(s),

and let ε > 0 and V_Ω(s) be the estimated value at node s. We have

P( |V_Ω(s) − V*_Ω(s)| > ε | E_s ) ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Proof 6 The proof follows the same induction as for Theorem 1. When the depth of the tree is D = 1, from Proposition 1 we get

|V_Ω(s) − V*_Ω(s)| = ‖Ω*(Q_Ω(s, ·)) − Ω*(Q*_Ω(s, ·))‖ ≤ γ ‖r̂ − r*‖_∞   (contraction property),

where r̂ is the average reward and r* is the mean reward, so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ).

From Lemma 1, with ε = 2σγ/log(2 + N(s)) and given E_s, we have

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖r̂ − r*‖_∞ > ε ) ≤ 4|A| exp{ −N(s) / (2σγ(log(2 + N(s)))²) } = C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }.

Let us assume the concentration bound holds at depth D − 1. Let us define V_Ω(s_a) = Q_Ω(s, a), where s_a is the state reached by taking action a from state s. Then at depth D − 1,

P( |V_Ω(s_a) − V*_Ω(s_a)| > ε ) ≤ C exp{ −N(s_a) / (Ĉ(log(2 + N(s_a)))²) }.

Now at depth D, because of the contraction property and given E_s, we have

|V_Ω(s) − V*_Ω(s)| ≤ γ ‖Q_Ω(s, ·) − Q*_Ω(s, ·)‖_∞ = γ |Q_Ω(s, a) − Q*_Ω(s, a)|   (for some a),

so that

P( |V_Ω(s) − V*_Ω(s)| > ε ) ≤ P( γ ‖Q_Ω(s, a) − Q*_Ω(s, a)‖ > ε )
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s_a)))²) }
  ≤ C_a exp{ −N(s_a) / (Ĉ_a(log(2 + N(s)))²) }
  ≤ C exp{ −N(s) / (Ĉ(log(2 + N(s)))²) }   (because of E_s).

Theorem 2 Let a_t be the action returned by algorithm E3W at iteration t. Then for t large enough, with some constants C, Ĉ,

P( a_t ≠ a* ) ≤ Ct exp{ −t / (Ĉσ(log t)³) }.

Proof 7 Let us define the event E_s as in Lemma 5. Let a* be the action with the largest value estimate at the root node state s. The probability that E3W selects a sub-optimal arm at s is

P( a_t ≠ a* ) ≤ Σ_a P( V_Ω(s_a) > V_Ω(s_{a*}) | E_s ) + P(E_s^c)
  = Σ_a P( (V_Ω(s_a) − V*_Ω(s_a)) − (V_Ω(s_{a*}) − V*_Ω(s_{a*})) ≥ V*_Ω(s_{a*}) − V*_Ω(s_a) | E_s ) + P(E_s^c).

Let us define Δ = V*_Ω(s_{a*}) − V*_Ω(s_a). Therefore, for Δ > 0, we have

P( a_t ≠ a* ) ≤ Σ_a P( (V_Ω(s_a) − V*_Ω(s_a)) − (V_Ω(s_{a*}) − V*_Ω(s_{a*})) ≥ Δ | E_s ) + P(E_s^c)
  ≤ Σ_a P( |V_Ω(s_a) − V*_Ω(s_a)| ≥ αΔ | E_s ) + P( |V_Ω(s_{a*}) − V*_Ω(s_{a*})| ≥ βΔ | E_s ) + P(E_s^c)
  ≤ Σ_a C_a exp{ −N(s)(αΔ) / (Ĉ_a(log(2 + N(s)))²) } + C_{a*} exp{ −N(s)(βΔ) / (Ĉ_{a*}(log(2 + N(s)))²) } + P(E_s^c),

where α + β = 1, α > 0, β > 0, and N(s) is the number of visitations of the root node s. Let us define 1/Ĉ = min{ (αΔ)/Ĉ_a, (βΔ)/Ĉ_{a*} } and C = |A| max{C_a, C_{a*}}; we have

P( a_t ≠ a* ) ≤ C exp{ −t / (Ĉσ(log(2 + t))²) } + P(E_s^c).

From Lemma 4, there exist C' and Ĉ' for which

P(E_s^c) ≤ C't exp{ −t / (Ĉ'(log t)³) },

so that

P( a_t ≠ a* ) ≤ O( t exp{ −t/(log t)³ } ).

Theorem 3 Consider an E3W policy applied to the tree. Let D_Ω*(x, y) = Ω*(x) − Ω*(y) − ∇Ω*(y)(x − y) be the Bregman divergence between x and y. The expected pseudo-regret R_n satisfies

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).

Proof 8 Without loss of generality, we can assume that V_i ∈ [−1, 0] for all i ∈ [1, |A|]. From the definition of the regret, we have

E[R_n] = nV* − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ ≤ V̂_1(0) − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ ≤ −τΩ(π̂) − Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩.

By the definition of the tree policy, we can obtain

−Σ_{t=1}^n ⟨π̂_t(·), V(·)⟩ = −Σ_{t=1}^n ⟨(1 − λ_t)∇Ω*(V̂_t(·)), V(·)⟩ − Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩
  ≤ −Σ_{t=1}^n ⟨∇Ω*(V̂_t(·)), V(·)⟩ − Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩,

with

−Σ_{t=1}^n ⟨∇Ω*(V̂_t(·)), V(·)⟩
  = Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) − ⟨∇Ω*(V̂_t(·)), V(·)⟩ ] − Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) ]
  = Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) − Σ_{t=1}^n [ Ω*(V̂_t(·) + V(·)) − Ω*(V̂_t(·)) ]
  ≤ Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + n ‖V(·)‖_∞   (contraction property, Proposition 1)
  ≤ Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·))   (because V_i ≤ 0),

and

−Σ_{t=1}^n ⟨λ_t(·)/|A|, V(·)⟩ ≤ O(n/log n)   (because Σ_{k=1}^n 1/log(k+1) → O(n/log n)).

So that

E[R_n] ≤ −τΩ(π̂) + Σ_{t=1}^n D_Ω*(V̂_t(·) + V(·), V̂_t(·)) + O(n/log n).

We consider the generalized Tsallis entropy Ω(π) = S_α(π) = (1/(1−α)) (1 − Σ_i π^α(a_i|s)). According to (Abernethy et al., 2015), when α ∈ (0, 1),

D_Ω*(V̂_t(·) + V(·), V̂_t(·)) ≤ (τα)^{−1} |A|^α,
−Ω(π̂_n) ≤ (1/(1−α)) (|A|^{1−α} − 1).

Then, for the generalized Tsallis entropy, when α ∈ (0, 1), the regret is

E[R_n] ≤ (τ/(1−α)) (|A|^{1−α} − 1) + n(τα)^{−1}|A|^α + O(n/log n).

When α = 2, which is the Tsallis entropy case we consider, according to (Zimmert & Seldin, 2019), by Taylor's theorem there exists z ∈ conv(V̂_t, V̂_t + V) such that

D_Ω*(V̂_t(·) + V(·), V̂_t(·)) ≤ ½ ⟨V(·), ∇²Ω*(z)V(·)⟩ ≤ |K|/2.

So that, when α = 2, we have

E[R_n] ≤ τ (|A|−1)/|A| + n|K|/2 + O(n/log n).

When α = 1, which is the maximum entropy case in our paper, we derive

E[R_n] ≤ τ log|A| + n|A|/τ + O(n/log n).

Finally, when the convex regularizer is the relative entropy, one can simply write KL(π_t‖π_{t−1}) = −H(π_t) − E_{π_t}[log π_{t−1}]; letting m = min_a π_{t−1}(a|s), we have

E[R_n] ≤ τ(log|A| − 1/m) + n|A|/τ + O(n/log n).

Before deriving the next theorem, we state Theorem 2 in (Geist et al., 2019):

• Boundedness: for two constants L_Ω and U_Ω such that for all π ∈ Π we have L_Ω ≤ Ω(π) ≤ U_Ω, then

V*(s) − τ(U_Ω − L_Ω)/(1 − γ) ≤ V*_Ω(s) ≤ V*(s),     (20)

where τ is the temperature and γ is the discount factor.

Theorem 4 For any δ > 0, with probability at least 1 − δ, ε_Ω satisfies

−√( Ĉσ² log(C/δ) / (2N(s)) ) − τ(U_Ω − L_Ω)/(1 − γ) ≤ ε_Ω ≤ √( Ĉσ² log(C/δ) / (2N(s)) ).

Proof 9 From Theorem 1, let us define δ = C exp{ −2N(s)ε² / (Ĉσ²) }, so that ε = √( Ĉσ² log(C/δ) / (2N(s)) ); then for any δ > 0, we have

P( |V_Ω(s) − V*_Ω(s)| ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) ) ≥ 1 − δ.

Then, for any δ > 0, with probability at least 1 − δ, we have

|V_Ω(s) − V*_Ω(s)| ≤ √( Ĉσ² log(C/δ) / (2N(s)) ),

that is

−√( Ĉσ² log(C/δ) / (2N(s)) ) ≤ V_Ω(s) − V*_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ),
−√( Ĉσ² log(C/δ) / (2N(s)) ) + V*_Ω(s) ≤ V_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) + V*_Ω(s).

From Proposition 1, we have

−√( Ĉσ² log(C/δ) / (2N(s)) ) + V*(s) − τ(U_Ω − L_Ω)/(1 − γ) ≤ V_Ω(s) ≤ √( Ĉσ² log(C/δ) / (2N(s)) ) + V*(s).