Combining Off and On-Policy Training in Model-Based Reinforcement Learning
Alexandre Borges
INESC-ID / Instituto Superior Técnico, University of Lisbon
Lisbon, Portugal

Arlindo L. Oliveira
INESC-ID / Instituto Superior Técnico, University of Lisbon
Lisbon, Portugal

ABSTRACT

The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown to be effective in various domains, such as board and video games. AlphaGo [1] represented a significant step forward in our ability to learn complex board games, and it was rapidly followed by significant advances, such as AlphaGo Zero [2] and AlphaZero [3]. Recently, MuZero [4] demonstrated that it is possible to master both Atari games and board games by directly learning a model of the environment, which is then used with Monte Carlo Tree Search (MCTS) [5] to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then picks the action that corresponds to the most promising trajectory. During training, limited use is made of these simulated games, since none of their trajectories are directly used as training examples. Even if we consider that not all trajectories from simulated games are useful, there are thousands of potentially useful trajectories that are discarded. Using information from these trajectories would provide more training data, more quickly, leading to faster convergence and higher sample efficiency. Recent work [6] introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a way to obtain off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways, and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, our results show that these targets can speed up the training process and lead to faster convergence and higher rewards than the ones obtained by MuZero.

KEYWORDS

Deep Reinforcement Learning, Model-Based Learning, MCTS, MuZero, Off-Policy Learning, On-Policy Learning

1 INTRODUCTION

Board games have often been solved using planning algorithms, more specifically, using tree search with handcrafted heuristics. TD-Gammon [7] demonstrated that it is possible to learn a position evaluation function and a policy from self-play to guide the tree search instead of using handcrafted heuristics, and achieved superhuman performance in the game of backgammon. However, the success of TD-Gammon was hard to replicate in more complex games such as chess. When Deep Blue [8] was able to beat the then world chess champion Garry Kasparov, it still used tree search guided by handcrafted heuristics. These approaches failed for even more complex games, such as Go, because of its high branching factor and game length. AlphaGo [1] was the first computer program to beat a professional human player in the game of Go by combining deep neural networks with tree search. In the first training phase, the system learned from expert games, but a second training phase enabled the system to improve its performance by self-play using reinforcement learning. AlphaGo Zero [2] learned only through self-play and was able to beat AlphaGo soundly. AlphaZero [3] improved on AlphaGo Zero by generalizing the model to any board game. However, these methods were designed only for board games, specifically for two-player zero-sum games.

Recently, MuZero [4] improved on AlphaZero by generalizing it even further, so that it can learn to play single-agent games while, at the same time, learning a model of the environment. MuZero is more flexible than any of its predecessors and can master both Atari games and board games. It can also be used in environments where we do not have access to a simulator of the environment.

Motivation. All these algorithms, regardless of their successes, are costly to train. AlphaZero took three days to achieve superhuman performance in the game of Go, using a total of 5064 TPUs (Tensor Processing Units). MuZero, while being more efficient, still used 40 TPUs for Atari games and 1016 TPUs for board games.

Both AlphaZero and MuZero use MCTS [5] to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then selects the action that corresponds to the most promising trajectory. Even though not all trajectories from these simulated games correspond to good moves, they contain information useful for training. Therefore, these trajectories could provide more data to the learning system, enabling it to learn more quickly and leading to faster convergence and higher sample efficiency.

In this work, inspired by recent work [6] that introduced an off-policy value target for AlphaZero, we present three main contributions:
• We propose a way to obtain off-policy targets by using data from the MCTS tree in MuZero.
• We combine these off-policy targets with the on-policy targets already used in MuZero in several ways.
• We study the impact of using these targets and their combinations in three environments with distinct characteristics.

The rest of this work is organized as follows. Section 2 presents the background and related work of relevance for the proposed extensions. Section 3 explains the proposed extensions and Section 4 presents the results. Finally, Section 5 presents the main takeaway messages and points to possible directions for future work.

2 BACKGROUND AND RELATED WORK

2.1 Reinforcement Learning

Reinforcement learning is an area of machine learning that deals with the task of sequential decision making. In these problems, we have an agent that learns through interactions with an environment. We can describe the environment as a Markov Decision Process (MDP). An MDP is a 5-tuple (S, A, P, R, γ) where: S is a set of states; A is a set of actions; P : S × A × S → [0, 1] is the transition function that determines the probability of transitioning to a state given an action; R : S × A × S → ℝ is the reward function that, for a given state s_t, action a_t and next state s_{t+1} triplet, returns the reward; and γ ∈ [0, 1) is a discounting factor for future rewards.

Interactions with the environment can be broken into episodes. An episode is composed of several timesteps. In each timestep t, the agent observes a state s_t, takes an action a_t and receives a reward r_t from the environment, which then transitions into a new state s_{t+1}.

The learning objective for the agent is to maximize the reward over the long run. The agent can solve a reinforcement learning problem by learning one or more of the following things: a policy π(s); a value function V(s); or a model of the environment. One or more of these can then be used to plan a course of action that maximizes the reward. A policy π : S × A → [0, 1] determines the probability that the agent will take an action in a particular state. A value function V_π : S → ℝ evaluates how good a state is based on the expected discounted sum of the rewards for a certain policy, E_π[Σ_{k=0}^∞ γ^k r_{t+k+1} | S_t = s]. A model of the environment includes the transition function P between states and the reward function R. Another function of interest is the action-value function Q_π : S × A → ℝ, which represents the value of taking action a in state s and then following policy π for the remaining timesteps.

Model-free algorithms are algorithms that are value- and/or policy-based, meaning that they learn a value function and/or a policy. Algorithms that depend on a model of the environment are said to be model-based. The agent interacts with the environment using a certain policy.
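To make these definitions concrete, the toy sketch below rolls out episodes in a small, made-up two-state MDP under a uniform random policy and estimates V_π(s) as the average discounted return. The transition probabilities, rewards, and episode length are illustrative assumptions, not taken from the paper.

```python
# Toy illustration of the Section 2.1 definitions: a two-state MDP given by its
# transition function P and reward function R, a random policy pi, and a Monte
# Carlo estimate of V_pi(s) as the average discounted return. All numbers are
# made up for illustration.
import random

gamma = 0.9                       # discount factor, gamma in [0, 1)
states, actions = ["s0", "s1"], ["a0", "a1"]

# P[(s, a)] -> list of (next_state, probability); R[(s, a, s_next)] -> reward
P = {("s0", "a0"): [("s0", 0.8), ("s1", 0.2)],
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s0", 1.0)],
     ("s1", "a1"): [("s1", 0.6), ("s0", 0.4)]}
R = {("s0", "a1", "s1"): 1.0, ("s1", "a0", "s0"): 0.5}  # unspecified triplets give 0

def policy(state):
    """A uniform random policy pi(a | s)."""
    return random.choice(actions)

def episode(start, length=20):
    """Roll out one episode of fixed length and return its discounted return G_0."""
    s, g = start, 0.0
    for t in range(length):
        a = policy(s)
        next_states, probs = zip(*P[(s, a)])
        s_next = random.choices(next_states, weights=probs)[0]
        g += gamma ** t * R.get((s, a, s_next), 0.0)
        s = s_next
    return g

# Monte Carlo estimate of V_pi("s0"): average discounted return over many episodes.
returns = [episode("s0") for _ in range(5000)]
print("V_pi(s0) ≈", sum(returns) / len(returns))
```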
Backward Q-learning uses tabular methods, explicitly storing the relevant values in tables. When it comes to deep reinforcement learning, Q-learning targets have been combined with Monte Carlo targets [16]:

y = β · y_on_policy_MC + (1 − β) · y_q_learning        (1)

where y_on_policy_MC is calculated directly from the rewards of complete episodes in the replay buffer, y_q_learning is a 1-step Q-learning target, and β is a parameter to control the mixing between targets. The authors tested it in Deep Q-Learning (DQN) [17] and Deep Deterministic Policy Gradient (DDPG) [18]. This method improved learning and stability in DDPG. However, in DQN it hindered training across four out of five Atari games.
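As a concrete illustration of Equation (1), the sketch below mixes an on-policy Monte Carlo return with a 1-step Q-learning target for a single transition. The rewards, discount factor, mixing parameter, and the q_values placeholder are invented for illustration; this is not the implementation used in [16].

```python
# Illustration of the mixed target in Equation (1): a convex combination of an
# on-policy Monte Carlo return and a 1-step Q-learning target. The episode data
# and the q_values function are made-up placeholders.
import numpy as np

gamma = 0.99   # discount factor
beta = 0.5     # mixing parameter between the two targets

# A complete episode from the replay buffer: rewards observed from timestep t onwards.
rewards = [1.0, 0.0, 0.0, 1.0]

# Hypothetical Q-network output for the next state s_{t+1} (one value per action).
def q_values(next_state):
    return np.array([0.3, 0.9, 0.1])

next_state = None  # placeholder observation for s_{t+1}

# On-policy Monte Carlo target: discounted sum of the rewards actually observed.
y_on_policy_mc = sum(gamma ** k * r for k, r in enumerate(rewards))

# 1-step Q-learning target: immediate reward plus discounted value of the best next action.
y_q_learning = rewards[0] + gamma * q_values(next_state).max()

# Equation (1): mix the two targets.
y = beta * y_on_policy_mc + (1 - beta) * y_q_learning
print(f"MC target={y_on_policy_mc:.3f}, Q target={y_q_learning:.3f}, mixed={y:.3f}")
```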
2.3 MuZero

MuZero learns a model of the environment that is then used for planning. The model is trained to predict the most relevant values for planning, which in this case are the reward, the policy and the value function.

2.3.1 Network. MuZero uses three functions to be able to model the dynamics of the environment and to plan.

• Representation function h_θ: takes as input the past observations o_1, ..., o_t and outputs a hidden state s^0 that will be the root node used for planning.
• Dynamics function g_θ: takes as input the previous hidden state s^{k−1} and an action a^k, and outputs the next hidden state s^k and the immediate reward r^k.
• Prediction function f_θ: this is the same as in AlphaZero. It takes as input a hidden state s^k and outputs the policy p^k and value v^k.

Note that there is no requirement to be able to obtain the original observations from the hidden state. The only requirement for the hidden state is that it is represented in such a way as to predict the values necessary for planning accurately.

2.3.2 Training.
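Before detailing training, the following minimal sketch shows how the three functions of Section 2.3.1 fit together when the learned model is unrolled over a sequence of actions, as happens during planning and training. The TinyMuZeroModel class, its random linear maps, and the array shapes are illustrative assumptions only; MuZero's actual functions are deep neural networks.

```python
# Minimal sketch of MuZero's three-function interface (Section 2.3.1).
# The class and shapes below are illustrative stand-ins, not the real architecture.
import numpy as np


class TinyMuZeroModel:
    """Toy stand-in for the representation (h), dynamics (g) and
    prediction (f) functions, using random linear maps."""

    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_actions = num_actions
        self.W_h = rng.normal(size=(obs_dim, hidden_dim))                    # h_theta
        self.W_g = rng.normal(size=(hidden_dim + num_actions, hidden_dim))   # g_theta (state head)
        self.w_r = rng.normal(size=(hidden_dim + num_actions,))              # g_theta (reward head)
        self.W_p = rng.normal(size=(hidden_dim, num_actions))                # f_theta (policy head)
        self.w_v = rng.normal(size=(hidden_dim,))                            # f_theta (value head)

    def representation(self, observations: np.ndarray) -> np.ndarray:
        """h_theta: past observations -> root hidden state s^0."""
        return np.tanh(observations @ self.W_h)

    def dynamics(self, hidden_state: np.ndarray, action: int):
        """g_theta: (s^{k-1}, a^k) -> (s^k, r^k)."""
        a_onehot = np.eye(self.num_actions)[action]
        x = np.concatenate([hidden_state, a_onehot])
        return np.tanh(x @ self.W_g), float(x @ self.w_r)

    def prediction(self, hidden_state: np.ndarray):
        """f_theta: s^k -> (policy p^k, value v^k)."""
        logits = hidden_state @ self.W_p
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()
        return policy, float(hidden_state @ self.w_v)


# Unroll the model for a few hypothetical actions starting from a flat encoding
# of the past observations, as done when planning or training over K steps.
model = TinyMuZeroModel(obs_dim=4, num_actions=3)
s = model.representation(np.ones(4))
for k, action in enumerate([0, 2, 1], start=1):
    s, reward = model.dynamics(s, action)
    policy, value = model.prediction(s)
    print(f"step {k}: r^{k}={reward:.2f}, v^{k}={value:.2f}, p^{k}={np.round(policy, 2)}")
```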