Visualizing MuZero Models

Joery A. de Vries 1 2 Ken S. Voskuil 1 2 Thomas M. Moerland 1 Aske Plaat 1

Abstract

MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent models.

1. Introduction

Model-based reinforcement learning has shown strong empirical success in sequential decision making tasks, as illustrated by the AlphaZero (Silver et al., 2018) and MuZero algorithms (Schrittwieser et al., 2020). Both of these approaches nest a planning loop, based on Monte Carlo Tree Search (Kocsis & Szepesvari, 2006; Browne et al., 2012), inside a learning loop, where we approximate global value and policy functions. While AlphaZero used a known model of the environment, MuZero learned the model from sampled data. However, instead of a standard forward dynamics model, which learns to predict future states, the MuZero dynamics model is trained to predict future values, better known as a value equivalent model (Grimm et al., 2020).

A potential benefit of value equivalent models, compared to standard forward models, is that they will emphasize value and reward relevant characteristics in their representation and dynamics. This may be beneficial when the true dynamics are complicated, but the value relevant aspects of the dynamics are comparatively simple. As a second benefit, we train our model for its intended use: predicting value information during planning. Several papers have empirically investigated this principle in recent years (Tamar et al., 2016; Oh et al., 2017; Farquhar et al., 2018; Silver et al., 2017b; Schrittwieser et al., 2020), while Grimm et al. (2020) provide a theoretical underpinning of this approach.

However, no literature has yet investigated what kind of representations these approaches actually learn, i.e., how the learned representations are organized. The goal of this paper is therefore to investigate and visualize environment models learned by MuZero. Most interestingly, we find that, after training, an action trajectory that follows the forward dynamics model usually departs from the learned embedding of the environment observations. In other words, MuZero is not enforced to keep the state encoding and forward state prediction congruent. Therefore, the second goal of this paper is to regularize MuZero's dynamics model to improve its structure. We propose two regularization objectives to add to the MuZero objective, and experimentally show that these may indeed provide benefit.

In short, after introducing related work (Sec. 2) and necessary background on the MuZero algorithm (Sec. 3), we discuss two research questions: 1) what type of representation do value equivalent models learn (Sec. 4), and 2) can we use regularization to better structure the value equivalent latent space (Sec. 5)? We experimentally validate the second question in Sec. 6 and 7. Moreover, apart from answering these two questions, we also open source modular MuZero code including an interactive visualizer of the latent space based on principal component analysis (PCA), available from www.anonymized.org. We found the visualizer to greatly enhance our understanding of the algorithm, and believe visualization will be essential for deeper understanding of this class of algorithms.

1 Leiden Institute of Advanced Computer Science, Leiden, The Netherlands. 2 Authors contributed equally. Correspondence to: Joery A. de Vries, Ken S. Voskuil.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

2. Related Work

Value equivalent models, a term introduced by Grimm et al. (2020), are usually trained on end-to-end differentiable computation graphs, although the principle would be applicable to gradient-free optimization as well. Typically, the unrolled computation graph makes multiple passes through a dynamics model, and eventually predicts a value. Then, the dynamics model is trained through gradient descent on its ability to predict the correct value. The first value equivalent approach was Value Iteration Networks (VIN) (Tamar et al., 2016), where a differentiable form of value iteration was embedded to predict a value. Other variants of value equivalent approaches are Value Prediction Networks (VPN) (Oh et al., 2017), TreeQN and ATreeC (Farquhar et al., 2018), the Predictron (Silver et al., 2017b), and MuZero (Schrittwieser et al., 2020). These methods differ in the way they build their computation graph: VINs and TreeQN embed entire policy improvement (planning) in the graph, while VPNs, the Predictron and MuZero only perform policy evaluation. Therefore, the latter approaches combine this with explicit planning for policy improvement, which in the case of MuZero happens through MCTS. Grimm et al. (2020) provide a theoretical analysis of value equivalent models, showing that two value equivalent models give the same Bellman back-up.

MuZero uses the learned value equivalent model to explicitly plan through Monte Carlo Tree Search (Kocsis & Szepesvari, 2006; Browne et al., 2012), and uses the output of the search as training targets for a learned policy network. This idea of iterated planning and learning dates back to Dyna-2 (Silver et al., 2008), while the particularly successful combination of MCTS and deep learning was introduced in AlphaGo Zero (Silver et al., 2017a) and Expert Iteration (ExIt) (Anthony et al., 2017). In general, planning may add to pure (model-free) reinforcement learning: 1) improved action selection, and 2) improved (more stable) training targets. On the other hand, learning adds to planning the ability to generalize information, and to store global solutions in memory. For more detailed overviews of value equivalent models and iterated planning and learning we refer to the model-based RL surveys by Moerland et al. (2020) and Plaat et al. (2020).

Visualization is a common approach to better understand machine learning methods, and visualization of representations and loss landscapes of (deep) neural networks has a long history (Bischof et al., 1992; Yosinski et al., 2015; Karpathy et al., 2015; Li et al., 2018a). For example, Li et al. (2018b) show how the loss landscape of a neural network can indicate smoothness of the optimization criterion. Visualization is also important in other areas of machine learning, for example to illustrate how kernel methods project low dimensional data to a high dimensional space for classification (Szymanski & McCane, 2011). Note that the most common approach to visualize neural network mappings is through non-linear dimensionality reduction techniques, such as Stochastic Neighbour Embedding (Hinton & Roweis, 2002). We instead focus on linear projections in low dimensional environments, as non-linear dimensionality reduction has the risk of altering the semantics of the MDP models.

3. The MuZero Algorithm

We briefly introduce the MuZero algorithm (Schrittwieser et al., 2020). We assume a Markov Decision Process (MDP) specification given by the tuple ⟨S, A, T, U, γ⟩, which respectively represents the set of states (S), the set of actions (A), the transition dynamics mapping state-action pairs to new states (T : S × A → p(S)), the reward function mapping state-action pairs to rewards (U : S × A → R), and a discount parameter (γ ∈ [0, 1]) (Sutton & Barto, 2018). Internally, we define an abstract MDP ⟨S̃, A, T̃, R, γ⟩, where S̃ denotes an abstract state space, with corresponding dynamics T̃ : S̃ × A → S̃ and reward prediction R : S̃ × A → R. Our goal is to find a policy π : S → p(A) that maximizes the value from the start state, where V(s) is defined as the expected infinite-horizon cumulative return:

    V(s) = E_{π,T} [ Σ_{t=0}^{∞} γ^t · u_t | s_0 = s ].    (1)

We define three distinct neural networks to approximate the above MDPs (Figure 1): the state encoding/embedding function hθ, the dynamics function gθ, and the prediction network fθ, where θ denotes the joint set of parameters of the networks. The encoding function hθ : S → S̃ maps a (sequence of) real MDP observations to a latent MDP state. The dynamics function gθ : S̃ × A → S̃ × R predicts the next latent state and the associated reward of the transition. In practice, we slightly abuse notation and also write gθ to only specify the next state prediction. Finally, the prediction network fθ : S̃ → p(A) × R predicts both the policy and value for some abstract state s̃. We will identify the separate predictions of fθ(s̃_t^k) by p_t^k and V_t^k, respectively, where subscripts denote the time index in the true environment, and superscripts index the timestep in the latent environment. Also, we write µθ = (hθ, gθ, fθ) for the joint model.

Together, these three networks can be chained to form a larger computation graph that follows a single trace, starting from state s_t and following action sequence (a_t, ..., a_{t+n}). First, we use the embedding network to obtain the first latent state from the sequence of observations: s̃_t^0 = hθ(s_0, ..., s_t). Then, we recursively transition forward in the latent space following (s̃_t^{k+1}, r_t^{k+1}) = gθ(s̃_t^k, a_{t+k}). Finally, for all generated abstract states (s̃_t^0, ..., s̃_t^n) we use the prediction network to obtain (p_t^k, V_t^k) = fθ(s̃_t^k).
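As a concrete illustration of this chaining, the following minimal sketch unrolls three small networks along an action sequence. It is not the authors' implementation: the layer sizes, the mlp helper and the tensor layout are illustrative assumptions (the architectures actually used are described in Appendix A).

```python
import tensorflow as tf

LATENT, N_ACTIONS = 4, 3  # illustrative sizes, not the paper's configuration

def mlp(out_dim):
    # small two-layer network standing in for h_theta, g_theta or f_theta
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="elu"),
        tf.keras.layers.Dense(out_dim),
    ])

h = mlp(LATENT)           # h_theta: observation -> latent state s~^0
g = mlp(LATENT + 1)       # g_theta: (latent, one-hot action) -> (next latent, reward)
f = mlp(N_ACTIONS + 1)    # f_theta: latent -> (policy logits, value)

def unroll(obs, actions):
    """Chain h, g and f along an action sequence, yielding (p^k, V^k, r^{k+1}) per step."""
    s = h(obs)                                   # embed the observation once
    outputs = []
    for a in actions:                            # a: one-hot action, shape [batch, N_ACTIONS]
        pv = f(s)
        p, v = pv[:, :N_ACTIONS], pv[:, N_ACTIONS:]
        nxt = g(tf.concat([s, a], axis=-1))
        s, r = nxt[:, :LATENT], nxt[:, LATENT:]  # latent transition and reward prediction
        outputs.append((p, v, r))
    return outputs

obs = tf.random.uniform([1, 2])                             # e.g. a 2D control observation
acts = [tf.one_hot([0], N_ACTIONS), tf.one_hot([2], N_ACTIONS)]
print(len(unroll(obs, acts)))                               # 2 unroll steps
```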

As illustrated in Figure 1, the entire differentiable computation graph implements the mapping

    (s_0, .., s_t, a_t, .., a_{t+n}) → (p_t^0, V_t^0, r_t^1, .., p_t^{n+1}, V_t^{n+1}, r_t^{n+1}).    (2)

Figure 1. Illustration of MuZero's RNN unfolding. During training, actions are given by the data.

We will later use this graph to train our networks, but first discuss how they may be utilized to perform a search. Given the three individual neural networks f, g and h, we use these networks to perform an MCTS search from state s_t in the abstract space. First, we embed the current state through h. Then, we can essentially follow the PUCT (Rosin, 2011) algorithm, a variant of MCTS that incorporates prior weights on the available actions, where the priors originate from the policy network, p_t^k (first proposed by Silver et al. (2017a)). The state transitions and policy/value predictions within the search are governed through g and f, respectively. Eventually, this MCTS procedure outputs a policy π_t = π(s_t) and value estimate V_t = V(s_t) for the root node, i.e., (π_t, V_t) ∼ MCTS(s_0, ..., s_t | µθ). It then selects action a_t ∼ π_t in the real environment, transitions to the next state, and repeats the search. We refer to Appendix B of the MuZero paper for full details on the MCTS search (Schrittwieser et al., 2020).
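For reference, a sketch of the pUCT selection rule is shown below. The formula follows Appendix B of the MuZero paper (Schrittwieser et al., 2020); the numbers and the assumption that Q is already min-max normalized over the tree are illustrative.

```python
import numpy as np

# Sketch of pUCT action selection inside the tree search. Q is assumed to be
# min-max normalized over the current tree (see below); c1 and c2 are the
# exploration constants listed in Table 2.

def puct_select(Q, P, N, c1=1.25, c2=19652):
    """argmax_a [ Q(s,a) + P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) * (c1 + log((sum_b N(s,b) + c2 + 1) / c2)) ]"""
    total_visits = N.sum()
    bonus = P * np.sqrt(total_visits) / (1 + N) * (c1 + np.log((total_visits + c2 + 1) / c2))
    return int(np.argmax(Q + bonus))

Q = np.array([0.2, 0.5, 0.1])   # normalized value estimates per action
P = np.array([0.4, 0.3, 0.3])   # priors from the policy prediction p_t
N = np.array([10, 30, 2])       # visit counts at the current node
print(puct_select(Q, P, N))     # -> 2 (decent prior, rarely visited)
```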

We still need to discuss how to train fθ, gθ, and hθ. Through playing episodes, MuZero collects trajectories η of state, action, reward sequences, and the associated MCTS statistics obtained at each state, i.e., η = (s_t, π_t, a_t, r_t, s_{t+1}, ...). As mentioned before, given a start state and a sequence of actions, we can unroll the differentiable graph that predicts rewards, values and policy parameters at each predicted state in the sequence (Eq. 2, Fig. 1). Since the entire graph is end-to-end differentiable, we may train it on a joint loss (Schrittwieser et al., 2020):

    l(θ) = Σ_{k=0}^{n} [ l^r(u_{t+k}, r_t^k) + l^v(z_{t+k}, V_t^k) + l^p(π_{t+k}, p_t^k) ] + λ ||θ||_2^2,    (3)

where l^r = 0 for k = 0, λ ∈ R is a constant that governs the L2 regularization, and z_t is a cumulative reward estimate obtained from an n-step target on the real trace η, z_t = Σ_{i=0}^{n} γ^i u_{t+i} + γ^n V_{t+n+1}. For n → ∞ this of course becomes a Monte Carlo return estimate. Moreover, MuZero uses distributional losses, inspired by (Pohlen et al., 2018), for all three predictions (r, V, and p).
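The n-step target z_t can be computed directly from a stored trace; a small sketch follows, where the reward array u and the bootstrap values V are placeholders for the replay-buffer contents.

```python
import numpy as np

# Sketch of the n-step value target z_t = sum_{i=0}^{n} gamma^i u_{t+i} + gamma^n V_{t+n+1},
# computed from arrays of observed rewards u and bootstrap values V along a trace.

def n_step_target(u, V, t, n, gamma=0.997):
    discounts = gamma ** np.arange(n + 1)
    return float(np.dot(discounts, u[t:t + n + 1]) + gamma ** n * V[t + n + 1])

u = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # rewards of a toy trace
V = np.zeros(len(u) + 1)                        # bootstrap values (here all zero)
print(n_step_target(u, V, t=0, n=2))            # 0.997**2, the discounted sparse reward
```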

Finally, MuZero employs two forms of normalization to stabilize learning. First, inside the MCTS selection rule, it min-max normalizes the value estimates based on the minimum and maximum estimate inside the current tree. This allows the algorithm to adapt to arbitrary reward scaling. Second, and more important for our paper, it also uses min-max normalization as the output activation of hθ and gθ. In particular, every abstract state is readjusted through

    s̃_scaled = (s̃ − min(s̃)) / (max(s̃) − min(s̃)),    (4)

where the min and max are over the elements in the s̃ vector. This brings the range of s̃ into the same range as the discrete actions (one-hot encoded), but also introduces an interesting restriction, especially when the latent space is small. For example, when |s̃| = 4, the min-max normalization will enforce two elements within s̃ to be exactly 0 or 1. In practice, we observed that MuZero required at least a |s̃| + 2 latent space to effectively learn a |s̃|-dimensional problem, given that the state context was non-redundant and approximately Markov; this restriction was most stringent for small |s̃|. Full details on the MuZero algorithm can be found in the Appendix of (Schrittwieser et al., 2020).
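A small sketch of this output normalization is given below; the epsilon guard is our addition to avoid division by zero for a degenerate latent state.

```python
import numpy as np

# Min-max normalization of a latent state vector, Eq. (4). For |s~| = 4 this
# always pins one element to 0 and another to 1, as discussed above.

def scale_latent(s, eps=1e-12):
    s = np.asarray(s, dtype=np.float64)
    return (s - s.min()) / (s.max() - s.min() + eps)

print(scale_latent([0.3, -1.2, 0.8, 0.1]))  # [0.75, 0.0, 1.0, 0.65]
```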

4. Visualizing MuZero's Latent Space

Value equivalent models, like MuZero, are solely trained on a low-dimensional, scalar reward signal, as the value and policy targets are also derived from the rewards. While the concept of value equivalent models is convincing, it remains unclear what types of representations these models actually achieve, and how they shape their internal space. Therefore, we first train MuZero agents to extract their learned embeddings/representations, and afterwards depict their state-space to latent-space mapping graphically in a 3D space. We create visualizations strictly on low-dimensional control environments, as it quickly becomes infeasible and nonsensical to map higher dimensional representations linearly back to only 3 dimensions.
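The following sketch illustrates this procedure for MountainCar: embed a grid of true states with an encoder and project the latents to 3D with PCA. The encoder h here is a random stand-in for a trained hθ; the state ranges are those of the Gym environment.

```python
import numpy as np

def pca_project(latents, dims=3):
    """Project L-dimensional latent states onto their first `dims` principal components."""
    centered = latents - latents.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt are principal axes
    return centered @ vt[:dims].T

positions = np.linspace(-1.2, 0.6, 50)      # MountainCar car position range
velocities = np.linspace(-0.07, 0.07, 50)   # MountainCar car velocity range
grid = np.array([[p, v] for p in positions for v in velocities])

rng = np.random.default_rng(0)
h = lambda s: np.tanh(s @ rng.normal(size=(2, 4)))  # stand-in encoder with L = 4
points_3d = pca_project(h(grid))                    # one 3D point per true state
print(points_3d.shape)                              # (2500, 3)
```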

Figure 2. Left: Relation between true state space and latent state space on MountainCar. The top-left maps the true state space, indexed by vertical lines, into the latent space, retaining the vertical lines. The pattern indicates that the latent space gets curved, stretched and compressed. The bottom-left shows the same mapping, but this time indexes the states by their value estimate near convergence. We see that the latent embedding groups states that are dynamically close, which is visible in the grouping of states with similar values (in MountainCar, states that are dynamically close also have similar value, since there is only one start region and one sparse goal). Right: Latent embedding of a ground truth trace pushed through the encoder h (blue), and latent embedding of the same trace embedded at the first state and subsequently propagated at latent level through g (green). We clearly see that, although both traces should cover the same states and cumulative reward, their latent trajectories diverge.

Figure 2 explores such a graphical depiction of MuZero's latent space near convergence (see also our supplementary material for additional visualizations during training and over ablations). The left part of the figure aims to relate the ground-truth state space to the latent space. On the top left, we visualize the ground-truth MountainCar state space (a 2D grid consisting of car position and car velocity) using vertical bars. Directly next to it, we visualize a Principal Component (PC) projection of the associated MuZero latent space when moving through the embedding function, i.e., the subspace {s̃ : s̃ = hθ(s), ∀s ∈ S} projected to its own 3-dimensional PC-space, where L = |s̃| = 4. We retain the vertical bars, which thereby illustrates the way the true state space is curved into a high dimensional manifold. We do see that MuZero recovers the state space to a large extent, while it was never explicitly trained on it, although some parts are compressed or stretched.

The bottom-left of Figure 2 provides a similar illustration of the mapping between observed states and embedded states, but this time with accompanying value information. The bottom-left figure shows the learned value function of the state-space, where the black lines illustrate some successful final MuZero trajectories. Interestingly, when we map the value information into the embedding, we see that MuZero has actually strongly warped the original space. Relative to this projection, states with low value are grouped in the left side of this space, while states with high value appear near the right side. Effectively, MuZero has created a representation space in which it groups states that are dynamically close. For example, in the true state space (left), dark blue (low value) and yellow (high value) states are often adjacent. However, in the embedding (right), these states are strongly separated, because they are dynamically distant (we cannot directly move from blue to yellow, but need to follow the trajectory shown in black). This shows that MuZero indeed manages to retrieve essential parts of the dynamics, while only being trained on a scalar reward signal.

Our last visualization of the learned model is shown in the right panel of Figure 2. In this plot, we again visualize a PCA of MuZero fitted on the embedding near convergence. However, this time we fix a start state (s_0) and action sequence (a_0, ..., a_n), and visualize a successful trace passing through both the embedding and dynamics functions. For the embedding function, given a trace of environment data η, we plot ⟨hθ(s_0), hθ(s_1), ..., hθ(s_{n+1})⟩ as the blue trajectory. In contrast, in green we embed the first state and subsequently repeatedly transition through the learned dynamics function gθ, i.e., we plot ⟨hθ(s_0), gθ(hθ(s_0), a_0), gθ(gθ(hθ(s_0), a_0), a_1), ...⟩. Interestingly, these two trajectories do not overlap. Instead, the latent forward dynamics seem to occupy a completely different region of the latent space than the state encoder.
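The two curves can be generated with a few lines; h and g below are toy stand-ins for the trained encoder and dynamics networks, only meant to show how the blue and green trajectories of Figure 2 (right) are constructed.

```python
import numpy as np

def embedded_trajectory(h, states):
    """Blue curve: every observed state pushed through the encoder h."""
    return np.stack([h(s) for s in states])

def rolled_out_trajectory(h, g, s0, actions):
    """Green curve: embed the first state once, then unroll the latent dynamics g."""
    latents = [h(s0)]
    for a in actions:
        latents.append(g(latents[-1], a))
    return np.stack(latents)

# dummy stand-ins with a 4-dimensional latent space and integer actions
h = lambda s: np.tanh(np.resize(s, 4))
g = lambda z, a: np.clip(z + 0.1 * (a - 1), 0.0, 1.0)

states = [np.array([-0.5 + 0.01 * k, 0.001 * k]) for k in range(6)]
actions = [2, 2, 2, 0, 0]
blue = embedded_trajectory(h, states)
green = rolled_out_trajectory(h, g, states[0], actions)
print(np.linalg.norm(blue - green, axis=1))  # per-step divergence between the two curves
```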

We repeatedly make this observation over different experiments and repetitions. Although one would conceptually expect these trajectories to overlap, there is no explicit loss term that enforces these trajectories to stay congruent. When the latent capacity is large enough, MuZero may fail to identify this overlap, which hypothetically could waste generalization potential. In the next section we investigate two possible modifications to combat this phenomenon.

5. Regularizing the Latent Space

Ideally, the dynamics function should stay relatively congruent to the embedding of observed environment states. Thereby, we can maximally profit from generalization in the latent space, and make optimal use of the available data. However, as we observed in the previous section, there is no explicit training objective that ensures this congruence. We therefore propose two types of regularization, i.e., differentiable penalty terms (added to the loss) that enforce the latent space to be more congruent.

The first solution is to add a contrastive regularization, which penalizes the discrepancy between the embedding of a state and the forward dynamics prediction towards the same abstract state. Given a trace η, we compute the loss

    l^c(θ) = Σ_{k=0}^{n} || hθ(s_{t+k+1}) − s̃_t^{k+1} ||_2^2,    (5)

where s̃_t^{k+1} = g(...g(g(hθ(s_0), a_0), a_1), ..., a_k). This loss is added to the standard MuZero loss (Eq. 3), where its strength is governed by a scalar ω ∈ R+. To prevent trivial solutions, we block the gradients of the above loss through hθ for k > 0. This ensures that only the dynamics function should move towards the embeddings (and not the other way around), assuming that the embedding function is already relatively well adapted. This type of contrastive loss is often seen in self-supervised learning, for example in Siamese Networks (Chen & He, 2020; Koch et al., 2015). The process is visually illustrated in the top graph of Figure 3.

Figure 3. Two types of regularization of the MuZero objective. Top: Contrastive regularization, where we penalize the discrepancy between the embedding of a trace h (blue dashed line) and the forward predictions of the same action sequence through g (green dashed line). The contrastive loss pushes the green line towards the blue line (shown in red). Bottom: Decoding regularization, which adds an explicit decoding network hθ^{-1}, which maps predictions of the latent model g (green dashed line) back to the observed state space (red lines). This enforces the learned latent model to retain some congruency with the true dynamics model.

A second solution can be the use of a decoding regularization, which decodes latent predictions back to the true observation. This essentially adds a (multi-step) forward prediction loss to the MuZero model. We initialize an additional decoder network hθ^{-1}, which maps latent states back to the true observations: s'_{t+k} = hθ^{-1}(s̃_t^k). Then, given a trace of data η, we compute the loss

    l^d(θ) = Σ_{k=0}^{n} || s_{t+k+1} − hθ^{-1}(s̃_t^{k+1}) ||_2^2,    (6)

with s̃_t^{k+1} defined as in the previous loss. Again, we can use a scalar ω ∈ R+ to determine the relative contribution of the loss when added to Eq. 3. This approach combines the value equivalent model loss with a standard forward prediction loss. The additional penalty ties the latent predictions of gθ back to the ground-truth observations. Although this mechanism is independent of the encoding hθ, it may potentially help to regularize the space of gθ transitions. Moreover, as a second benefit, this regularization may be advantageous in initial stages of learning, as state transitions are often dense in information compared to sparse reward signals. Thereby, it could be beneficial in early learning, but can be overtaken by the value equivalent objective when the model converges. The second type of regularization is visualized in the bottom graph of Figure 3.
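A minimal sketch of both penalties is given below, assuming toy stand-ins for the trained networks h, g and the decoder h_inv; the gradient blocking of the contrastive variant is implemented with tf.stop_gradient on the embedding targets, following the description above. The weighting by ω and the integration into Eq. (3) are omitted.

```python
import tensorflow as tf

def contrastive_loss(h, g, states, actions):
    """Eq. (5): pull the latent roll-out towards the (stop-gradient) embeddings."""
    s_hat = h(states[:1])                                   # embed the first observation
    loss = 0.0
    for k, a in enumerate(actions):
        s_hat = g(s_hat, a)                                 # latent prediction s~^{k+1}
        target = tf.stop_gradient(h(states[k + 1:k + 2]))   # embeddings are not pulled back
        loss += tf.reduce_sum((target - s_hat) ** 2)
    return loss

def decoder_loss(h, g, h_inv, states, actions):
    """Eq. (6): decode latent predictions back to the observed states."""
    s_hat = h(states[:1])
    loss = 0.0
    for k, a in enumerate(actions):
        s_hat = g(s_hat, a)
        loss += tf.reduce_sum((states[k + 1:k + 2] - h_inv(s_hat)) ** 2)
    return loss

# toy stand-ins: 2D observations, a 4D latent space, 3 one-hot encoded actions
h = tf.keras.layers.Dense(4)
g = lambda s, a: tf.tanh(tf.concat([s, a], axis=-1))[:, :4]
h_inv = tf.keras.layers.Dense(2)

states = tf.random.uniform([4, 2])                      # s_t ... s_{t+3}
actions = [tf.one_hot([1], 3), tf.one_hot([0], 3), tf.one_hot([2], 3)]
print(float(contrastive_loss(h, g, states, actions)))
print(float(decoder_loss(h, g, h_inv, states, actions)))
```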

6. Experimental Setup

We will now experimentally study our regularized versions of MuZero compared to standard MuZero. As an additional baseline, we also include an adaptation of AlphaZero for single-player domains (Silver et al., 2018), i.e., a similar agent with a perfect environment model. We provide a complete reimplementation of MuZero based on the original paper (Schrittwieser et al., 2020). All code is written in Python using TensorFlow 2.0 (Abadi et al., 2015). The code follows the AlphaZero-General framework (Nair et al., 2017) and is available at https://anonymized.org, along with all resulting data.

The intention for our experiments was not to achieve maximal performance of MuZero, but to visualize the learned representations and study the effects of the different regularization terms. For this reason, we focus on low-dimensional, classic control environments, as it is hard to meaningfully visualize high-dimensional state spaces. In particular, we tested our work on CartPole and MountainCar, from OpenAI's Gym (Brockman et al., 2016). To aid our analysis, we developed an interactive visualization tool for learned MuZero models on the MountainCar environment, available at https://anonymized.org. We encourage the reader to play around with this tool, as we personally found it to provide more insight than static 2D images.

Table 1. Overview of factorial experimental design: 1) regularization type, 2) regularization strength (ω), 3) latent roll-out depth (K), and 4) latent dimensionality (L = |s̃|). Color annotations correspond to the loss regularization methods depicted in Figure 3.

    Hyperpar.   CartPole                MountainCar
    Regular.    {None, Contr., Dec.}    {None, Contr., Dec.}
    ω           {0.01, 0.1, 1}          {1}
    K           {1, 5, 10}              {3}
    L           {4, 8}                  {4, 8}

All hyperparameters can be found in Appendix A in the supplemental material. For the results in this paper, we report on a factorial experiment design shown in Table 1, where we vary: 1) the regularization type, 2) the regularization strength (ω), 3) the number of latent unroll steps (K), and 4) the size of the latent representation (L = |s̃|). Of course, our main interest is in the ability of the regularizations to influence performance. All experiments were run on the CPU. The hyperparameter analysis experiment was run asynchronously on a computing cluster using 24 threads at 2 GHz. On average it took about half a day to train an agent on any particular environment. The complete experiments depicted take about one week to complete (after the separate, lengthy tuning phase of the other hyperparameters).

7. Results

Figure 4 shows results of our CartPole experiments. In both plots we compare AlphaZero, MuZero, and our two regularized MuZero extensions. On the left, we investigated the effect of the regularization strength. AlphaZero consistently solves this environment, while MuZero has more trouble and solves it roughly 75% of the time. Both regularization methods do not seem to structurally improve MuZero performance on CartPole. Surprisingly, performance dips for intermediate regularization strength (ω = 0.1). On the right of Figure 4, we display the average training curves for each of our four algorithms on CartPole. We see a similar pattern, where the regularization losses do not seem to provide benefit.

Since CartPole is a dense reward environment, in which MuZero quickly obtains gradients to structure the latent space, we also investigate our regularization methods on MountainCar. Figure 5 (left) shows the training progression for AlphaZero, MuZero and the two regularization methods. Again, AlphaZero outperforms all other agents. However, this time, both our regularization methods outperform MuZero in learning speed, where especially the contrastive regularization leads to faster learning. All methods do reach the same optimal performance.

We further quantify the strength of the learned internal model by 'blindfolding' the agent (Freeman et al., 2019). A blindfolded agent only gets to see the true state of the environment every x number of steps, where we call x the 'observation sparsity'. The right graph of Figure 5 displays the average reward obtained by all MuZero models when varying the observation sparsity (horizontal axis). All agents consistently keep solving the problem up to an observation sparsity of depth 10, which shows that their models do not compound errors too fast. Then, performance starts to deteriorate, although only the contrastively regularized, L = 4 MuZero variant keeps solving the problem, even up to 200 steps without a ground truth observation. In general, the blindfolded agents show that the model stays consistent over long planning horizons (even when trained on shorter roll-out depths), and that especially contrastive regularization further increases this ability. Please refer to the Appendix for additional visualization of the latent space in MountainCar experiments.
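The 'blindfolded' evaluation can be sketched as follows; env follows the Gym interface, while h, g and plan (the search wrapper returning an action from a latent root) are placeholders for the trained agent.

```python
# Sketch of blindfolded evaluation: the agent re-embeds a real observation only
# every `sparsity` steps and otherwise keeps planning from its own latent prediction.

def blindfolded_episode(env, h, g, plan, sparsity, max_steps=200):
    """Run one episode where the true observation is only revealed every `sparsity` steps."""
    obs = env.reset()
    latent = h(obs)
    total_reward = 0.0
    for step in range(max_steps):
        if step % sparsity == 0:           # re-anchor on the true observation
            latent = h(obs)
        action = plan(latent)              # MCTS from the current latent root
        obs, reward, done, _ = env.step(action)
        latent, _ = g(latent, action)      # otherwise roll the learned model forward
        total_reward += reward
        if done:
            break
    return total_reward
```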

Figure 4. Performance of AlphaZero, MuZero and two regularized MuZero versions on CartPole. Results are averaged over all hyperparameters K and L (Table 1). Left: Success percentage as a function of the regularization strength. 'Success' requires the maximum reward of 500 consistently over n = 10 test episodes. The error bars indicate a one standard error confidence interval of the mean success ratio. Right: Learning curves for AlphaZero, MuZero and two regularized MuZero versions. These curves are averaged over all values of K and L in Table 1, which acts as a sensitivity analysis.

Figure 5 panel titles, left to right: "MountainCar: Training Mean Trial Rewards" and ""Blindfolded" Trial Rewards (solid L = 8; dashed L = 4)".

Figure 5. Performance of AlphaZero, MuZero and two regularized MuZero versions on MountainCar. Left: Reward progression during training. Results averaged over 10 repetitions. Right: Performance of blindfolded agents, where the agent only gets to see the true environment observation every 'observation sparsity' number of steps. Solid curves for L = 8, dashed curves for L = 4. Results for a single repetition.

8. Discussion

We studied the latent space of the MuZero algorithm, a successful combination of value equivalent models and iterated planning and learning. In particular, we visualized the latent space, showing how the true state space is partially recovered and warped into a dynamically relevant space. Moreover, we identified a frequent mismatch between the latent trajectories and the embeddings of the ground truth observations. To alleviate this issue, we introduced two forms of regularization, which did not improve performance in CartPole, but did improve performance on MountainCar. Finally, we also found that these models can accurately predict correct actions over long planning horizons, up to 200 steps in some cases, while being trained on relatively short trajectories.

We hypothesize that the benefit of regularization is most prominent on sparse reward tasks like MountainCar. In CartPole, where we quickly observe rewards, the representation space of MuZero will quickly receive relevant gradients to structure itself. However, when reward information is sparse, we may benefit from additional regularization of the latent space, which might squeeze additional information out of the available data. Especially the contrastive regularization seems to show this effect, and could be a promising direction for future research.

A critique on our work could involve the dimensionality of the experiments. Deep reinforcement learning has started to focus more and more on high-dimensional problems. We opted for small scale experiments for two reasons. First of all, low-dimensional problems allow for better interpretation. We can only visualize a latent space in three dimensions, and it would not be feasible to map an Atari game representation back to 3D. Moreover, the general emphasis in RL has shifted towards the challenge of dimensionality, while problems can be challenging in more ways than dimensionality only. A good illustration of the latter point is that such a high-impact algorithm as MuZero actually struggles to consistently solve CartPole and MountainCar. This recognition of the value of low dimensional experiments also starts to resurface in the RL community (Osband et al., 2019).

Furthermore, AlphaZero and MuZero are incredibly computationally expensive to run. While standard RL algorithms are already expensive to run, these iterated search and learning paradigms only exacerbate this limitation. When we use a small-scale MCTS of n = 11 at every timestep, it already takes at least 11 times as long to complete the same number of episodes as model-free RL. Compared to model-free RL, these iterated search and learning paradigms seem to benefit in long run performance, but they are computationally heavy to train.

We observed that AlphaZero and, especially, MuZero are rather unstable to train. Their performance strongly depends on a variety of hyperparameters and network initialization, and we observe strong variation in the performance over different runs. A direction for future work is to further stabilize these algorithms. Our work on novel regularizations for value equivalent models like MuZero is a first step in this direction. Future work could investigate how to further stabilize MuZero through different regularizations, or further combinations with standard forward prediction or other types of auxiliary losses (Jaderberg et al., 2016). Although value equivalency is a strong incentive for the final representation of the model, we may aid the model during learning with additional losses and regularizations, especially in sparse reward tasks.

9. Conclusion

The first part of this paper visualized the learned latent representations and model of MuZero, which we consider the main contribution of the paper. Our two most important observations are: 1) the learned latent space warps the original space into a 'dynamically close' space, in which states that are easy to reach from each other are nearby, and 2) the embedding of an action trajectory through the latent dynamics often departs from the embedding of the ground-truth states, which suggests the latent space does not optimally generalize information. To alleviate the last problem, the second part of the paper investigated two new forms of regularization of value equivalent models: 1) a contrastive regularization, and 2) a decoder regularization. Experiments indicate that these approaches may aid MuZero in sparse reward tasks, and that these models in general manage to plan over long horizons (much longer than they were trained on). Our visualizations of the latent space show that the contrastive regularization term can force the dynamics function to predict values closer to the embedding, which in turn can help with interpreting the look-ahead search of MuZero. All code and data to reproduce experiments is open-sourced at www.anonymized.com, including an interactive visualizer to further explore the MuZero latent space. We also hope our work further stimulates new emphasis on visualization of RL experiments, which may provide more intuitive insights into deep reinforcement learning.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, volume 30. NIPS Proceedings, 2017.

Bischof, H., Pinz, A., and Kropatsch, W. G. Visualization methods for neural networks. In 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, volume 1, pp. 581-582. IEEE Computer Society, 1992.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.

Chen, X. and He, K. Exploring simple siamese representation learning. 2020. URL http://arxiv.org/abs/2011.10566.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, volume 6. ICLR, 2018.

Freeman, C. D., Metz, L., and Ha, D. Learning to predict without looking ahead: World models without forward prediction. 2019. URL http://arxiv.org/abs/1910.13038.

Grimm, C., Barreto, A., Singh, S., and Silver, D. The value equivalence principle for model-based reinforcement learning. 2020. URL http://arxiv.org/abs/2011.03506.

Hinton, G. E. and Roweis, S. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, pp. 857-864, 2002. URL https://papers.nips.cc/paper/2002/hash/6150ccc6069bea6b5716254057a194ef-Abstract.html.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. volume 37, pp. 8, 2015.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282-293. Springer, 2006.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389-6399, 2018a.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6389-6399. Curran Associates, Inc., 2018b.

Moerland, T. M., Broekens, J., and Jonker, C. M. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, 2020.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering Atari, Go, chess and shogi by planning with a learned model. 2020. URL http://arxiv.org/abs/1911.08265.

Silver, D., Sutton, R. S., and Müller, M. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968-975, 2008.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017a. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature24270. URL http://www.nature.com/articles/nature24270.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. 2017b. URL http://arxiv.org/abs/1612.08810.

Nair, S., Thakoor, S., and Jhunjhunwala, M. suragnair/alpha-zero-general, 2017. URL https://github.com/suragnair/alpha-zero-general. Original-date: 2017-12-01.

Oh, J., Singh, S., and Lee, H. Value prediction network. 2017. URL http://arxiv.org/abs/1707.03497.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140-1144, 2018. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.aar6404. URL https://science.sciencemag.org/content/362/6419/1140.

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C., Singh, S., et al. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2019.

Plaat, A., Kosters, W., and Preuss, M. Model-based deep reinforcement learning for high-dimensional problems, a survey. arXiv preprint arXiv:2008.05598, 2020.

Pohlen, T., Piot, B., Hester, T., Azar, M. G., Horgan, D., Budden, D., Barth-Maron, G., van Hasselt, H., Quan, J., Večerík, M., Hessel, M., Munos, R., and Pietquin, O. Observe and look further: Achieving consistent performance on Atari. 2018. URL http://arxiv.org/abs/1805.11593.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Bradford Books, second edition, 2018. ISBN 978-0-262-03924-6.

Szymanski, L. and McCane, B. Visualising kernel spaces. pp. 449-452, 2011.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2154-2162, 2016.

van Seijen, H., Nekoei, H., Racah, E., and Chandar, S. The LoCA regret: A consistent metric to evaluate model-based behavior in reinforcement learning. 2020. URL http://arxiv.org/abs/2007.03158.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203-230, 2011. ISSN 1573-7470. doi: 10.1007/s10472-011-9258-6. URL https://doi.org/10.1007/s10472-011-9258-6.

Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

A. Hyperparameters and Neural Architectures

Our neural network architectures for MuZero were partially inspired by the ones proposed by van Seijen et al. (2020). We kept every constituent network within the larger MuZero model architecturally equivalent; that is, hθ, gθ, fθ, and optionally hθ^{-1} all had the same internal architecture: a two-layered feed-forward network with 32 hidden nodes using the Exponential Linear Unit (ELU) activation. The embedding function hθ received only current observations, so no trajectories. Actions for the dynamics function were one-hot encoded and concatenated to the current latent state before being passed to gθ; computed latent states were also min-max normalized within the neural architecture. The action policy, value function, and reward prediction were represented as distributions and trained using a cross-entropy loss, as was also done in the original paper (Schrittwieser et al., 2020). The AlphaZero neural network was simply fθ applied directly to state observations, i.e., omitting hθ and gθ. All other unmentioned hyperparameters are shown in Table 2.
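To illustrate the distributional targets mentioned above, the snippet below spreads a scalar over two adjacent points of a discrete support; this two-point projection is our illustration of the idea, not necessarily the exact transform used in the implementation.

```python
import numpy as np

# Sketch of a categorical (distributional) representation of a scalar target,
# suitable for training with a cross-entropy loss against predicted probabilities.

def scalar_to_distribution(x, support):
    """Spread scalar x over the two nearest points of a discrete support."""
    x = np.clip(x, support[0], support[-1])
    probs = np.zeros(len(support))
    idx = np.clip(np.searchsorted(support, x) - 1, 0, len(support) - 2)
    lo, hi = support[idx], support[idx + 1]
    w = (x - lo) / (hi - lo)
    probs[idx], probs[idx + 1] = 1.0 - w, w
    return probs

support = np.linspace(0.0, 500.0, 15)            # S = 15 support points (CartPole)
target = scalar_to_distribution(321.0, support)  # two non-zero entries summing to 1
# cross-entropy against predicted probabilities p: -np.sum(target * np.log(p))
```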

B. Additional Visualization of Latent Space

Figure 6 illustrates a typical training progression for MuZero on the MountainCar environment (hyperparameters can be found in the appendix); similarly to Figure 2 we used L = |s̃| = 4. The shapes shown in the top row portray the progression of the embedding {s̃ : s̃ = hθ(s), ∀s ∈ S}, shown in a 3D space obtained through PCA. Upon initialization, the latent space is almost degenerate, and all latent traces occupy the same part of the latent space. The action selection strategy of MuZero is near uniform in this situation. Gradually during training, the agent starts to expand and eventually occupies the full latent space. In the top-right plot, we see that the latent space aggregates states with similar values. On the bottom, we display violin plots of the latent state distribution during training, in the true four-dimensional latent space. We find that these distributions appear overinflated around 0 and 1, on all dimensions. Arguably, this is an artifact of the min-max normalization of the latent space (as discussed in Section 3), which always forces one element to be 0, and one element to be 1. We do see that MuZero spreads its assignment of min and max over all dimensions. Note that, although MuZero can theoretically still predict any point in the latent space [0, 1]^4, the normalization effectively reduces the degrees of freedom within the latent space.

Figure 7 provides additional illustration of the learned latent spaces for MuZero and our two regularized MuZero variants on MountainCar. The figure has a 3x2 design, where the rows depict different algorithms, and the columns different sizes of the latent space. The left of each cell contains a 3-dimensional PCA projection of the learned latent space. We again observe that the value equivalent objective nicely unfolds a latent space, in which states that are dynamically close in the true environment are also close in the latent space.

More interestingly, the right of each cell displays the two ways to embed a trajectory: 1) by passing each observed true state through hθ (blue), or 2) by simulating forward through the latent dynamics gθ (green). We again observe that these trajectories strongly depart for the original MuZero algorithm. In contrast, the two regularization methods force both trajectories to stay closer to each other. Especially the contrastive loss manages to keep both trajectories close to consistent, which may help to obtain better generalization, and could explain the faster learning observed in Figure 5.

Figure 6 panel titles, left to right: 30 Iterations, 40 Iterations, 50 Iterations, 200 Iterations.

Figure 6. Development of the MuZero latent space over training. The progression on the top shows the 3D latent space, obtained by performing a principal component analysis on the agent's true four-dimensional latent space. The color map indicates the learned value for each latent state. The bottom progression shows violin plots for the original 4D space, indicating which parts of the latent space are actually occupied during training.

Table 2. Overview of all constant hyperparameters used in our experiments for both AlphaZero and MuZero; see Table 1 for the non-constant hyperparameters.

    Hyperparameter          Value (CP / MC)   Description
    Self-Play Iterations    80 / 1000         Number of data-gathering/ training loops to perform.
    Episodes                20                Number of trajectories to generate within the environment for each Self-Play loop.
    Epochs                  40                Number of backprop steps to perform for each Self-Play loop.
    Episode Length          500 / 200         Maximum number of steps per episode before termination.
    α                       0.25              Dirichlet alpha used for adding noise to the MCTS root prior.
    p_explore               0.25              Exploration fraction for the MCTS root prior.
    c1                      1.25              Exploration factor in MCTS's UCB formula.
    c2                      19652             Exploration factor in MCTS's UCB formula.
    τ                       1                 MCTS visitation count temperature.
    N                       11                Number of MCTS simulations to perform at each search.
    Prioritization          False             Whether to use prioritized sampling for extracting training targets from the replay buffer.
    Replay-Buffer window    10                Number of self-play loops from which data is retained; self-play iterations outside this window are discarded.
    TD-steps                10 / 50           Number of steps until bootstrapping when computing discounted returns.
    γ                       0.997             Discount factor for computing n-step/ TD-step returns.
    Optimizer               Adam              Gradient optimizer for updating neural network weights.
    η                       2 · 10^-2         Adam learning rate.
    Batch Size              128               Number of target samples to use for performing one backpropagation step.
    λ                       10^-4             L2-regularization term for penalizing magnitude of the weights.
    S                       15 / 20           Number of distribution support points for representing the value and reward functions.
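For convenience, the same constants can be collected in a configuration dictionary; the key names below are our own and simply mirror Table 2 (tuples give the CartPole and MountainCar values where they differ).

```python
# Illustrative constant-hyperparameter configuration mirroring Table 2.
MUZERO_CONFIG = {
    "self_play_iterations": (80, 1000),
    "episodes_per_iteration": 20,
    "epochs_per_iteration": 40,
    "max_episode_length": (500, 200),
    "dirichlet_alpha": 0.25,
    "exploration_fraction": 0.25,
    "c1": 1.25,
    "c2": 19652,
    "temperature": 1,
    "mcts_simulations": 11,
    "prioritized_sampling": False,
    "replay_buffer_window": 10,
    "td_steps": (10, 50),
    "gamma": 0.997,
    "optimizer": "Adam",
    "learning_rate": 2e-2,
    "batch_size": 128,
    "l2_regularization": 1e-4,
    "value_support_size": (15, 20),
}
```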

Figure 7. Rows: Different algorithms (MuZero, MuZero Decoder and MuZero Contrastive). Columns: Different latent sizes (L = 4, L = 8). For each combination we show two graphs: 1) (left) a 3-dimensional PCA projection of the learned embeddings in MountainCar and 2) (right) uniquely sampled trajectories embedded through hθ (blue) or simulated forward through the latent dynamics gθ (green). We see that both trajectories depart for standard MuZero, but especially contrastive regularization manages to keep these trajectories congruent.