Visualizing MuZero Models

Joery A. de Vries 1 2, Ken S. Voskuil 1 2, Thomas M. Moerland 1, Aske Plaat 1

Abstract

MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi, and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent algorithms.

1. Introduction

Model-based reinforcement learning has shown strong empirical success in sequential decision making tasks, as illustrated by the AlphaZero (Silver et al., 2018) and MuZero algorithms (Schrittwieser et al., 2020). Both of these approaches nest a planning loop, based on Monte Carlo Tree Search (Kocsis & Szepesvári, 2006; Browne et al., 2012), inside a learning loop, where we approximate global value and policy functions. While AlphaZero used a known model of the environment, MuZero learned the model from sampled data. However, instead of a standard forward dynamics model, which learns to predict future states, the MuZero dynamics model is trained to predict future values, better known as a value equivalent model (Grimm et al., 2020). A potential benefit of value equivalent models, compared to standard forward models, is that they will emphasize value and reward relevant characteristics in their representation and dynamics. This may be beneficial when the true dynamics are complicated, but the value relevant aspects of the dynamics are comparatively simple. As a second benefit, we train our model for its intended use: predicting value information during planning. Several papers have empirically investigated this principle in recent years (Tamar et al., 2016; Oh et al., 2017; Farquhar et al., 2018; Silver et al., 2017b; Schrittwieser et al., 2020), while Grimm et al. (2020) provides a theoretical underpinning of this approach.

However, no literature has yet investigated what kind of representations these approaches actually learn, i.e., how the learned representations are organized. The goal of this paper is therefore to investigate and visualize environment models learned by MuZero. Most interestingly, we find that, after training, an action trajectory that follows the forward dynamics model usually departs from the learned embedding of the environment observations. In other words, MuZero is not enforced to keep the state encoding and forward state prediction congruent. Therefore, the second goal of this paper is to regularize MuZero's dynamics model to improve its structure. We propose two regularization objectives to add to the MuZero objective, and experimentally show that these may indeed provide benefit.

In short, after introducing related work (Sec. 2) and necessary background on the MuZero algorithm (Sec. 3), we discuss two research questions: 1) what type of representation do value equivalent models learn (Sec. 4), and 2) can we use regularization to better structure the value equivalent latent space (Sec. 5)? We experimentally validate the second question in Sec. 6 and 7. Moreover, apart from answering these two questions, we also open source modular MuZero code including an interactive visualizer of the latent space based on principal component analysis (PCA), available from https://github.com/kaesve/muzero. We found the visualizer to greatly enhance our understanding of the algorithm, and believe visualization will be essential for deeper understanding of this class of algorithms.

1 Leiden Institute of Advanced Computer Science, Leiden, The Netherlands. 2 Authors contributed equally. Correspondence to: Joery A. de Vries, Ken S. Voskuil.

In submission to the International Conference on Machine Learning (ICML) 2021.

arXiv:2102.12924v2 [cs.LG] 3 Mar 2021

2. Related Work

Value equivalent models, a term introduced by Grimm et al. (2020), are usually trained on end-to-end differentiable computation graphs, although the principle would be applicable to gradient-free optimization as well. Typically, the unrolled computation graph makes multiple passes through a dynamics model, and eventually predicts a value. The dynamics model is then trained on its ability to predict the correct value. The first value equivalent approach was Value Iteration Networks (VIN) (Tamar et al., 2016), where a differentiable form of value iteration was embedded to predict a value. Other variants of value equivalent approaches are Value Prediction Networks (VPN) (Oh et al., 2017), TreeQN and ATreeC (Farquhar et al., 2018), the Predictron (Silver et al., 2017b), and MuZero (Schrittwieser et al., 2020). These methods differ in the way they build their computation graph: VINs and TreeQN embed entire policy improvement (planning) in the graph, while VPNs, the Predictron and MuZero only perform policy evaluation. Therefore, the latter approaches are combined with explicit planning for policy improvement, which in the case of MuZero happens through MCTS. Grimm et al. (2020) provides a theoretical analysis of value equivalent models, showing that two value equivalent models give the same Bellman back-up.

MuZero uses the learned value equivalent model to explicitly plan through Monte Carlo Tree Search (Kocsis & Szepesvári, 2006; Browne et al., 2012), and uses the output of the search as training targets for a learned policy network. This idea of iterated planning and learning dates back to Dyna-2 (Silver et al., 2008), while the particularly successful combination of MCTS and deep learning was introduced in AlphaGo Zero (Silver et al., 2017a) and Expert Iteration (ExIt) (Anthony et al., 2017). In general, planning may add two benefits to pure (model-free) reinforcement learning: 1) improved action selection, and 2) improved (more stable) training targets. On the other hand, learning adds to planning the ability to generalize information, and to store global solutions in memory. For more detailed overviews of value equivalent models and iterated planning and learning we refer to the model-based RL surveys by Moerland et al. (2020) and Plaat et al. (2020).

Visualization is a common approach to better understand machine learning methods, and visualization of representations and loss landscapes of (deep) neural networks has a long history (Bischof et al., 1992; Yosinski et al., 2015; Karpathy et al., 2015; Li et al., 2018a). For example, Li et al. (2018b) shows how the loss landscape of a neural network can indicate smoothness of the optimization criterion. Visualization is also important in other areas of machine learning, for example to illustrate how kernel methods project low dimensional data to a high dimensional space for classification (Szymanski & McCane, 2011). Note that the most common approach to visualize neural network mappings is through non-linear dimensionality reduction techniques, such as Stochastic Neighbour Embedding (Hinton & Roweis, 2002). We instead focus on linear projections in low dimensional environments, as non-linear dimensionality reduction has the risk of altering the semantics of the MDP models.

3. The MuZero Algorithm

We briefly introduce the MuZero algorithm (Schrittwieser et al., 2020). We assume a Markov Decision Process (MDP) specification given by the tuple ⟨S, A, T, U, γ⟩, which respectively represent the set of states (S), the set of actions (A), the transition dynamics mapping state-action pairs to new states (T : S × A → p(S)), the reward function mapping state-action pairs to rewards (U : S × A → ℝ), and a discount parameter (γ ∈ [0, 1]) (Sutton & Barto, 2018). Internally, we define an abstract MDP ⟨S̃, A, T̃, R, γ⟩, where S̃ denotes an abstract state space, with corresponding dynamics T̃ : S̃ × A → S̃, and reward prediction R : S̃ × A → ℝ. Our goal is to find a policy π : S → p(A) that maximizes the value V(s) from the start state, where V(s) is defined as the expected infinite-horizon cumulative discounted return:

V(s) = \mathbb{E}_{\pi, \mathcal{T}} \Big[ \sum_{t=0}^{\infty} \gamma^t \cdot u_t \mid s_0 = s \Big].    (1)

We define three distinct neural networks to approximate the above MDPs (Figure 1): the state encoding/embedding function h_θ, the dynamics function g_θ, and the prediction network f_θ, where θ denotes the joint set of parameters of the networks. The encoding function h_θ : S → S̃ maps a (sequence of) real MDP observations to a latent MDP state. The dynamics function g_θ : S̃ × A → S̃ × ℝ predicts the next latent state and the associated reward of the transition. In practice, we slightly abuse notation and also write g_θ to only specify the next state prediction. Finally, the prediction network f_θ : S̃ → p(A) × ℝ predicts both the policy and value for some abstract state s̃. We will identify the separate predictions of f_θ(s̃_t^k) by p_t^k and V_t^k, respectively, where subscripts denote the time index in the true environment, and superscripts index the timestep in the latent environment. Also, we write µ_θ = (h_θ, g_θ, f_θ) for the joint model.

Together, these three networks can be chained to form a larger computation graph that follows a single trace, starting from state s_t and following action sequence (a_t, ..., a_{t+n}). First, we use the embedding network to obtain the first latent state from the sequence of observations: s̃_t^0 = h_θ(s_0, ..., s_t). Then, we recursively transition forward in the latent space following (s̃_t^{k+1}, r_t^{k+1}) = g_θ(s̃_t^k, a_{t+k}). Finally, for all generated abstract states (s̃_t^0, ..., s̃_t^n) we use the prediction network to obtain (p_t^k, V_t^k) = f_θ(s̃_t^k).
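To make this chaining concrete, the following minimal sketch unrolls h_θ, g_θ and f_θ over an action sequence in the spirit of our TensorFlow 2.0 implementation; layer sizes, variable names and the latent/action dimensions are illustrative, not the exact architecture (see Appendix A). The min-max normalization applied to every latent state is introduced in Eq. (5) below.

```python
import tensorflow as tf

LATENT_DIM, NUM_ACTIONS = 4, 3  # e.g. MountainCar with |s~| = 4 (illustrative)

def mlp(out_dim):
    # Small feed-forward network, roughly in the style of Appendix A.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='elu'),
        tf.keras.layers.Dense(32, activation='elu'),
        tf.keras.layers.Dense(out_dim),
    ])

h_net = mlp(LATENT_DIM)                        # encoder h_theta
g_net = mlp(LATENT_DIM + 1)                    # dynamics g_theta: next latent state + reward
f_policy, f_value = mlp(NUM_ACTIONS), mlp(1)   # prediction f_theta, split into two heads here

def minmax(s):
    # Eq. (5): rescale every latent vector to [0, 1] element-wise.
    lo = tf.reduce_min(s, axis=-1, keepdims=True)
    hi = tf.reduce_max(s, axis=-1, keepdims=True)
    return (s - lo) / (hi - lo + 1e-8)

def unroll(obs, actions):
    """Chain h, g and f over an action sequence, cf. Eq. (2).
    obs: [B, obs_dim] float tensor, actions: [B, K] integer array."""
    s = minmax(h_net(obs))                                     # s~_t^0 = h_theta(s_t)
    preds = []
    for k in range(actions.shape[1]):
        logits, value = f_policy(s), f_value(s)                # (p_t^k, V_t^k) = f_theta(s~_t^k)
        a = tf.one_hot(actions[:, k], NUM_ACTIONS)
        out = g_net(tf.concat([s, a], axis=-1))                # (s~_t^{k+1}, r_t^{k+1})
        s, reward = minmax(out[:, :LATENT_DIM]), out[:, LATENT_DIM:]
        preds.append((logits, value, reward))                  # (p_t^k, V_t^k, r_t^{k+1})
    preds.append((f_policy(s), f_value(s), None))              # predictions at the final latent state
    return preds
```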

Altogether, the entire differentiable computation graph implements the mapping

(s_0, \ldots, s_t, a_t, \ldots, a_{t+n}) \rightarrow (p_t^0, V_t^0, r_t^1, \ldots, p_t^{n+1}, V_t^{n+1}, r_t^{n+1}),    (2)

as illustrated in Figure 1.

Figure 1. Illustration of MuZero's RNN unfolding during MCTS. During training, actions are given by the data.

We will later use this graph to train our networks, but first discuss how they may be utilized to perform a search. Given the three individual neural networks f, g and h, we use these networks to perform an MCTS search from state s_t in the abstract space. First, we embed the current state through h. Then, we essentially follow the PUCT algorithm (Rosin, 2011), a variant of MCTS that incorporates prior weights on the available actions, where the priors originate from the policy network p_t^k (first proposed by Silver et al. (2017a)). The state transitions and policy/value predictions within the search are governed by g and f, respectively. Eventually, this MCTS procedure outputs a policy π_t = π(s_t) and value estimate V_t = V(s_t) for the root node, i.e., (π_t, V_t) ∼ MCTS(s_0, ..., s_t | µ_θ). It then selects action a_t ∼ π_t in the real environment, transitions to the next state, and repeats the search. We refer to appendix B of the MuZero paper for full details on the MCTS search (Schrittwieser et al., 2020).
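For reference, the selection rule inside this search is the pUCT rule detailed in appendix B of Schrittwieser et al. (2020); the sketch below shows its commonly cited form, with the exploration constants c1 and c2 of Table 2 and Q values that are min-max normalized over the current tree (as discussed below). The node/child interface in this sketch is hypothetical.

```python
import math

def puct_select(node, c1=1.25, c2=19652.0, q_min=0.0, q_max=1.0):
    """Select the child action maximizing the pUCT score used by MuZero
    (Schrittwieser et al., 2020, appendix B). Each `node.children[a]` is assumed
    to expose `prior`, `visit_count` and `q_value`; this interface is illustrative."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_action, best_score = None, -float('inf')
    for action, child in node.children.items():
        # Q values are min-max normalized with the extremes seen in the current tree.
        q = (child.q_value - q_min) / (q_max - q_min) if q_max > q_min else 0.0
        exploration = child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        exploration *= c1 + math.log((total_visits + c2 + 1) / c2)
        score = q + exploration
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```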

We still need to discuss how to train f_θ, g_θ, and h_θ. Through playing episodes, MuZero collects trajectories η of state, action and reward sequences, together with the associated MCTS statistics obtained at each state, i.e., η = (s_t, π_t, a_t, r_t, s_{t+1}, ..., π_{t+n}, a_{t+n}, r_{t+n}, s_{t+n+1}). As mentioned before, given a start state and a sequence of actions, we can unroll the differentiable graph that predicts rewards, values and policy parameters at each predicted state in the sequence (Eq. 2, Fig. 1). Since the entire graph is end-to-end differentiable, we may train it on a joint loss (Schrittwieser et al., 2020):

l(\theta) = \sum_{k=0}^{n} \Big[ l^r(u_{t+k}, r_t^k) + l^v(z_{t+k}, V_t^k) + l^p(\pi_{t+k}, p_t^k) \Big] + \lambda \lVert \theta \rVert_2^2,    (3)

where l^r = 0 for k = 0, λ ∈ ℝ is a constant that governs the L2 regularization, and z_t is a cumulative reward estimate obtained from an n-step target on the real trace η:

z_t = \sum_{i=0}^{n} \gamma^i u_{t+i} + \gamma^n V_{t+n+1}.    (4)

For n → ∞ this of course becomes a Monte Carlo return estimate. Moreover, MuZero uses distributional losses, inspired by Pohlen et al. (2018) and Kapturowski et al. (2018), for all three predictions (r, V, and p).

Finally, MuZero employs two forms of normalization to stabilize learning. First, inside the MCTS selection rule, it min-max normalizes the value estimates based on the minimum and maximum estimate inside the current tree. This allows the algorithm to adapt to arbitrary reward scaling. Second, and more important for our paper, it also uses min-max normalization as the output activation of h_θ and g_θ. In particular, every abstract state is readjusted through

\tilde{s}_{\text{scaled}} = \frac{\tilde{s} - \min(\tilde{s})}{\max(\tilde{s}) - \min(\tilde{s})},    (5)

where the min and max are taken over the elements of the s̃ vector. This brings the range of s̃ into the same range as the discrete actions (one-hot encoded), but also introduces an interesting restriction, especially when the latent space is small. For example, when |s̃| = 4, the min-max normalization will enforce one of the elements of s̃ to be 0, and one of them to be 1. We will later see why this is relevant for our work. Full details on the MuZero algorithm can be found in the supplementary material of Schrittwieser et al. (2020).
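The n-step target of Eq. (4) and a simplified, scalar variant of the joint loss of Eq. (3) can be sketched as follows; note that the actual implementation uses distributional (cross-entropy) losses for r, V and p, as described above, so this scalar version is only illustrative, and all variable names are ours.

```python
import tensorflow as tf

def n_step_target(rewards, bootstrap_value, gamma=0.997):
    """z_t = sum_{i=0}^{n} gamma^i * u_{t+i} + gamma^n * V_{t+n+1}  (Eq. 4).
    rewards: list [u_t, ..., u_{t+n}] observed along the real trace."""
    n = len(rewards) - 1
    return sum(gamma ** i * u for i, u in enumerate(rewards)) + gamma ** n * bootstrap_value

def joint_loss(preds, targets, weights, l2=1e-4):
    """Scalar sketch of Eq. (3). `preds` and `targets` are lists of
    (policy_logits, value, reward) and (pi, z, u) per unroll step k."""
    loss = 0.0
    for k, ((p, v, r), (pi, z, u)) in enumerate(zip(preds, targets)):
        if k > 0 and r is not None:             # l^r = 0 for k = 0
            loss += tf.reduce_mean(tf.square(u - r))
        loss += tf.reduce_mean(tf.square(z - v))
        loss += tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=pi, logits=p))
    # L2 regularization over all trainable weights.
    loss += l2 * sum(tf.reduce_sum(tf.square(w)) for w in weights)
    return loss
```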

4. Visualizing MuZero's Latent Space

Value equivalent models, like MuZero, are solely trained on a low-dimensional, scalar reward signal, as the value and policy targets are also derived from the rewards. While the concept of value equivalent models is convincing, it remains unclear what types of representations these models actually achieve, and how they shape their internal space. Therefore, we first train MuZero agents and subsequently map their embedding representation s̃_t^0 back to a 3D space, which we can graphically depict.

Figure 2. Development of the MuZero latent space over training (panels from left to right: 30, 40, 50, and 200 training iterations). The progression on the top shows the 3D latent space, obtained by performing a principal component analysis on the agent's true four-dimensional latent space. The color map indicates the learned value for each latent state. The bottom progression shows violin plots for the original 4D space, indicating which parts of the latent space are actually occupied during training.

Figure 2 illustrates a typical training progression for MuZero on the MountainCar environment (hyperparameters can be found in the appendix), where we use L = |s̃| = 4. The top row shows the progression of the embedding {s̃ : s̃ = h_θ(s), ∀s ∈ S}, shown in a 3D space obtained through PCA. Upon initialization, the latent space is almost degenerate, and all latent traces occupy the same part of the latent space. The action selection strategy of MuZero is near uniform in this situation. Gradually during training, the agent starts to expand and eventually occupies the full latent space. In the top-right plot, we also see that the latent space aggregates states with similar values, a topic to which we return later.

On the bottom, we display violin plots of the latent state distribution during training, in the true four-dimensional latent space. Most interestingly, we see that these distributions appear overinflated around 0 and 1, on all dimensions. This is an artifact of the min-max normalization of the latent space (see end of previous section), which always forces one element to be 0, and one element to be 1. We do see that MuZero spreads its assignment of min and max over all dimensions. Note that, although MuZero can theoretically still predict any point in the latent space [0, 1]^4, in practice the normalization likely reduces the effective degrees of freedom within the latent space.

Figure 3 further explores the latent space near algorithm convergence. The left part of the figure aims to relate the ground-truth state space to the latent space. On the top left, we visualize the ground-truth MountainCar state space (a 2D grid consisting of car position and car velocity) using vertical bars. Directly next to it, we visualize the associated MuZero latent space when moving through the embedding function, i.e., the subspace {s̃ : s̃ = h_θ(s), ∀s ∈ S}. We retain the vertical bars, which thereby illustrates the way the true state space is curved into a high dimensional manifold. We do see that MuZero recovers the state space to a large extent, while it was never explicitly trained on it, although some parts are compressed or stretched.

The bottom-left of Figure 3 provides a similar illustration of the mapping between observed states and embedded states, but this time with accompanying value information. The bottom-left figure shows the learned value function over the state space, where the black lines illustrate some successful final MuZero trajectories. Interestingly, when we map the value information into the embedding, we see that MuZero has actually strongly warped the original space. Relative to this projection, states with low value are grouped on the left side of this space, while states with high value appear near the right side. Effectively, MuZero has created a representation space in which it groups states that are dynamically close. For example, in the true state space (left), dark blue (low value) and yellow (high value) states are often adjacent. However, in the embedding (right), these states are strongly separated, because they are dynamically distant (we cannot directly move from blue to yellow, but need to follow the trajectory shown in black). This shows that MuZero indeed manages to retrieve essential parts of the dynamics, while only being trained on a scalar reward signal.

Our last visualization of the learned model is shown in the right panel of Figure 3. In this plot, we again visualize a PCA of MuZero fitted on the embedding near convergence. However, this time we fix a start state (s_0) and action sequence (a_0, ..., a_n), and visualize a successful trace passing through both the embedding and the dynamics functions.
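The 3D views in Figures 2 and 3 are plain principal component analyses of the embedded states. A minimal sketch of such a projection is given below; the plotting code in our repository may differ in details, and the commented usage example (including the env_grid_state helper) is hypothetical.

```python
import numpy as np

def pca_project(latents, dim=3):
    """Project a set of latent states (N x L array) onto their first `dim`
    principal components, as used for the 3D visualizations in this paper.
    A plain SVD-based PCA; no library beyond NumPy is required."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = principal directions
    return centered @ vt[:dim].T                             # N x dim coordinates

# Hypothetical usage: embed a grid of (discretized) MountainCar states and project them.
# states = np.stack([env_grid_state(i) for i in range(num_grid_points)])
# latents = h_net(states).numpy()
# coords = pca_project(latents)   # feed into a 3D scatter plot, colored by predicted value
```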

Figure 3. Left: Relation between the true state space and the latent state space on MountainCar. The top-left maps the true state space, indexed by vertical lines, into the latent space, retaining the vertical lines. The pattern indicates that the latent space gets curved, stretched and compressed. The bottom-left shows the same mapping, but this time indexes the states by their value estimate near convergence. We see that the latent embedding groups states that are dynamically close, which is visible in the grouping of states with similar values (in MountainCar, states that are dynamically close also have similar value, since there is only one start region and one sparse goal). Right: Latent embedding of a ground truth trace pushed through the encoder h (blue) and latent embedding of the same trace embedded at the first state and subsequently propagated at the latent level through g. We clearly see that, although both traces should cover the same trace and cumulative reward, their latent trajectories diverge.

For the embedding function, given a trace of environment data η, we plot ⟨h_θ(s_0), h_θ(s_1), ..., h_θ(s_{n+1})⟩ as the blue trajectory. In contrast, in green we embed the first state and subsequently repeatedly transition through the learned dynamics function g_θ, i.e., we plot ⟨h_θ(s_0), g_θ(h_θ(s_0), a_0), g_θ(g_θ(h_θ(s_0), a_0), a_1), ...⟩. Interestingly, these two trajectories do not overlap. Instead, the latent forward dynamics seem to occupy a completely different region of the latent space than the state encoder. We repeatedly make this observation over different experiments and repetitions. Although one would conceptually expect these trajectories to overlap, there is no explicit loss term that enforces them to stay congruent. When the latent capacity is large enough, MuZero may fail to identify this overlap, which could waste generalization potential. Therefore, in the next section, we investigate two possible solutions to this problem.

5. Regularizing the Latent Space

Ideally, the dynamics function should stay relatively congruent to the embedding of observed environment states. Thereby, we can maximally profit from generalization in the latent space, and make optimal use of the available data. However, as we observed in the previous section, there is no explicit training objective that ensures this congruence. We therefore propose two types of regularization, i.e., differentiable penalty terms (added to the loss) that enforce the latent space to be more congruent.

The first solution is to add a contrastive regularization, which penalizes the distance between the embedding of a state and the forward dynamics prediction towards the same abstract state. Given a trace η, we compute the loss

l^c(\theta) = \sum_{k=0}^{n} \lVert h_\theta(s_{t+k+1}) - \tilde{s}_t^{k+1} \rVert_2^2,    (6)

where s̃_t^{k+1} = g(...g(g(h_θ(s_0), a_0), a_1), ..., a_k). This loss is added to the standard MuZero loss (Eq. 3), where its strength is governed by a scalar ω ∈ ℝ^+. To prevent trivial solutions, we block the gradients of the above loss through h_θ for k > 0. This ensures that only the dynamics function should move towards the embeddings (and not the other way around), assuming that the embedding function is already relatively well adapted. This type of contrastive loss is often seen in self-supervised representation learning, for example in Siamese Networks (Chen & He, 2020; Koch et al., 2015). The process is visually illustrated in the top graph of Figure 4.

A second solution can be the use of a decoding regularization, which decodes latent predictions back to the true observation. This essentially adds a (multi-step) forward prediction loss to the MuZero model. We initialize an additional decoder network h_θ^{-1}, which maps latent states back to the true observations: s'_{t+k} = h_θ^{-1}(s̃_t^k). Then, given a trace of data η, we compute the loss

l^d(\theta) = \sum_{k=0}^{n} \lVert s_{t+k+1} - h_\theta^{-1}(\tilde{s}_t^{k+1}) \rVert_2^2,    (7)

with s̃_t^{k+1} defined as in the previous loss. Again, we can use a scalar ω ∈ ℝ^+ to determine the relative contribution of the loss when added to Eq. 3. This approach combines the value equivalent model loss with a standard forward prediction loss. The additional penalty ties the latent predictions of g_θ back to the ground-truth observations. Although this mechanism is independent of the encoding h, it may potentially help to regularize the space of g_θ transitions. Moreover, as a second benefit, this regularization may be advantageous in the initial stages of learning, as state transitions are often dense in information compared to sparse reward signals. Thereby, it could be beneficial in early learning, but can be overtaken by the value equivalent objective when the model converges. The second type of regularization is visualized in the bottom graph of Figure 4.
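Both penalty terms can be sketched as follows, reusing the network helpers from the sketch in Section 3 (mlp, h_net, g_net, minmax, LATENT_DIM, NUM_ACTIONS); the stop-gradient on the embedding targets implements the gradient blocking described above. This reflects our reading of Eqs. (6) and (7), not the exact repository code, and OBS_DIM is an illustrative constant.

```python
import tensorflow as tf

OBS_DIM = 2  # observation size, e.g. MountainCar (position, velocity); illustrative

def contrastive_loss(observations, actions):
    """Sketch of Eq. (6): pull the unrolled latent states s~_t^{k+1} towards the
    embeddings h_theta(s_{t+k+1}) of the actually observed next states.
    observations: [B, K+1, obs_dim], actions: [B, K] integers."""
    s = minmax(h_net(observations[:, 0]))                     # root embedding h_theta(s_t)
    loss = 0.0
    for k in range(actions.shape[1]):
        a = tf.one_hot(actions[:, k], NUM_ACTIONS)
        s = minmax(g_net(tf.concat([s, a], axis=-1))[:, :LATENT_DIM])   # s~_t^{k+1}
        target = tf.stop_gradient(minmax(h_net(observations[:, k + 1])))  # only g moves
        loss += tf.reduce_mean(tf.reduce_sum(tf.square(target - s), axis=-1))
    return loss

decoder_net = mlp(OBS_DIM)  # additional decoder h_theta^{-1}

def decoder_loss(observations, actions):
    """Sketch of Eq. (7): decode the unrolled latent states back to observations."""
    s = minmax(h_net(observations[:, 0]))
    loss = 0.0
    for k in range(actions.shape[1]):
        a = tf.one_hot(actions[:, k], NUM_ACTIONS)
        s = minmax(g_net(tf.concat([s, a], axis=-1))[:, :LATENT_DIM])
        recon = decoder_net(s)                                # h_theta^{-1}(s~_t^{k+1})
        loss += tf.reduce_mean(tf.reduce_sum(tf.square(observations[:, k + 1] - recon), axis=-1))
    return loss

# Either term is added to the MuZero loss of Eq. (3) as l(theta) + omega * l_c / l_d.
```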

Figure 4. Two types of regularization of the MuZero objective. Top: Contrastive regularization, where we penalize the discrepancy between the embedding of a trace through h (blue dashed line) and the forward predictions of the same action sequence through g (green dashed line). The contrastive loss pushes the green line towards the blue line (shown in red). Bottom: Decoding regularization, which adds an explicit decoding network h_θ^{-1} that maps predictions of the latent model g (green dashed line) back to the observed state space (red lines). This enforces the learned latent model to retain some congruency with the true dynamics model.

Table 1. Overview of the factorial experimental design: 1) regularization type, 2) regularization strength (ω), 3) latent roll-out depth (K), and 4) latent dimensionality (L = |s̃|). Color annotations correspond to the loss regularization methods depicted in Figure 4.

Hyperparameter    CartPole                 MountainCar
Regularization    {None, Contr., Dec.}     {None, Contr., Dec.}
ω                 {0.01, 0.1, 1}           {1}
K                 {1, 5, 10}               {3}
L                 {4, 8}                   {4, 8}

6. Experimental Setup

We will now experimentally study our regularized versions of MuZero compared to standard MuZero. As an additional baseline, we also include an adaptation of AlphaZero for single-player domains (Silver et al., 2018), i.e., a similar agent with a perfect environment model. We provide a complete reimplementation of MuZero based on the original paper (Schrittwieser et al., 2020). All code is written in Python using TensorFlow 2.0 (Abadi et al., 2015). The code follows the AlphaZero-General framework (Nair et al., 2017) and is available at https://github.com/kaesve/muzero, along with all resulting data.

We tested our work on low-dimensional, classic control environments from OpenAI's Gym (Brockman et al., 2016), in particular CartPole and MountainCar. These are low-dimensional problems for which it is feasible to visualize and interpret the resulting model. We also developed an interactive visualization tool to inspect learned MuZero models on the MountainCar environment at https://kaesve.nl/projects/muzero-model-inspector/#/.¹ We encourage the reader to play around with this tool, as we personally found it to provide more insight than static 2D images.

¹ Please find a video of our tool in the supplementary material.

All hyperparameters can be found in Appendix A in the supplemental material. For the results in this paper, we report on a factorial experiment design shown in Table 1, where we vary: 1) the regularization type, 2) the regularization strength (ω), 3) the number of latent unroll steps (K), and 4) the size of the latent representation (L = |s̃|). Of course, our main interest is in the ability of the regularizations to influence performance. All experiments were run on the CPU. The hyperparameter analysis experiment was run asynchronously on a computing cluster using 24 threads at 2 GHz. On average it took about half a day to train an agent on any particular environment. The complete experiments depicted take about one week to complete (after the separate, lengthy tuning phase of the other hyperparameters).
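For concreteness, the CartPole grid of Table 1 can be enumerated as follows; seeds and the constant hyperparameters of Table 2 are omitted, and for MountainCar only ω = 1 and K = 3 are used.

```python
from itertools import product

regularizers = [None, 'contrastive', 'decoder']
omegas, Ks, Ls = [0.01, 0.1, 1.0], [1, 5, 10], [4, 8]

cartpole_configs = [
    dict(regularizer=r, omega=w, unroll_K=k, latent_L=l)
    for r, w, k, l in product(regularizers, omegas, Ks, Ls)
]
print(len(cartpole_configs))  # 54 combinations (omega has no effect when regularizer is None)
```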
7. Results

Figure 5 shows the results of our CartPole experiments. In both plots we compare AlphaZero, MuZero, and our two regularized MuZero extensions.

Figure 5. Performance of AlphaZero, MuZero and two regularized MuZero versions on CartPole. Results are averaged over all hyperparameters K and L (Table 1). Left: Success percentage as a function of the regularization strength. 'Success' requires the maximum reward of 500 consistently over n = 10 test episodes. The error bars indicate a one standard error confidence interval of the mean success ratio. Right: Learning curves for AlphaZero, MuZero and two regularized MuZero versions. These curves are averaged over all values of K and L in Table 1, which acts as a sensitivity analysis.

On the left, we investigate the effect of the regularization strength. AlphaZero consistently solves this environment, while MuZero has more trouble and solves it roughly 75% of the time. Both regularization methods do not seem to structurally improve MuZero performance on CartPole. Surprisingly, performance dips for an intermediate regularization strength (ω = 0.1). On the right of Figure 5, we display the average training curves for each of our four algorithms on CartPole. We see a similar pattern, where the regularization losses do not seem to provide benefit.

Since CartPole is a dense reward environment, in which MuZero quickly obtains gradients to structure the latent space, we also investigate our regularization methods on MountainCar. Figure 6 (left) shows the training progression for AlphaZero, MuZero and the two regularization methods. Again, AlphaZero is clearly the fastest to solve the environment. However, this time, both our regularization methods outperform MuZero in learning speed, where especially the contrastive regularization leads to faster learning. All methods do reach the same optimal performance.

We further quantify the strength of the learned internal model by 'blindfolding' the agent (Freeman et al., 2019). A blindfolded agent only gets to see the true state of the environment every x number of steps, where we call x the 'observation sparsity' (a sketch of this protocol is given below). The right graph of Figure 6 displays the average reward obtained by all MuZero models when varying the observation sparsity (horizontal axis). All agents consistently keep solving the problem up to an observation sparsity of depth 10, which shows that their models do not compound errors too fast. Then, performance starts to deteriorate, although only the contrastively regularized, L = 4 MuZero variant keeps solving the problem, even up to 200 steps without a ground truth observation. In general, the blindfolded agents show that the model stays consistent over long planning horizons (even when trained on shorter roll-out depths), and that especially contrastive regularization further increases this ability. Please refer to the Appendix for additional visualizations of the latent space in the MountainCar experiments.

8. Discussion

We studied the latent space of the MuZero algorithm, a successful combination of value equivalent models and iterated planning and learning. In particular, we visualized the latent space, showing how the true state space is partially recovered and warped into a dynamically relevant space. Moreover, we identified a frequent mismatch between the latent trajectories and the embeddings of the ground truth observations. To alleviate this issue, we introduced two forms of regularization, which did not improve performance in CartPole, but did improve performance on MountainCar. Finally, we also found that these models can accurately predict correct actions over long planning horizons, up to 200 steps in some cases, while being trained on relatively short trajectories.

We hypothesize that the benefit of regularization is most prominent on sparse reward tasks like MountainCar. In CartPole, where we quickly observe rewards, the representation space of MuZero will quickly receive relevant gradients to structure itself. However, when reward information is sparse, we may benefit from additional regularization of the latent space, which might squeeze additional information out of the available data. Especially the contrastive regularization seems to show this effect, and could be a promising direction for future research.
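For concreteness, the blindfolded evaluation behind Figure 6 (right) can be sketched as follows, reusing the network helpers from the sketch in Section 3; run_mcts is a hypothetical stand-in for MuZero's search from a latent root state, and a gym-style environment interface is assumed.

```python
import numpy as np
import tensorflow as tf

def blindfolded_episode(env, sparsity, max_steps=200):
    """Sketch of the 'blindfolded' evaluation (cf. Freeman et al., 2019): the agent
    only re-embeds a true observation every `sparsity` steps; in between, the root
    latent state is propagated through the learned dynamics g_theta."""
    obs, total_reward = env.reset(), 0.0
    root = minmax(h_net(obs[None]))                       # embed the first observation
    for step in range(max_steps):
        policy, _ = run_mcts(root)                        # hypothetical: returns (pi_t, V_t)
        action = int(np.argmax(policy))
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
        if (step + 1) % sparsity == 0:
            root = minmax(h_net(obs[None]))               # ground the root in the true state
        else:
            a = tf.one_hot([action], NUM_ACTIONS)
            root = minmax(g_net(tf.concat([root, a], axis=-1))[:, :LATENT_DIM])
    return total_reward
```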


Figure 6. Performance of AlphaZero, MuZero and two regularized MuZero versions on MountainCar. Left: Reward progression during training. Results averaged over 10 repetitions. Right: Performance of blindfolded agents, where the agent only gets to see the true environment observation every 'observation sparsity' number of steps. Solid curves for L = 8, dashed curves for L = 4. Results for a single repetition.

A critique on our work could involve the dimensionality of the experiments. Deep reinforcement learning has started to focus more and more on high-dimensional problems. Nevertheless, we opted for small-scale experiments for two reasons. First of all, low-dimensional problems allow for better interpretation. We can only visualize a latent space in three dimensions, and it would not make sense, for example, to map an Atari game representation back to 3D. Moreover, the general emphasis in RL has shifted towards the challenge of dimensionality, while problems can be challenging in more ways than dimensionality only. A good illustration of the latter point is that such a high-impact algorithm as MuZero actually struggles to consistently solve CartPole and MountainCar. This recognition of the value of low-dimensional experiments also starts to resurface in the RL community (Osband et al., 2019).

As a second reason, we chose small-scale experiments because AlphaZero and MuZero are very computationally expensive to run. While standard RL algorithms are already expensive to run, these iterated search and learning paradigms are significantly more expensive. When we use a small-scale MCTS of N = 11 simulations at every timestep, it already takes at least 11 times as long to complete the same number of episodes as model-free RL. Compared to model-free RL, these iterated search and learning paradigms seem to benefit in long-run performance, but they are computationally heavy to train.

We observed that AlphaZero and, especially, MuZero are rather unstable to train. Their performance strongly depends on a variety of hyperparameters and network initialization, and we observe strong variation in the performance over different runs. A direction for future work is to further stabilize these algorithms. Our work on novel regularizations for value equivalent models like MuZero is a first step in this direction. Future work could investigate how to further stabilize MuZero through different regularizations, or further combinations with standard forward prediction or types of auxiliary losses (Jaderberg et al., 2016). Although value equivalency is a strong incentive for the final representation of the model, we may aid the model during learning with additional losses and regularizations, especially in sparse reward tasks.

9. Conclusion

The first part of this paper visualized the learned latent representations and model of MuZero, which we consider the main contribution of the paper. Our two most important observations are: 1) the learned latent space warps the original space into a 'dynamically close' space, in which states that are easy to reach from each other are nearby, and 2) the embedding of an action trajectory through the latent dynamics often departs from the embedding of the ground-truth states, which suggests the latent space does not optimally generalize information. To alleviate the last problem, the second part of the paper investigated two new forms of regularization of value equivalent models: 1) a contrastive regularization, and 2) a decoder regularization. Experiments indicate that these approaches may aid MuZero in sparse reward tasks, and that these models in general manage to plan over long horizons (much longer than they were trained on). Our visualizations of the latent space indicate that, for example, the contrastive regularization term can force the dynamics function to predict values close to the embedding, which in turn can help in interpreting the look-ahead search of MuZero. All code and data to reproduce the experiments are open-sourced at https://github.com/kaesve/muzero, including an interactive visualizer to further explore the MuZero latent space. We also hope our work further stimulates new emphasis on visualization of RL experiments, which may provide more intuitive insights into deep reinforcement learning.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, volume 30. NIPS Proceedings, 2017.

Bischof, H., Pinz, A., and Kropatsch, W. G. Visualization methods for neural networks. In 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems, volume 1, pp. 581–582. IEEE Computer Society, 1992.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Chen, X. and He, K. Exploring simple siamese representation learning. 2020. URL http://arxiv.org/abs/2011.10566.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. In 6th International Conference on Learning Representations, ICLR 2018 – Conference Track Proceedings, volume 6. ICLR, 2018.

Freeman, C. D., Metz, L., and Ha, D. Learning to predict without looking ahead: World models without forward prediction. 2019. URL http://arxiv.org/abs/1910.13038.

Grimm, C., Barreto, A., Singh, S., and Silver, D. The value equivalence principle for model-based reinforcement learning. 2020. URL http://arxiv.org/abs/2011.03506.

Hinton, G. E. and Roweis, S. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:857–864, 2002. URL https://papers.nips.cc/paper/2002/hash/6150ccc6069bea6b5716254057a194ef-Abstract.html.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. Recurrent experience replay in distributed reinforcement learning. 2018. URL https://openreview.net/forum?id=r1lyTjAqYX.

Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition, volume 37, pp. 8, 2015.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399, 2018a.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6389–6399. Curran Associates, Inc., 2018b.

Moerland, T. M., Broekens, J., and Jonker, C. M. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, 2020.

Nair, S., Thakoor, S., and Jhunjhunwala, M. suragnair/alpha-zero-general, 2017. URL https://github.com/suragnair/alpha-zero-general. Original-date: 2017-12-01.

Oh, J., Singh, S., and Lee, H. Value prediction network. 2017. URL http://arxiv.org/abs/1707.03497.

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvari, C., Singh, S., et al. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2019.

Plaat, A., Kosters, W., and Preuss, M. Model-based deep reinforcement learning for high-dimensional problems, a survey. arXiv preprint arXiv:2008.05598, 2020.

Pohlen, T., Piot, B., Hester, T., Azar, M. G., Horgan, D., Budden, D., Barth-Maron, G., van Hasselt, H., Quan, J., Večerík, M., Hessel, M., Munos, R., and Pietquin, O. Observe and look further: Achieving consistent performance on Atari. 2018. URL http://arxiv.org/abs/1805.11593.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011. ISSN 1573-7470. doi: 10.1007/s10472-011-9258-6. URL https://doi.org/10.1007/s10472-011-9258-6.

Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., and Silver, D. Mastering Atari, Go, chess and shogi by planning with a learned model. 2020. URL http://arxiv.org/abs/1911.08265.

Silver, D., Sutton, R. S., and Müller, M. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968–975, 2008.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017a. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature24270. URL http://www.nature.com/articles/nature24270.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., and Degris, T. The Predictron: End-to-end learning and planning. 2017b. URL http://arxiv.org/abs/1612.08810.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.aar6404. URL https://science.sciencemag.org/content/362/6419/1140.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Bradford Books, second edition, 2018. ISBN 978-0-262-03924-6.

Szymanski, L. and McCane, B. Visualising kernel spaces. pp. 449–452, 2011.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2154–2162, 2016.

van Seijen, H., Nekoei, H., Racah, E., and Chandar, S. The LoCA regret: A consistent metric to evaluate model-based behavior in reinforcement learning. 2020. URL http://arxiv.org/abs/2007.03158.

Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

A. Hyperparameters and Neural Architectures

Our neural network architectures for MuZero were partially inspired by the ones proposed by van Seijen et al. (2020). We kept every constituent network within the larger MuZero model equivalent, that is: h_θ, g_θ, f_θ, and optionally h_θ^{-1} had the same internal architecture, being a two-layered feed-forward network with 32 hidden nodes using the Exponential Linear Unit activation. The embedding function h_θ received only current observations, so no trajectories. Actions for the dynamics function were one-hot encoded and concatenated to the current latent state before being passed to g_θ; computed latent states were also min-max normalized within the neural architecture. The action policy, value function, and reward prediction were represented as distributions and trained using a cross-entropy loss, as was also done in the original paper (Schrittwieser et al., 2020). The AlphaZero neural network was simply f_θ receiving state observations, i.e., omitting h_θ and g_θ. All other unmentioned hyperparameters are shown in Table 2.
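The distributional value and reward targets mentioned above can be obtained by projecting scalars onto a fixed support with a two-hot encoding, as sketched below. Only the number of support points S is taken from Table 2; the support range in the usage example is illustrative, and MuZero additionally applies an invertible scaling transform before this projection.

```python
import numpy as np

def scalar_to_support(x, support):
    """Two-hot projection of a scalar onto fixed, increasing support points, so that
    value and reward targets can be trained with a cross-entropy loss."""
    x = float(np.clip(x, support[0], support[-1]))
    probs = np.zeros(len(support))
    hi = int(np.searchsorted(support, x))
    if support[hi] == x:
        probs[hi] = 1.0
    else:
        lo = hi - 1
        w = (x - support[lo]) / (support[hi] - support[lo])
        probs[lo], probs[hi] = 1.0 - w, w
    return probs

def support_to_scalar(probs, support):
    """Inverse mapping: expected value under the categorical distribution."""
    return float(np.dot(probs, support))

# Illustrative usage with S = 15 support points (CartPole value of Table 2):
# support = np.linspace(0.0, 500.0, 15); probs = scalar_to_support(123.4, support)
```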

B. Additional Visualization of Latent Space

Figure 7 provides additional illustrations of the learned latent spaces for MuZero and our two regularized MuZero variants on MountainCar. The figure has a 3x2 design, where the rows depict different algorithms, and the columns different sizes of the latent space. The left of each cell contains a 3-dimensional PCA projection of the learned latent space. We again observe that the value equivalent objective nicely unfolds a latent space, in which states that are dynamically close in the true environment are also close in the latent space. More interestingly, the right of each cell displays the two ways to embed a trajectory: 1) by passing each observed true state through h_θ (blue), or 2) by simulating forward through the latent dynamics g_θ (green). We again observe that these trajectories strongly depart for the original MuZero algorithm. In contrast, the two regularization methods force both trajectories to stay closer to each other. Especially the contrastive loss manages to keep both trajectories close to consistent, which may help to obtain better generalization, and could explain the faster learning observed in Figure 6.

Table 2. Overview of all constant hyperparameters used in our experiments for both AlphaZero and MuZero; see Table 1 for the non-constant hyperparameters. Values are listed as CartPole / MountainCar (CP / MC) where they differ.

Hyperparameter        Value (CP / MC)   Description
Self-Play Iterations  80 / 1000         Number of data-gathering/training loops to perform.
Episodes              20                Number of trajectories to generate within the environment for each Self-Play loop.
Epochs                40                Number of backprop steps to perform for each Self-Play loop.
Episode Length        500 / 200         Maximum number of steps per episode before termination.
α                     0.25              Dirichlet alpha used for adding noise to the MCTS root prior.
p_explore             0.25              Exploration fraction for the MCTS root prior.
c1                    1.25              Exploration factor in MCTS's UCB formula.
c2                    19652             Exploration factor in MCTS's UCB formula.
τ                     1                 MCTS visitation count temperature.
N                     11                Number of MCTS simulations to perform at each search.
Prioritization        False             Whether to use prioritized sampling for extracting training targets from the replay buffer.
Replay-Buffer window  10                Number of self-play loops from which data is retained; self-play iterations outside this window are discarded.
TD-steps              10 / 50           Number of steps until bootstrapping when computing discounted returns.
γ                     0.997             Discount factor for computing n-step/TD-step returns.
Optimizer             Adam              Gradient optimizer for updating neural network weights.
η                     2 · 10^-2         Adam learning rate.
Batch Size            128               Number of target samples to use for performing one backpropagation step.
λ                     10^-4             L2 regularization term for penalizing the magnitude of the weights.
S                     15 / 20           Number of distribution support points for representing the value and reward functions.

Figure 7. Rows: different algorithms (MuZero, MuZero Decoder and MuZero Contrastive). Columns: different latent sizes (L = 4, L = 8). For each combination we show two graphs: 1) (left) a 3-dimensional PCA projection of the learned embeddings in MountainCar and 2) (right) uniquely sampled trajectories embedded through h_θ (blue) or simulated forward through the latent dynamics g_θ (green). We see that both trajectories depart for standard MuZero, but especially contrastive regularization manages to keep these trajectories congruent.