Visualizing Muzero Models
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Game Changer
Matthew Sadler and Natasha Regan Game Changer AlphaZero’s Groundbreaking Chess Strategies and the Promise of AI New In Chess 2019 Contents Explanation of symbols 6 Foreword by Garry Kasparov �������������������������������������������������������������������������������� 7 Introduction by Demis Hassabis 11 Preface 16 Introduction ������������������������������������������������������������������������������������������������������������ 19 Part I AlphaZero’s history . 23 Chapter 1 A quick tour of computer chess competition 24 Chapter 2 ZeroZeroZero ������������������������������������������������������������������������������ 33 Chapter 3 Demis Hassabis, DeepMind and AI 54 Part II Inside the box . 67 Chapter 4 How AlphaZero thinks 68 Chapter 5 AlphaZero’s style – meeting in the middle 87 Part III Themes in AlphaZero’s play . 131 Chapter 6 Introduction to our selected AlphaZero themes 132 Chapter 7 Piece mobility: outposts 137 Chapter 8 Piece mobility: activity 168 Chapter 9 Attacking the king: the march of the rook’s pawn 208 Chapter 10 Attacking the king: colour complexes 235 Chapter 11 Attacking the king: sacrifices for time, space and damage 276 Chapter 12 Attacking the king: opposite-side castling 299 Chapter 13 Attacking the king: defence 321 Part IV AlphaZero’s -
Improving Model-Based Reinforcement Learning with Internal State Representations Through Self-Supervision
1 Improving Model-Based Reinforcement Learning with Internal State Representations through Self-Supervision Julien Scholz, Cornelius Weber, Muhammad Burhan Hafez and Stefan Wermter Knowledge Technology, Department of Informatics, University of Hamburg, Germany [email protected], fweber, hafez, [email protected] Abstract—Using a model of the environment, reinforcement the design decisions made for MuZero, namely, how uncon- learning agents can plan their future moves and achieve super- strained its learning process is. After explaining all necessary human performance in board games like Chess, Shogi, and Go, fundamentals, we propose two individual augmentations to while remaining relatively sample-efficient. As demonstrated by the MuZero Algorithm, the environment model can even be the algorithm that work fully unsupervised in an attempt to learned dynamically, generalizing the agent to many more tasks improve its performance and make it more reliable. After- while at the same time achieving state-of-the-art performance. ward, we evaluate our proposals by benchmarking the regular Notably, MuZero uses internal state representations derived from MuZero algorithm against the newly augmented creation. real environment states for its predictions. In this paper, we bind the model’s predicted internal state representation to the environ- II. BACKGROUND ment state via two additional terms: a reconstruction model loss and a simpler consistency loss, both of which work independently A. MuZero Algorithm and unsupervised, acting as constraints to stabilize the learning process. Our experiments show that this new integration of The MuZero algorithm [1] is a model-based reinforcement reconstruction model loss and simpler consistency loss provide a learning algorithm that builds upon the success of its prede- significant performance increase in OpenAI Gym environments. -
Survey on Reinforcement Learning for Language Processing
Survey on reinforcement learning for language processing V´ıctorUc-Cetina1, Nicol´asNavarro-Guerrero2, Anabel Martin-Gonzalez1, Cornelius Weber3, Stefan Wermter3 1 Universidad Aut´onomade Yucat´an- fuccetina, [email protected] 2 Aarhus University - [email protected] 3 Universit¨atHamburg - fweber, [email protected] February 2021 Abstract In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language process- ing tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use for different problems of natural language processing, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in natural language processing that might benefit from reinforcement learning. Keywords| reinforcement learning, natural language processing, conversational systems, pars- ing, translation, text generation 1 Introduction Machine learning algorithms have been very successful to solve problems in the natural language pro- arXiv:2104.05565v1 [cs.CL] 12 Apr 2021 cessing (NLP) domain for many years, especially supervised and unsupervised methods. However, this is not the case with reinforcement learning (RL), which is somewhat surprising since in other domains, reinforcement learning methods have experienced an increased level of success with some impressive results, for instance in board games such as AlphaGo Zero [106]. -
Monte-Carlo Tree Search As Regularized Policy Optimization
Monte-Carlo tree search as regularized policy optimization Jean-Bastien Grill * 1 Florent Altche´ * 1 Yunhao Tang * 1 2 Thomas Hubert 3 Michal Valko 1 Ioannis Antonoglou 3 Remi´ Munos 1 Abstract AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver The combination of Monte-Carlo tree search et al., 2016). Recent MCTS-based MuZero (Schrittwieser (MCTS) with deep reinforcement learning has et al., 2019) has also led to state-of-the-art results in the led to significant advances in artificial intelli- Atari benchmarks (Bellemare et al., 2013). gence. However, AlphaZero, the current state- of-the-art MCTS algorithm, still relies on hand- Our main contribution is connecting MCTS algorithms, crafted heuristics that are only partially under- in particular the highly-successful AlphaZero, with MPO, stood. In this paper, we show that AlphaZero’s a state-of-the-art model-free policy-optimization algo- search heuristics, along with other common ones rithm (Abdolmaleki et al., 2018). Specifically, we show that such as UCT, are an approximation to the solu- the empirical visit distribution of actions in AlphaZero’s tion of a specific regularized policy optimization search procedure approximates the solution of a regularized problem. With this insight, we propose a variant policy-optimization objective. With this insight, our second of AlphaZero which uses the exact solution to contribution a modified version of AlphaZero that comes this policy optimization problem, and show exper- significant performance gains over the original algorithm, imentally that it reliably outperforms the original especially in cases where AlphaZero has been observed to algorithm in multiple domains. -
Multi-Agent Reinforcement Learning: a Review of Challenges and Applications
applied sciences Review Multi-Agent Reinforcement Learning: A Review of Challenges and Applications Lorenzo Canese †, Gian Carlo Cardarilli, Luca Di Nunzio †, Rocco Fazzolari † , Daniele Giardino †, Marco Re † and Sergio Spanò *,† Department of Electronic Engineering, University of Rome “Tor Vergata”, Via del Politecnico 1, 00133 Rome, Italy; [email protected] (L.C.); [email protected] (G.C.C.); [email protected] (L.D.N.); [email protected] (R.F.); [email protected] (D.G.); [email protected] (M.R.) * Correspondence: [email protected]; Tel.: +39-06-7259-7273 † These authors contributed equally to this work. Abstract: In this review, we present an analysis of the most used multi-agent reinforcement learn- ing algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the most critical issues that must be taken into account in their extension to multi-agent scenarios. The analyzed algorithms were grouped according to their features. We present a detailed taxonomy of the main multi-agent approaches proposed in the literature, focusing on their related mathematical models. For each algorithm, we describe the possible application fields, while pointing out its pros and cons. The described multi-agent algorithms are compared in terms of the most important char- acteristics for multi-agent reinforcement learning applications—namely, nonstationarity, scalability, and observability. We also describe the most common benchmark environments used to evaluate the Citation: Canese, L.; Cardarilli, G.C.; performances of the considered methods. Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-Agent Keywords: machine learning; reinforcement learning; multi-agent; swarm Reinforcement Learning: A Review of Challenges and Applications. -
Combining Off and On-Policy Training in Model-Based Reinforcement Learning
Combining Off and On-Policy Training in Model-Based Reinforcement Learning Alexandre Borges Arlindo L. Oliveira INESC-ID / Instituto Superior Técnico INESC-ID / Instituto Superior Técnico University of Lisbon University of Lisbon Lisbon, Portugal Lisbon, Portugal ABSTRACT factor and game length. AlphaGo [1] was the first computer pro- The combination of deep learning and Monte Carlo Tree Search gram to beat a professional human player in the game of Go by (MCTS) has shown to be effective in various domains, such as combining deep neural networks with tree search. In the first train- board and video games. AlphaGo [1] represented a significant step ing phase, the system learned from expert games, but a second forward in our ability to learn complex board games, and it was training phase enabled the system to improve its performance by rapidly followed by significant advances, such as AlphaGo Zero self-play using reinforcement learning. AlphaGoZero [2] learned [2] and AlphaZero [3]. Recently, MuZero [4] demonstrated that only through self-play and was able to beat AlphaGo soundly. Alp- it is possible to master both Atari games and board games by di- haZero [3] improved on AlphaGoZero by generalizing the model rectly learning a model of the environment, which is then used to any board game. However, these methods were designed only with Monte Carlo Tree Search (MCTS) [5] to decide what move for board games, specifically for two-player zero-sum games. to play in each position. During tree search, the algorithm simu- Recently, MuZero [4] improved on AlphaZero by generalizing lates games by exploring several possible moves and then picks the it even more, so that it can learn to play singe-agent games while, action that corresponds to the most promising trajectory. -
Efficiently Mastering the Game of Nogo with Deep Reinforcement
electronics Article Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge Yifan Gao 1,*,† and Lezhou Wu 2,† 1 College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China 2 College of Information Science and Engineering, Northeastern University, Liaoning 110819, China; [email protected] * Correspondence: [email protected] † These authors contributed equally to this work. Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for the AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained by 20,000 self-play games’ data (small in quantity compared with Citation: Gao, Y.; Wu, L. Efficiently 120,000 self-play games’ data consumed by the original AlphaZero). The NoGo game program based Mastering the Game of NoGo with on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with Deep Reinforcement Learning limited resources, defeating many AlphaZero-based programs. -
ELF Opengo: an Analysis and Open Reimplementation of Alphazero
ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero Yuandong Tian 1 Jerry Ma * 1 Qucheng Gong * 1 Shubho Sengupta * 1 Zhuoyuan Chen 1 James Pinkerton 1 C. Lawrence Zitnick 1 Abstract However, these advances in playing ability come at signifi- The AlphaGo, AlphaGo Zero, and AlphaZero cant computational expense. A single training run requires series of algorithms are remarkable demonstra- millions of selfplay games and days of training on thousands tions of deep reinforcement learning’s capabili- of TPUs, which is an unattainable level of compute for the ties, achieving superhuman performance in the majority of the research community. When combined with complex game of Go with progressively increas- the unavailability of code and models, the result is that the ing autonomy. However, many obstacles remain approach is very difficult, if not impossible, to reproduce, in the understanding of and usability of these study, improve upon, and extend. promising approaches by the research commu- In this paper, we propose ELF OpenGo, an open-source nity. Toward elucidating unresolved mysteries reimplementation of the AlphaZero (Silver et al., 2018) and facilitating future research, we propose ELF algorithm for the game of Go. We then apply ELF OpenGo OpenGo, an open-source reimplementation of the toward the following three additional contributions. AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate First, we train a superhuman model for ELF OpenGo. Af- superhuman performance with a perfect (20:0) ter running our AlphaZero-style training software on 2,000 record against global top professionals. We ap- GPUs for 9 days, our 20-block model has achieved super- ply ELF OpenGo to conduct extensive ablation human performance that is arguably comparable to the 20- studies, and to identify and analyze numerous in- block models described in Silver et al.(2017) and Silver teresting phenomena in both the model training et al.(2018). -
Alpha Zero Paper
RESEARCH COMPUTER SCIENCE programmers, combined with a high-performance alpha-beta search that expands a vast search tree by using a large number of clever heuristics and domain-specific adaptations. In (10) we describe A general reinforcement learning these augmentations, focusing on the 2016 Top Chess Engine Championship (TCEC) season 9 algorithm that masters chess, shogi, world champion Stockfish (11); other strong chess programs, including Deep Blue, use very similar and Go through self-play architectures (1, 12). In terms of game tree complexity, shogi is a substantially harder game than chess (13, 14): It David Silver1,2*†, Thomas Hubert1*, Julian Schrittwieser1*, Ioannis Antonoglou1, is played on a larger board with a wider variety of Matthew Lai1, Arthur Guez1, Marc Lanctot1, Laurent Sifre1, Dharshan Kumaran1, pieces; any captured opponent piece switches Thore Graepel1, Timothy Lillicrap1, Karen Simonyan1, Demis Hassabis1† sides and may subsequently be dropped anywhere on the board. The strongest shogi programs, such Thegameofchessisthelongest-studieddomainin the history of artificial intelligence. as the 2017 Computer Shogi Association (CSA) The strongest programs are based on a combination of sophisticated search techniques, world champion Elmo, have only recently de- domain-specific adaptations, and handcrafted evaluation functions that have been refined feated human champions (15). These programs by human experts over several decades. By contrast, the AlphaGo Zero program recently use an algorithm similar to those used by com- achieved superhuman performance in the game of Go by reinforcement learning from self-play. puter chess programs, again based on a highly In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve Downloaded from optimized alpha-beta search engine with many superhuman performance in many challenging games. -
On Building Generalizable Learning Agents
On Building Generalizable Learning Agents Yi Wu Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2020-9 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-9.html January 10, 2020 Copyright © 2020, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. On Building Generalizable Learning Agents by Yi Wu A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Stuart Russell, Chair Professor Pieter Abbeel Associate Professor Javad Lavaei Assistant Professor Aviv Tamar Fall 2019 On Building Generalizable Learning Agents Copyright 2019 by Yi Wu 1 Abstract On Building Generalizable Learning Agents by Yi Wu Doctor of Philosophy in Computer Science University of California, Berkeley Professor Stuart Russell, Chair It has been a long-standing goal in Artificial Intelligence (AI) to build machines that can solve tasks that humans can. Thanks to the recent rapid progress in data-driven methods, which train agents to solve tasks by learning from massive training data, there have been many successes in applying such learning approaches to handle and even solve a number of extremely challenging tasks, including image classification, language generation, robotics control, and several complex multi-player games. -
Introduction to Deep Reinforcement Learning
Parallel & Scalable Machine Learning Introduction to Machine Learning Algorithms Dr. Jenia Jitsev Head of Cross-Sectional Team Deep Learning (CST-DL) Scientific Lead Helmholtz AI Local (HAICU Local) Institute for Advanced Simulation (IAS) Juelich Supercomputing Center (JSC) @Jenia Jitsev LECTURE 9 Introduction to Deep Reinforcement Learning Feb 19th, 2020 JSC, Germany Machine Learning: Forms of Learning ● Supervised learning: correct responses Y for input data X are given; → Y “teacher” signal, correct “outcomes”, “labels” for the data X - usually : estimate unknown f: X→Y, y = f(x; W) - classical frameworks: classification, regression ● Unsupervised learning: only data X is given - find “hidden” structure, patterns in the data; - in general, estimate unknown probability density p(X) - in general : find a model that underlies / generates X - broad class of latent (“hidden”) variable models - classical frameworks: clustering, dimensionality reduction (e.g PCA) ● Reinforcement learning: data X, including (sparse) reward r(X) - discover actions a that maximize total future reward R t f a - active learning: experience X depends on choice of a h c s n i e - estimate p(a|X), p(r|X), V(X), Q(X,a) – future reward predictors m e G - z t l o h m l ● e H For all holds: r e d n i - Define a loss L(D,W), optimize by tuning free parameters W d e i l g t i M Deep Neural Networks: Forms of Learning ● Supervised learning: correct responses Y for input data X are given - find unknown f: y = f(x;W) or density p(Y|X;W) for data (x,y) - Deep CNN for visual object recognition (e.g, Inception, ResNet, ...) ● Unsupervised learning: only data X is given - general setting: estimate unknown density p(X;W) - find a model that underlies / generates X - broad class of latent (“hidden”) variable models - Variational Autoencoder (VAE), data generation and inference - Generative Adversarial Networks (GANs), data generation - Autoregressive models : PixelRNN, PixelCNN, .. -
On the Role of Planning in Model-Based Deep
Published as a conference paper at ICLR 2021 ON THE ROLE OF PLANNING IN MODEL-BASED DEEP REINFORCEMENT LEARNING Jessica B. Hamrick,∗ Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Velickoviˇ c,´ Theophane´ Weber∗ DeepMind, London, UK ABSTRACT Model-based planning is often thought to be necessary for deep, careful reason- ing and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strength- ened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve gen- eralization? To answer these questions, we study the performance of MuZero [58], a state-of-the-art MBRL algorithm with strong connections and overlapping com- ponents with many other MBRL algorithms. We perform a number of interven- tions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.