Introduction to Deep Reinforcement Learning


Parallel & Scalable Machine Learning – Introduction to Machine Learning Algorithms
Dr. Jenia Jitsev (@Jenia Jitsev)
Head of Cross-Sectional Team Deep Learning (CST-DL), Scientific Lead Helmholtz AI Local (HAICU Local)
Institute for Advanced Simulation (IAS), Juelich Supercomputing Center (JSC)
LECTURE 9: Introduction to Deep Reinforcement Learning, Feb 19th, 2020, JSC, Germany

Machine Learning: Forms of Learning
● Supervised learning: correct responses Y for the input data X are given; Y is the "teacher" signal, the correct "outcomes" or "labels" for the data X
  - usually: estimate an unknown f: X→Y, y = f(x; W)
  - classical frameworks: classification, regression
● Unsupervised learning: only the data X is given
  - find "hidden" structure and patterns in the data
  - in general: estimate the unknown probability density p(X), i.e. find a model that underlies / generates X
  - broad class of latent ("hidden") variable models
  - classical frameworks: clustering, dimensionality reduction (e.g. PCA)
● Reinforcement learning: data X, including a (sparse) reward r(X)
  - discover actions a that maximize the total future reward R
  - active learning: the experience X depends on the choice of a
  - estimate p(a|X), p(r|X), V(X), Q(X,a), i.e. predictors of future reward
→ For all three holds: define a loss L(D,W) and optimize it by tuning the free parameters W

Deep Neural Networks: Forms of Learning
● Supervised learning: correct responses Y for the input data X are given
  - find the unknown f: y = f(x;W) or density p(Y|X;W) for data (x,y)
  - deep CNNs for visual object recognition (e.g. Inception, ResNet, ...)
● Unsupervised learning: only the data X is given
  - general setting: estimate the unknown density p(X;W); find a model that underlies / generates X
  - broad class of latent ("hidden") variable models
  - Variational Autoencoders (VAE): data generation and inference
  - Generative Adversarial Networks (GAN): data generation
  - autoregressive models: PixelRNN, PixelCNN, ...
● Reinforcement learning: data X, including a (sparse) reward r(X)
  - find predictors R = f(X;W) that estimate the total future reward R(X)
  - Deep Q-Learning Network (DQN – Atari games: Breakout, ...)
  - deep actor-critic networks (A3C, PPO, SAC, ...)
  - parts of AlphaGo; AlphaZero; AlphaFold, ...
→ For all three holds: define a loss L(x,y,W) and optimize it by tuning the parameters W
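To make "total future reward R" concrete: a reward predictor is trained on targets of the form R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ..., the discounted return. The following minimal NumPy sketch (an added illustration, not code from the lecture) turns a recorded reward sequence into such targets:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Total future reward R_t = sum_k gamma^k * r_{t+k} for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse reward: nothing until the final step of the episode.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards))  # targets a value predictor V(X) could regress on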
Deep Reinforcement Learning: breakthroughs
● Deep Q-Network (DQN) playing Atari games from raw pixels (Mnih et al., Nature, 2015)
● March 2016: DeepMind's AlphaGo wins 4:1 against the 18-time Go world champion Lee Sedol (Silver et al., Nature, 2016)
● AlphaGo: surpassing the previous state of the art by a dramatic margin, learning a function instead of hard-wiring it through previous insight (Silver et al., Nature, 2017)

"Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games." (Silver et al., Nature, 2017)

● Learning from self-play only – reward for winning games
● Human play data is no longer necessary for training

Version        | Hardware                  | Elo rating | Matches
AlphaGo Fan    | 176 GPUs, distributed     | 3,144      | 5:0 against Fan Hui
AlphaGo Lee    | 48 TPUs, distributed      | 3,739      | 4:1 against Lee Sedol
AlphaGo Master | 4 TPUs v2, single machine | 4,858      | 60:0 against professional players (Future of Go Summit)
AlphaGo Zero   | 4 TPUs v2, single machine | 5,185      | 100:0 against AlphaGo Lee; 89:11 against AlphaGo Master
(Silver et al., Nature, 2017)
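The "self-play only" point above means the learning signal is nothing more than the game outcome. As a hedged sketch of that data-generation step (generic Python, not DeepMind's pipeline; policy, legal_moves, apply and winner are hypothetical helpers for some two-player game):

import random

def self_play_episode(policy, initial_state, legal_moves, apply, winner):
    """Play one game of the current policy against itself and return
    (state, player, outcome) training tuples; outcome is +1 for a win,
    -1 for a loss from that player's point of view."""
    history, state, player = [], initial_state, 0
    while winner(state) is None:
        history.append((state, player))
        moves = legal_moves(state)
        move = random.choices(moves, weights=[policy(state, m) for m in moves])[0]
        state = apply(state, move)
        player = 1 - player
    w = winner(state)  # 0, 1, or "draw"
    return [(s, p, 0.0 if w == "draw" else (1.0 if w == p else -1.0))
            for s, p in history]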
Deep Reinforcement Learning: breakthroughs
● AlphaZero-style deep neural networks for generic control / optimization problems (e.g. chemical synthesis, ...): suitable for any type of problem with state transitions in which some states are desired ("good", "correct") and others undesired ("bad", "incorrect") outcomes – chemistry as a game (Silver et al., Nature, 2017; Segler et al., Nature, 2018)
● AlphaFold: protein structure prediction (Evans et al., 2018)
● AlphaStar: StarCraft II (Vinyals et al., Nature, 2019)

Deep RL: JURECA
Copy the DeepRL folder:
  cp -r /p/project/training2001/practicals_DeepRL /p/project/training2001/$USER/
Prepare the environment:
  cd /p/project/training2001/$USER/DeepRL
  source ./scripts/load_module_DeepRL.sh
  source ./scripts/init_env_DeepRL.sh
  source ./scripts/load_env_DeepRL.sh
  source ./scripts/install_packages_DeepRL.sh
Testing the OpenAI Gym environment (https://gym.openai.com):
  python ./Test/test_opengym_CartPole.py
  python ./Test/test_opengym_pong_breakout.py
(the variables render, steps and episodes may be adapted accordingly)
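The course's own test scripts are not reproduced here; as a rough stand-in, assuming the classic pre-0.26 gym API that was current in early 2020, a minimal check of the Gym installation could look like this, with render, steps and episodes exposed as in the scripts above:

import gym  # OpenAI Gym, as used in the practicals (pre-0.26 API)

env = gym.make("CartPole-v1")
episodes, steps, render = 3, 200, False  # the knobs the course scripts expose

for episode in range(episodes):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        if render:
            env.render()
        action = env.action_space.sample()        # random policy, just to test the setup
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    print(f"episode {episode}: return {total_reward}")

env.close()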
Deep Neural Networks: Forms of Learning
● Supervised learning: find the unknown f: y = f(x) for data (x,y)
● For each input x, a correct example y is provided
● Define a loss L(x,y,W) over the observed data and optimize it by tuning the parameters W (e.g. gradient descent adapting W to minimize the loss)
  - network with weights W1, ..., Wk; unknown function Y = f(X;W); probabilistic view: unknown density P(Y|X,W)
● The mapping can also be learned through latent ("hidden") variables Z1, ..., Zk by adapting W
● However: delivering data with an explicit, specific teacher signal is often implausible or impractical (expensive labeling is required)
● What if the correct output y for an input x to the network is unknown? What are the desired, relevant f(x) in this case anyway?

Reinforcement Learning
● Reinforcement learning: use the available sensory input x and the relevant consequences of self-generated actions to define a loss L(x,W)
  - use "reward/punishment" signals ("good"/"bad") to define the loss L(x,W)
● Prediction-error driven learning: use the mismatch between own predictions f(x) (about states, rewards, etc.) and incoming observations to update beliefs/expectations

Reinforcement Learning (RL)
(Barto et al., 1983; Niv et al., 2006)
● The agent/controller receives sensory input X (state s) and reward r from the environment/task and emits actions a; it maintains the estimates π(s,a), V(s), Q(s,a)
● MDP: Markov Decision Process <S, A, P, R, γ>
● Objective: "solve" the MDP towards an optimal "policy" π(s,a), i.e. getting maximal reward in the given environment
● For each state s, select a response, an action a, that is "optimal"
● Estimating Q(s,a), π(s,a), V(s) would provide a full solution

Deep Reinforcement Learning
● What if the correct output y for inputs x (through a network with weights W1, ..., Wk and latent variables Z1, ..., Zk) is unknown?
● Use the notion of "good" and "bad" outcomes: "reward/punishment" signals
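To make the Q(s,a), V(s), π(s,a) triple concrete, here is a minimal tabular Q-learning sketch on a toy chain MDP (an added illustration, not the lecture's code); deep RL such as DQN replaces the Q table with a neural network.

import numpy as np

# A tiny toy MDP: 3 states in a chain, 2 actions (0 = left, 1 = right);
# reaching the last state yields reward 1 and ends the episode.
n_states, n_actions, gamma, alpha, eps = 3, 2, 0.9, 0.1, 0.1

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
        s = s_next

V = Q.max(axis=1)          # state values V(s)
policy = Q.argmax(axis=1)  # greedy policy pi(s)
print(Q, V, policy, sep="\n")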
Recommended publications
  • Game Changer
    Matthew Sadler and Natasha Regan, Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI, New In Chess, 2019.
    Contents: Explanation of symbols; Foreword by Garry Kasparov; Introduction by Demis Hassabis; Preface; Introduction.
    Part I: AlphaZero's history
      Chapter 1: A quick tour of computer chess competition
      Chapter 2: ZeroZeroZero
      Chapter 3: Demis Hassabis, DeepMind and AI
    Part II: Inside the box
      Chapter 4: How AlphaZero thinks
      Chapter 5: AlphaZero's style – meeting in the middle
    Part III: Themes in AlphaZero's play
      Chapter 6: Introduction to our selected AlphaZero themes
      Chapter 7: Piece mobility: outposts
      Chapter 8: Piece mobility: activity
      Chapter 9: Attacking the king: the march of the rook's pawn
      Chapter 10: Attacking the king: colour complexes
      Chapter 11: Attacking the king: sacrifices for time, space and damage
      Chapter 12: Attacking the king: opposite-side castling
      Chapter 13: Attacking the king: defence
    Part IV: AlphaZero's ...
  • Monte-Carlo Tree Search As Regularized Policy Optimization
    Monte-Carlo tree search as regularized policy optimization
    Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos
    Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.
    AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016); the more recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results on the Atari benchmarks (Bellemare et al., 2013). Our main contribution is connecting MCTS algorithms, in particular the highly successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that achieves significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to ...
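As a rough numerical illustration of the connection described above (a generic KL-regularized policy improvement step in the spirit of MPO, not the exact closed form derived in the paper): maximizing q·π - λ·KL(π‖π_θ) over the simplex gives π(a) ∝ π_θ(a)·exp(q(a)/λ), which the visit distribution produced by search can be viewed as approximating.

import numpy as np

def regularized_policy(prior, q, lam):
    """Solve max_pi <q, pi> - lam * KL(pi || prior) over the probability simplex.
    Closed form: pi(a) proportional to prior(a) * exp(q(a) / lam)."""
    logits = np.log(prior) + q / lam
    logits -= logits.max()            # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

prior = np.array([0.5, 0.3, 0.2])     # network policy pi_theta
q = np.array([0.1, 0.4, 0.0])         # action-value estimates from search
print(regularized_policy(prior, q, lam=0.5))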
  • Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge
    Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge
    Yifan Gao (College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China) and Lezhou Wu (College of Information Science and Engineering, Northeastern University, Liaoning 110819, China); correspondence: [email protected]; these authors contributed equally to this work.
    Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for the AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained by 20,000 self-play games' data (small in quantity compared with the 120,000 self-play games' data consumed by the original AlphaZero). The NoGo game program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs.
  • ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
    Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, C. Lawrence Zitnick
    Abstract: The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training ...
    However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend. In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions. First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018).
  • Alpha Zero Paper
    A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis
    Abstract: The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games.
    ... programmers, combined with a high-performance alpha-beta search that expands a vast search tree by using a large number of clever heuristics and domain-specific adaptations. In (10) we describe these augmentations, focusing on the 2016 Top Chess Engine Championship (TCEC) season 9 world champion Stockfish (11); other strong chess programs, including Deep Blue, use very similar architectures (1, 12). In terms of game tree complexity, shogi is a substantially harder game than chess (13, 14): it is played on a larger board with a wider variety of pieces; any captured opponent piece switches sides and may subsequently be dropped anywhere on the board. The strongest shogi programs, such as the 2017 Computer Shogi Association (CSA) world champion Elmo, have only recently defeated human champions (15). These programs use an algorithm similar to those used by computer chess programs, again based on a highly optimized alpha-beta search engine with many ...
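For orientation, the self-play search that replaces the handcrafted evaluation functions above is usually described as PUCT-guided Monte-Carlo tree search; a sketch of the commonly cited selection rule (the node statistics N, Q and the network prior P are assumed to be maintained by the surrounding search code):

import math

def select_action(N, Q, P, c_puct=1.5):
    """PUCT-style selection as commonly described for AlphaZero-like search:
    pick argmax_a Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    N, Q, P are dicts mapping the current node's actions to visit counts,
    mean values, and prior probabilities."""
    total_visits = sum(N.values())
    return max(P, key=lambda a: Q.get(a, 0.0)
               + c_puct * P[a] * math.sqrt(total_visits) / (1 + N.get(a, 0)))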
  • Tackling Morpion Solitaire with AlphaZero-like Ranked Reward Reinforcement Learning
    Tackling Morpion Solitaire with AlphaZero-like Ranked Reward Reinforcement Learning
    Hui Wang, Mike Preuss, Michael Emmerich and Aske Plaat, Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands; email: [email protected]
    Abstract: Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go), traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported for about a decade. In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the state space is sparse: there are few win/loss signals.
    AlphaGo and AlphaZero combine deep neural networks [9] and Monte Carlo Tree Search (MCTS) [10] in a self-play framework that learns by curriculum learning [11]. Unfortunately, these approaches can not be directly used to play single agent combinatorial games, such as travelling salesman problems (TSP) [12] and bin packing problems (BPP) [13], where cost minimization is the goal of the game. To apply self-play to single player games, Laterre et al. proposed a Ranked Reward (R2) algorithm. R2 creates a relative performance metric by means of ranking the rewards obtained ...
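The Ranked Reward idea mentioned in the last sentence turns a single-player score into a binary, self-play-like win/loss signal by ranking it against recent episodes; a hedged sketch of the general mechanism (an added illustration, not the authors' exact rule):

import numpy as np

def ranked_reward(episode_score, recent_scores, alpha=0.75):
    """Compare an episode's raw score against the alpha-percentile of recently
    observed scores: above the threshold counts as a 'win' (+1), below as a
    'loss' (-1); ties are broken randomly to avoid a degenerate signal."""
    threshold = np.percentile(recent_scores, alpha * 100)
    if episode_score > threshold:
        return 1.0
    if episode_score < threshold:
        return -1.0
    return float(np.random.choice([1.0, -1.0]))

print(ranked_reward(64, recent_scores=[50, 55, 60, 62, 70]))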
  • Chess Fortresses, a Causal Test for State of the Art Symbolic[Neuro] Architectures
    Chess fortresses, a causal test for state of the art Symbolic[Neuro] architectures
    Hedinn Steingrimsson, Electrical and Computer Engineering, Rice University, Houston, TX 77005; [email protected]
    Abstract: The AI community's growing interest in causality is motivating the need for benchmarks that test the limits of neural network based probabilistic reasoning and state of the art search algorithms. In this paper we present such a benchmark, chess fortresses. To make the benchmarking task as challenging as possible, we compile a test set of positions in which the defender has a unique best move for entering a fortress defense formation. The defense formation can be invariant to adding a certain type of chess piece to the attacking side, thereby defying the laws of probabilistic reasoning. The new dataset enables efficient testing of move prediction since every experiment yields a conclusive outcome, which is a significant improvement over traditional methods. We carry out extensive, large scale tests of the convolutional neural networks of Leela Chess Zero [1], an open source chess engine built on the same principles as AlphaZero (AZ), which has passed AZ in chess playing ability according to many rating lists.
    ... (which can only move on white squares) does not contribute to battling the key g5 square (which is on a black square), which is sufficiently protected by the white pieces. These characteristics beat the laws of probabilistic reasoning, where extra material typically means increased chances of winning the game. This feature makes the new benchmark especially challenging for neural based architectures. Our goal in this paper is to provide a new benchmark of logical nature, aimed at being challenging or a hard class, which modern architectures can be measured against. This new dataset is provided in the Supplement 1. Exploring hard classes has proven to be fruitful ground for fundamental architectural improvements in the past as shown by Olga Russakovsky's ...
  • Accelerating and Improving AlphaZero Using Population Based Training
    Accelerating and Improving AlphaZero Using Population Based Training
    Ti-Rong Wu, Ting-Han Wei, I-Chen Wu (Department of Computer Science, National Chiao Tung University, Taiwan; Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan; Department of Computing Science, University of Alberta, Edmonton, Canada); {kds285, ting, icwu}@aigames.nctu.edu.tw. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
    Abstract: AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost since each hyperparameter configuration requires its own time to train one run, during which it will generate its own self-play records. As a result, multiple runs are usually needed for different hyperparameter configurations. This paper proposes using population based training (PBT) to help tune hyperparameters dynamically and improve strength during training time. Another significant advantage is that this method requires a single ...
    ... a reimplementation of the AlphaGo Zero/AlphaZero algorithm, Facebook AI Research is also moving ahead with the more general ELF project (Tian et al. 2019), which is aimed at covering a wider range of games for reinforcement learning research. In late August, 2019, DeepMind also announced the OpenSpiel framework, with the goal of incorporating various games with different properties (Lanctot et al. 2019). Given the continued interest in using games as a reinforcement learning environment, there are still issues that need to be resolved even with a powerful algorithm such as AlphaZero.
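Population based training, as referenced above, periodically lets weaker members of a population copy the weights and hyperparameters of stronger ones and then perturb those hyperparameters; a minimal generic sketch of that exploit/explore step (an illustration, not the paper's implementation):

import copy
import random

def pbt_exploit_explore(population, perturb=(0.8, 1.2)):
    """population: list of dicts with keys 'score', 'weights', 'hyperparams'.
    The bottom quarter copies a member of the top quarter (exploit) and then
    jitters each numeric hyperparameter (explore)."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    quarter = max(1, len(ranked) // 4)
    top, bottom = ranked[:quarter], ranked[-quarter:]
    for member in bottom:
        source = random.choice(top)
        member["weights"] = copy.deepcopy(source["weights"])
        member["hyperparams"] = {k: v * random.choice(perturb)
                                 for k, v in source["hyperparams"].items()}
    return population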
  • A General Reinforcement Learning Algorithm That Masters Chess, Shogi and Go Through Self-Play
    A general reinforcement learning algorithm that masters chess, shogi and Go through self-play
    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (DeepMind, 6 Pancras Square, London N1C 4AG; University College London, Gower Street, London WC1E 6BT)
    Abstract: The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess) as well as Go.
    The study of computer chess is as old as computer science itself. Charles Babbage, Alan Turing, Claude Shannon, and John von Neumann devised hardware, algorithms and theory to analyse and play the game of chess. Chess subsequently became a grand challenge task for a generation of artificial intelligence researchers, culminating in high-performance computer chess programs that play at a super-human level (1, 2).
  • Playing Nondeterministic Games Through Planning with a Learned Model
    Playing Nondeterministic Games through Planning with a Learned Model
    Anonymous authors; under review as a conference paper at ICLR 2021 (double-blind).
    Abstract: The MuZero algorithm is known for achieving high-level performance on traditional zero-sum two-player games of perfect information such as chess, Go, and shogi, as well as visual, non-zero sum, single-player environments such as the Atari suite. Despite lacking a perfect simulator and employing a learned model of environmental dynamics, MuZero produces game-playing agents comparable to its predecessor, AlphaZero. However, the current implementation of MuZero is restricted only to deterministic environments. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero for nondeterministic, two-player, zero-sum games of perfect information. Borrowing from Nondeterministic Monte Carlo Tree Search and the theory of extensive-form games, NDMZ formalizes chance as a player in the game and incorporates the chance player into the MuZero network architecture and tree search. Experiments show that NDMZ is capable of learning effective strategies and an accurate model of the game.
    1 Introduction: While the AlphaZero algorithm achieved superhuman performance in a variety of challenging domains, it relies upon a perfect simulation of the environment dynamics to perform precision planning. MuZero, the newest member of the AlphaZero family, combines the advantages of planning with a learned model of its environment, allowing it to tackle problems such as the Atari suite without the advantage of a simulator. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero to stochastic, two-player, zero-sum games of perfect information.
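The "chance as a player" idea above can be pictured as a small change to tree descent: at nodes where chance is to move, a child is sampled from a learned chance distribution rather than selected greedily. A generic sketch under that reading (not the paper's code; the node dictionaries and helper callables are assumptions):

import random

def descend(node, select_action, chance_distribution):
    """One step of tree descent where 'chance' is treated as an extra player.
    select_action: PUCT-style choice for the two regular players.
    chance_distribution: learned/estimated probabilities over chance outcomes."""
    if node["to_play"] == "chance":
        outcomes, probs = zip(*chance_distribution(node).items())
        return node["children"][random.choices(outcomes, weights=probs)[0]]
    return node["children"][select_action(node)]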
  • The Impact of Artificial Intelligence on the Chess World
    JMIR Serious Games, Viewpoint: The Impact of Artificial Intelligence on the Chess World
    Delia Monica Duca Iliescu, BSc, MSc, Transilvania University of Brasov, Brasov, Romania. Corresponding author: Delia Monica Duca Iliescu, Bdul Eroilor 29, Brasov, Romania; phone: 40 268413000; email: [email protected]
    Abstract: This paper focuses on key areas in which artificial intelligence has affected the chess world, including cheat detection methods, which are especially necessary recently, as there has been an unexpected rise in the popularity of online chess. Many major chess events that were to take place in 2020 have been canceled, but the global popularity of chess has in fact grown in recent months due to easier conversion of the game from offline to online formats compared with other games. Still, though a game of chess can be easily played online, there are some concerns about the increased chances of cheating. Artificial intelligence can address these concerns. (JMIR Serious Games 2020;8(4):e24049) doi: 10.2196/24049
    Keywords: artificial intelligence; games; chess; AlphaZero; MuZero; cheat detection; coronavirus
    Introduction: ... strategic games, has affected many other areas of interest, as has already been seen in recent years. All major chess events that were to take place in the second half of 2020 have been canceled, including the 44th Chess Olympiad and the match for the title of World Chess Champion. However, ... One of the most famous human versus machine events was the 1997 victory of Deep Blue, an IBM chess software, against the famous chess champion Garry Kasparov [3].
  • Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver (DeepMind, 6 Pancras Square, London N1C 4AG; University College London, Gower Street, London WC1E 6BT)
    Abstract: Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games, the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled, our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
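The "quantities most directly relevant to planning" above are produced by three learned functions, commonly described as a representation function h, a dynamics function g and a prediction function f; the sketch below shows how such components could be composed when unrolling a hypothetical action sequence (a schematic illustration with stand-in callables, not DeepMind's implementation):

from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class MuZeroStyleModel:
    # Stand-in callables for the three learned networks described in the paper.
    representation: Callable[[Sequence[float]], Sequence[float]]               # h: observation -> hidden state
    dynamics: Callable[[Sequence[float], int], Tuple[float, Sequence[float]]]  # g: (state, action) -> (reward, next state)
    prediction: Callable[[Sequence[float]], Tuple[Sequence[float], float]]     # f: state -> (policy logits, value)

    def unroll(self, observation, actions):
        """Plan 'in latent space': encode the observation once, then roll the
        learned dynamics forward along a hypothetical action sequence."""
        state = self.representation(observation)
        trajectory = []
        for action in actions:
            policy_logits, value = self.prediction(state)
            reward, state = self.dynamics(state, action)
            trajectory.append((policy_logits, value, reward))
        return trajectory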