Monte-Carlo Tree Search As Regularized Policy Optimization

Total Pages: 16

File Type: PDF, Size: 1020 KB

Monte-Carlo Tree Search as Regularized Policy Optimization

Jean-Bastien Grill*¹, Florent Altché*¹, Yunhao Tang*¹², Thomas Hubert³, Michal Valko¹, Ioannis Antonoglou³, Rémi Munos¹

*Equal contribution. ¹DeepMind, Paris, France. ²Columbia University, New York, USA. ³DeepMind, London, UK. Correspondence to: Jean-Bastien Grill <[email protected]>.

arXiv:2007.12509v1 [cs.LG] 24 Jul 2020. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains.

1. Introduction

Policy gradient is at the core of many state-of-the-art deep reinforcement learning (RL) algorithms. Among many successive improvements to the original algorithm (Sutton et al., 2000), regularized policy optimization encompasses a large family of such techniques; among them, trust region policy optimization is a prominent example (Schulman et al., 2015; 2017; Abdolmaleki et al., 2018; Song et al., 2019). These algorithmic enhancements have led to significant performance gains in various benchmark domains (Song et al., 2019).

As another successful RL framework, the AlphaZero family of algorithms (Silver et al., 2016; 2017b;a; Schrittwieser et al., 2019) has obtained groundbreaking results on challenging domains by combining classical deep learning (He et al., 2016) and RL (Williams, 1992) techniques with Monte-Carlo tree search (Kocsis and Szepesvári, 2006). To search efficiently, the MCTS action selection criterion takes inspiration from bandits (Auer, 2002). Interestingly, AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016). The recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results on the Atari benchmarks (Bellemare et al., 2013).

Our main contribution is connecting MCTS algorithms, in particular the highly successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that comes with significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to fail, e.g., when per-search simulation budgets are low (Hamrick et al., 2020).

In Section 2, we briefly present MCTS with a focus on AlphaZero and provide a short summary of model-free policy optimization. In Section 3, we show that AlphaZero (and many other MCTS algorithms) computes approximate solutions to a family of regularized policy optimization problems. With this insight, Section 4 introduces a modified version of AlphaZero which leverages the benefits of the policy optimization formalism to improve upon the original algorithm. Finally, Section 5 shows that this modified algorithm outperforms AlphaZero on Atari games and continuous control tasks.

2. Background

Consider a standard RL setting tied to a Markov decision process (MDP) with state space X and action space A. At a discrete round t ≥ 0, the agent in state x_t ∈ X takes action a_t ∈ A given a policy a_t ∼ π(·|x_t), receives reward r_t, and transitions to a next state x_{t+1} ∼ p(·|x_t, a_t). The RL problem consists in finding a policy which maximizes the discounted cumulative return E_π[Σ_{t≥0} γ^t r_t] for a discount factor γ ∈ (0, 1). To scale the method to large environments, we assume that the policy π_θ(a|x) is parameterized by a neural network θ.

2.1. AlphaZero

We focus on the AlphaZero family, comprised of AlphaGo (Silver et al., 2016), AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2019), which are among the most successful algorithms in combining model-free and model-based RL. Although they make different assumptions, all of these methods share the same underlying search algorithm, which we refer to as AlphaZero for simplicity.

From a state x, AlphaZero uses MCTS (Browne et al., 2012) to compute an improved policy π̂(·|x) at the root of the search tree from the prior distribution predicted by a policy network π_θ(·|x)¹; see Eq. 3 for the definition. This improved policy is then distilled back into π_θ by updating θ as θ ← θ − η ∇_θ E_x[D(π̂(·|x), π_θ(·|x))] for a certain divergence D. In turn, the distilled parameterized policy π_θ informs the next local search by predicting priors, further improving the local policy over successive iterations. Therefore, such an algorithmic procedure is a special case of generalized policy improvement (Sutton and Barto, 1998).

One of the main differences between AlphaZero and previous MCTS algorithms such as UCT (Kocsis and Szepesvári, 2006) is the introduction of a learned prior π_θ and value function v_θ. Additionally, AlphaZero's search procedure applies the following action selection heuristic,

$$\arg\max_a \left[ Q(x,a) + c \cdot \pi_\theta(a|x) \cdot \frac{\sqrt{\sum_b n(x,b)}}{1 + n(x,a)} \right], \qquad (1)$$

where c is a numerical constant², n(x, a) is the number of times that action a has been selected from state x during search, and Q(x, a) is an estimate of the Q-function for the state-action pair (x, a) computed from search statistics and using v_θ for bootstrapping.

Intuitively, this selection criterion balances exploration and exploitation, by selecting the most promising actions (high Q-value Q(x, a) and prior policy π_θ(a|x)) or actions that have rarely been explored (small visit count n(x, a)). We denote by N_sim the simulation budget, i.e., the search is run with N_sim simulations. A more detailed presentation of AlphaZero is in Appendix A; for a full description of the algorithm, refer to Silver et al. (2017a).

¹We note here that terminologies such as prior follow Silver et al. (2017a) and do not relate to concepts in Bayesian statistics.
²Schrittwieser et al. (2019) uses a c that has a slow-varying dependency on Σ_b n(x, b), which we omit here for simplicity, as was the case in Silver et al. (2017a).
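To make Eq. 1 concrete, here is a minimal NumPy sketch of the selection rule (illustrative only, not the paper's implementation); the array names and the value of the constant c are assumptions.

```python
import numpy as np

def select_action(q, prior, visits, c=1.25):
    """AlphaZero-style action selection (Eq. 1), sketched with NumPy arrays.

    q      -- Q-value estimates Q(x, a) from search statistics
    prior  -- prior probabilities pi_theta(a|x) from the policy network
    visits -- visit counts n(x, a)
    c      -- exploration constant (illustrative value, not from the paper)
    """
    total_visits = visits.sum()                     # sum_b n(x, b)
    exploration = c * prior * np.sqrt(total_visits) / (1.0 + visits)
    return int(np.argmax(q + exploration))

# Toy usage with three actions at a root node.
q = np.array([0.1, 0.5, 0.3])
prior = np.array([0.6, 0.2, 0.2])
visits = np.array([10, 2, 4])
print(select_action(q, prior, visits))
```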
2.2. Policy optimization

Policy optimization aims at finding a globally optimal policy π_θ, generally using iterative updates. Each iteration updates the current policy π_θ by solving a local maximization problem of the form

$$\pi'_\theta \triangleq \arg\max_{y \in \mathcal{S}} \left[ Q_{\pi_\theta}^{T} y - \mathcal{R}(y, \pi_\theta) \right], \qquad (2)$$

where Q_{π_θ} is an estimate of the Q-function, S is the |A|-dimensional simplex, and R : S → ℝ is a convex regularization term (Neu et al., 2017; Grill et al., 2019; Geist et al., 2019). Intuitively, Eq. 2 updates π_θ to maximize the value Q_{π_θ}^T y while constraining the update with a regularization term R(y, π_θ).

Without regularization, i.e., R = 0, Eq. 2 reduces to policy iteration (Sutton and Barto, 1998). When π_θ is updated using a single gradient ascent step towards the solution of Eq. 2, instead of using the solution directly, the above formulation reduces to (regularized) policy gradient (Sutton et al., 2000; Levine, 2018).

Interestingly, the regularization term has been found to stabilize, and possibly to speed up, the convergence of π_θ. For instance, trust region policy search algorithms (TRPO, Schulman et al., 2015; MPO, Abdolmaleki et al., 2018; V-MPO, Song et al., 2019) set R to be the KL-divergence between consecutive policies, KL[y, π_θ]; maximum entropy RL (Ziebart, 2010; Fox et al., 2015; O'Donoghue et al., 2016; Haarnoja et al., 2017) sets R to be the negative entropy of y, to avoid collapsing to a deterministic policy.
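As an illustration of Eq. 2 (a sketch, not code from the paper), consider the trust-region style choice R(y, π_θ) = λ·KL[y, π_θ] mentioned above: the maximizer over the simplex then has the standard softmax form y ∝ π_θ · exp(Q/λ). The function name and the values of λ below are illustrative assumptions.

```python
import numpy as np

def kl_regularized_update(q, pi_theta, lam=1.0):
    """Maximizer of q^T y - lam * KL(y || pi_theta) over the simplex.

    For this KL-regularized instance of Eq. 2 the solution has the
    closed form y proportional to pi_theta * exp(q / lam).
    """
    logits = np.log(pi_theta) + q / lam
    logits -= logits.max()              # subtract max for numerical stability
    y = np.exp(logits)
    return y / y.sum()

# Toy usage: a small lam trusts the Q-estimates, a large lam keeps y close to pi_theta.
q = np.array([0.1, 0.5, 0.3])
pi_theta = np.array([0.6, 0.2, 0.2])
print(kl_regularized_update(q, pi_theta, lam=0.5))
print(kl_regularized_update(q, pi_theta, lam=10.0))
```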
3. MCTS as regularized policy optimization

In Section 2, we presented AlphaZero, which relies on model-based planning. We also presented policy optimization, a framework that has achieved good performance in model-free RL. In this section, we establish our main claim, namely that AlphaZero's action selection criterion can be interpreted as approximating the solution to a regularized policy-optimization objective.

3.1. Notation

First, let us define the empirical visit distribution π̂ as

$$\hat{\pi}(a|x) \triangleq \frac{1 + n(x,a)}{|\mathcal{A}| + \sum_b n(x,b)}\,. \qquad (3)$$

Note that in Eq. 3, we consider an extra visit per action compared to the acting policy and distillation target in the original definition (Silver et al., 2016). This extra visit is introduced for convenience in the upcoming analysis (to avoid divisions by zero) and does not change the generality of our results.

We also define the multiplier λ_N as

$$\lambda_N(x) \triangleq c \cdot \frac{\sqrt{\sum_b n_b}}{|\mathcal{A}| + \sum_b n_b}, \qquad (4)$$

where the shorthand notation n_a is used for n(x, a), and N(x) ≜ Σ_b n_b denotes the number of visits to x during search. With this notation, the action selection formula of Eq. 1 can be written as selecting the action a* such that

$$a^{\star} = \arg\max_a \left[ Q(x,a) + \lambda_N(x) \cdot \frac{\pi_\theta(a|x)}{\hat{\pi}(a|x)} \right].$$

Remark. The factor λ_N is a decreasing function of N; asymptotically, λ_N = O(1/√N). Therefore, the influence of the regularization term decreases as the number of simulations increases, which makes π̄ rely increasingly more on the search Q-values q and less on the policy prior π_θ.
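As a concrete reading of Eqs. 3 and 4 (an illustrative sketch, not the paper's code), both π̂ and λ_N are simple functions of the vector of visit counts:

```python
import numpy as np

def visit_distribution(visits):
    """Empirical visit distribution pi_hat of Eq. 3 (one extra visit per action)."""
    num_actions = len(visits)
    return (1.0 + visits) / (num_actions + visits.sum())

def multiplier(visits, c=1.25):
    """Multiplier lambda_N of Eq. 4; it decays like O(1/sqrt(N)) as search progresses."""
    num_actions = len(visits)
    total = visits.sum()
    return c * np.sqrt(total) / (num_actions + total)

# Toy usage: pi_hat sums to 1 by construction; lambda_N shrinks as counts grow.
visits = np.array([10, 2, 4])
print(visit_distribution(visits), visit_distribution(visits).sum())
print(multiplier(visits), multiplier(10 * visits))
```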
Recommended publications
  • Game Changer
    Matthew Sadler and Natasha Regan, Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI. New In Chess, 2019.

    Contents: Explanation of symbols; Foreword by Garry Kasparov; Introduction by Demis Hassabis; Preface; Introduction. Part I, AlphaZero's history: Chapter 1, A quick tour of computer chess competition; Chapter 2, ZeroZeroZero; Chapter 3, Demis Hassabis, DeepMind and AI. Part II, Inside the box: Chapter 4, How AlphaZero thinks; Chapter 5, AlphaZero's style – meeting in the middle. Part III, Themes in AlphaZero's play: Chapter 6, Introduction to our selected AlphaZero themes; Chapter 7, Piece mobility: outposts; Chapter 8, Piece mobility: activity; Chapter 9, Attacking the king: the march of the rook's pawn; Chapter 10, Attacking the king: colour complexes; Chapter 11, Attacking the king: sacrifices for time, space and damage; Chapter 12, Attacking the king: opposite-side castling; Chapter 13, Attacking the king: defence. Part IV, AlphaZero's …
  • Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge
    electronics, Article. Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge. Yifan Gao 1,*,† and Lezhou Wu 2,†. 1 College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China; 2 College of Information Science and Engineering, Northeastern University, Liaoning 110819, China; [email protected]. * Correspondence: [email protected]. † These authors contributed equally to this work.

    Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for the AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained by 20,000 self-play games' data (small in quantity compared with the 120,000 self-play games' data consumed by the original AlphaZero). The NoGo game program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs.
  • ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. Yuandong Tian 1, Jerry Ma *1, Qucheng Gong *1, Shubho Sengupta *1, Zhuoyuan Chen 1, James Pinkerton 1, C. Lawrence Zitnick 1.

    Abstract: The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training …

    However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend. In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions. First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018).
  • Alpha Zero Paper
    RESEARCH | COMPUTER SCIENCE. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. David Silver 1,2*†, Thomas Hubert 1*, Julian Schrittwieser 1*, Ioannis Antonoglou 1, Matthew Lai 1, Arthur Guez 1, Marc Lanctot 1, Laurent Sifre 1, Dharshan Kumaran 1, Thore Graepel 1, Timothy Lillicrap 1, Karen Simonyan 1, Demis Hassabis 1†.

    The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games.

    … programmers, combined with a high-performance alpha-beta search that expands a vast search tree by using a large number of clever heuristics and domain-specific adaptations. In (10) we describe these augmentations, focusing on the 2016 Top Chess Engine Championship (TCEC) season 9 world champion Stockfish (11); other strong chess programs, including Deep Blue, use very similar architectures (1, 12). In terms of game tree complexity, shogi is a substantially harder game than chess (13, 14): it is played on a larger board with a wider variety of pieces; any captured opponent piece switches sides and may subsequently be dropped anywhere on the board. The strongest shogi programs, such as the 2017 Computer Shogi Association (CSA) world champion Elmo, have only recently defeated human champions (15). These programs use an algorithm similar to those used by computer chess programs, again based on a highly optimized alpha-beta search engine with many …
  • Introduction to Deep Reinforcement Learning
    Parallel & Scalable Machine Learning: Introduction to Machine Learning Algorithms. Dr. Jenia Jitsev, Head of Cross-Sectional Team Deep Learning (CST-DL), Scientific Lead Helmholtz AI Local (HAICU Local), Institute for Advanced Simulation (IAS), Juelich Supercomputing Center (JSC). Lecture 9: Introduction to Deep Reinforcement Learning, Feb 19th, 2020, JSC, Germany.

    Machine Learning: Forms of Learning. Supervised learning: correct responses Y for input data X are given; Y is the "teacher" signal, the correct "outcomes" or "labels" for the data X; usually one estimates an unknown f: X→Y, y = f(x; W); classical frameworks: classification, regression. Unsupervised learning: only data X is given; find "hidden" structure and patterns in the data; in general, estimate an unknown probability density p(X), i.e., find a model that underlies or generates X (a broad class of latent, "hidden", variable models); classical frameworks: clustering, dimensionality reduction (e.g., PCA). Reinforcement learning: data X, including (sparse) reward r(X); discover actions a that maximize total future reward R; active learning: experience X depends on the choice of a; estimate p(a|X), p(r|X), V(X), Q(X,a) as future reward predictors. For all of these: define a loss L(D,W) and optimize it by tuning the free parameters W.

    Deep Neural Networks: Forms of Learning. Supervised learning: correct responses Y for input data X are given; find an unknown f: y = f(x;W) or density p(Y|X;W) for data (x,y); deep CNNs for visual object recognition (e.g., Inception, ResNet, ...). Unsupervised learning: only data X is given; general setting: estimate an unknown density p(X;W), i.e., find a model that underlies or generates X (broad class of latent variable models); Variational Autoencoder (VAE) for data generation and inference; Generative Adversarial Networks (GANs) for data generation; autoregressive models: PixelRNN, PixelCNN, …
  • Tackling Morpion Solitaire with AlphaZero-Like Ranked Reward Reinforcement Learning
    Tackling Morpion Solitaire with AlphaZero-like Ranked Reward Reinforcement Learning. Hui Wang, Mike Preuss, Michael Emmerich and Aske Plaat, Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands. Email: [email protected].

    Abstract: Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go) traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported, for about a decade. In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the state space is sparse, there are few win/loss signals.

    AlphaGo and AlphaZero combine deep neural networks [9] and Monte Carlo Tree Search (MCTS) [10] in a self-play framework that learns by curriculum learning [11]. Unfortunately, these approaches can not be directly used to play single agent combinatorial games, such as travelling salesman problems (TSP) [12] and bin packing problems (BPP) [13], where cost minimization is the goal of the game. To apply self-play for single player games, Laterre et al. proposed a Ranked Reward (R2) algorithm. R2 creates a relative performance metric by means of ranking the rewards obtained …
  • Chess Fortresses, a Causal Test for State of the Art Symbolic[Neuro] Architectures
    Chess fortresses, a causal test for state of the art Symbolic[Neuro] architectures. Hedinn Steingrimsson, Electrical and Computer Engineering, Rice University, Houston, TX 77005. [email protected].

    Abstract: The AI community's growing interest in causality is motivating the need for benchmarks that test the limits of neural network based probabilistic reasoning and state of the art search algorithms. In this paper we present such a benchmark, chess fortresses. To make the benchmarking task as challenging as possible, we compile a test set of positions in which the defender has a unique best move for entering a fortress defense formation. The defense formation can be invariant to adding a certain type of chess piece to the attacking side, thereby defying the laws of probabilistic reasoning. The new dataset enables efficient testing of move prediction since every experiment yields a conclusive outcome, which is a significant improvement over traditional methods. We carry out extensive, large scale tests of the convolutional neural networks of Leela Chess Zero [1], an open source chess engine built on the same principles as AlphaZero (AZ), which has passed AZ in chess playing ability according to many rating lists.

    … (which can only move on white squares) does not contribute to battling the key g5 square (which is on a black square), which is sufficiently protected by the white pieces. These characteristics beat the laws of probabilistic reasoning, where extra material typically means increased chances of winning the game. This feature makes the new benchmark especially challenging for neural based architectures. Our goal in this paper is to provide a new benchmark of logical nature - aimed at being challenging or a hard class - which modern architectures can be measured against. This new dataset is provided in the Supplement 1. Exploring hard classes has proven to be fruitful ground for fundamental architectural improvements in the past as shown by Olga Russakovsky's …
  • Accelerating and Improving AlphaZero Using Population Based Training
    The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Accelerating and Improving AlphaZero Using Population Based Training. Ti-Rong Wu 1, Ting-Han Wei 1,3, I-Chen Wu 1,2. 1 Department of Computer Science, National Chiao Tung University, Taiwan; 2 Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan; 3 Department of Computing Science, University of Alberta, Edmonton, Canada. {kds285, ting, icwu}@aigames.nctu.edu.tw

    Abstract: AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost since each hyperparameter configuration requires its own time to train one run, during which it will generate its own self-play records. As a result, multiple runs are usually needed for different hyperparameter configurations. This paper proposes using population based training (PBT) to help tune hyperparameters dynamically and improve strength during training time. Another significant advantage is that this method requires a single …

    … a reimplementation of the AlphaGo Zero/AlphaZero algorithm, Facebook AI Research is also moving ahead with the more general ELF project (Tian et al. 2019), which is aimed at covering a wider range of games for reinforcement learning research. In late August, 2019, DeepMind also announced the OpenSpiel framework, with the goal of incorporating various games with different properties (Lanctot et al. 2019). Given the continued interest in using games as a reinforcement learning environment, there are still issues that need to be resolved even with a powerful algorithm such as AlphaZero.
  • A General Reinforcement Learning Algorithm That Masters Chess, Shogi and Go Through Self-Play
    A general reinforcement learning algorithm that masters chess, shogi and Go through self-play. David Silver 1,2*, Thomas Hubert 1*, Julian Schrittwieser 1*, Ioannis Antonoglou 1,2, Matthew Lai 1, Arthur Guez 1, Marc Lanctot 1, Laurent Sifre 1, Dharshan Kumaran 1,2, Thore Graepel 1,2, Timothy Lillicrap 1, Karen Simonyan 1, Demis Hassabis 1. 1 DeepMind, 6 Pancras Square, London N1C 4AG. 2 University College London, Gower Street, London WC1E 6BT. * These authors contributed equally to this work.

    Abstract: The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess) as well as Go.

    The study of computer chess is as old as computer science itself. Charles Babbage, Alan Turing, Claude Shannon, and John von Neumann devised hardware, algorithms and theory to analyse and play the game of chess. Chess subsequently became a grand challenge task for a generation of artificial intelligence researchers, culminating in high-performance computer chess programs that play at a super-human level (1, 2).
  • Playing Nondeterministic Games Through Planning with a Learned Model
    Under review as a conference paper at ICLR 2021. PLAYING NONDETERMINISTIC GAMES THROUGH PLANNING WITH A LEARNED MODEL. Anonymous authors, paper under double-blind review.

    Abstract: The MuZero algorithm is known for achieving high-level performance on traditional zero-sum two-player games of perfect information such as chess, Go, and shogi, as well as visual, non-zero sum, single-player environments such as the Atari suite. Despite lacking a perfect simulator and employing a learned model of environmental dynamics, MuZero produces game-playing agents comparable to its predecessor, AlphaZero. However, the current implementation of MuZero is restricted only to deterministic environments. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero for nondeterministic, two-player, zero-sum games of perfect information. Borrowing from Nondeterministic Monte Carlo Tree Search and the theory of extensive-form games, NDMZ formalizes chance as a player in the game and incorporates the chance player into the MuZero network architecture and tree search. Experiments show that NDMZ is capable of learning effective strategies and an accurate model of the game.

    1 INTRODUCTION. While the AlphaZero algorithm achieved superhuman performance in a variety of challenging domains, it relies upon a perfect simulation of the environment dynamics to perform precision planning. MuZero, the newest member of the AlphaZero family, combines the advantages of planning with a learned model of its environment, allowing it to tackle problems such as the Atari suite without the advantage of a simulator. This paper presents Nondeterministic MuZero (NDMZ), an extension of MuZero to stochastic, two-player, zero-sum games of perfect information.
  • The Impact of Artificial Intelligence on the Chess World
    JMIR SERIOUS GAMES, Viewpoint. The Impact of Artificial Intelligence on the Chess World. Delia Monica Duca Iliescu, BSc, MSc, Transilvania University of Brasov, Brasov, Romania. Corresponding author: Delia Monica Duca Iliescu, Transilvania University of Brasov, Bdul Eroilor 29, Brasov, Romania. Phone: 40 268413000. Email: [email protected].

    Abstract: This paper focuses on key areas in which artificial intelligence has affected the chess world, including cheat detection methods, which are especially necessary recently, as there has been an unexpected rise in the popularity of online chess. Many major chess events that were to take place in 2020 have been canceled, but the global popularity of chess has in fact grown in recent months due to easier conversion of the game from offline to online formats compared with other games. Still, though a game of chess can be easily played online, there are some concerns about the increased chances of cheating. Artificial intelligence can address these concerns. (JMIR Serious Games 2020;8(4):e24049) doi: 10.2196/24049

    Keywords: artificial intelligence; games; chess; AlphaZero; MuZero; cheat detection; coronavirus

    Introduction: All major chess events that were to take place in the second half of 2020 have been canceled, including the 44th Chess Olympiad and the match for the title of World Chess Champion. However, … strategic games, has affected many other areas of interest, as has already been seen in recent years. One of the most famous human versus machine events was the 1997 victory of Deep Blue, an IBM chess software, against the famous chess champion Garry Kasparov [3].
  • Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Julian Schrittwieser 1*, Ioannis Antonoglou 1,2*, Thomas Hubert 1*, Karen Simonyan 1, Laurent Sifre 1, Simon Schmitt 1, Arthur Guez 1, Edward Lockhart 1, Demis Hassabis 1, Thore Graepel 1,2, Timothy Lillicrap 1, David Silver 1,2*. 1 DeepMind, 6 Pancras Square, London N1C 4AG. 2 University College London, Gower Street, London WC1E 6BT. * These authors contributed equally to this work.

    Abstract: Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.