AlphaZero with Input Convex Neural Networks

Total pages: 16

File type: PDF, size: 1020 KB

AlphaZero with Input Convex Neural Networks
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
SHUYUAN ZHANG
KTH ROYAL INSTITUTE OF TECHNOLOGY, SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: July 27, 2020
Supervisor: John Folkesson
Examiner: Hossein Azizpour
Host company: RISE AB

Abstract

Modelling and solving real-life problems with reinforcement learning (RL) is an important branch of artificial intelligence (AI). In board games, AlphaZero has proven successful against professional human players and other AI programs in games such as Go, Chess, and Shogi. The basic components of the AlphaZero algorithm are Monte Carlo Tree Search (MCTS) and deep neural networks that predict state values and policies. These networks fit the mapping from a state to its value and policy, making the initialization of the state value and policy during search more accurate. In this thesis project, we propose Convex-AlphaZero, which uses a new prediction structure for the state value and policy, and we assess its viability with theoretical arguments and experimental results. Instead of obtaining these values with a single feed-forward pass, our adaptation treats prediction as an optimization process, using input convex neural networks that model the state value as a convex function of the policy given the state (i.e. the game board configuration). The results of our experiments show that our method outperforms traditional mini-max approaches and is worth further research on games other than Connect Four, the game used in this thesis project.

Sammanfattning

(Swedish version of the abstract above.)
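The key mechanism behind the abstract's claim is that an input convex neural network keeps its output convex in one designated input (here the policy) by constraining the weights applied to previous convex-path activations to be non-negative and by using convex, non-decreasing activations such as ReLU, while depending arbitrarily on the other input (the state). The following is a minimal sketch of such a partially input-convex value network in PyTorch; the class name, layer sizes, and fully connected (rather than convolutional) structure are illustrative assumptions, not the thesis's actual architecture (described in Chapter 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallyConvexValueNet(nn.Module):
    """Sketch of a partially input-convex network V(s, pi):
    convex in the policy input pi, unrestricted in the state input s."""

    def __init__(self, state_dim, policy_dim, hidden=64):
        super().__init__()
        # State path: an ordinary (non-convex) feed-forward branch.
        self.state_fc1 = nn.Linear(state_dim, hidden)
        self.state_fc2 = nn.Linear(hidden, hidden)
        # Convex path: weights applied to previous convex activations must be
        # non-negative to preserve convexity in pi.
        self.z1 = nn.Linear(policy_dim, hidden)
        self.z2_pi = nn.Linear(policy_dim, hidden)          # direct skip from pi
        self.z2_z = nn.Linear(hidden, hidden, bias=False)   # constrained >= 0
        self.out_pi = nn.Linear(policy_dim, 1)
        self.out_z = nn.Linear(hidden, 1, bias=False)       # constrained >= 0
        self.state_to_z1 = nn.Linear(hidden, hidden)
        self.state_to_z2 = nn.Linear(hidden, hidden)

    def forward(self, state, pi):
        u1 = F.relu(self.state_fc1(state))
        u2 = F.relu(self.state_fc2(u1))
        # ReLU is convex and non-decreasing, so each z_k stays convex in pi.
        z1 = F.relu(self.z1(pi) + self.state_to_z1(u1))
        z2 = F.relu(self.z2_pi(pi) + self.z2_z(z1) + self.state_to_z2(u2))
        return self.out_pi(pi) + self.out_z(z2)

    def project_weights(self):
        # Clamp the convex-path weights so V(s, .) remains convex in pi.
        for layer in (self.z2_z, self.out_z):
            layer.weight.data.clamp_(min=0.0)
```

Calling project_weights() after every optimizer step is one simple way to maintain the non-negativity constraint during training.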
Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Research Question
  1.4 Scope, challenges, and limitations
  1.5 Contributions
  1.6 Societal Impacts
  1.7 Ethical Considerations
  1.8 UN SDG Goals
  1.9 Acknowledgements
2 Background
  2.1 Reinforcement Learning
    2.1.1 Basics of RL
    2.1.2 Markov Decision Processes (MDP)
    2.1.3 Sampling Methods in RL
    2.1.4 Policy Optimization Using Policy Gradient Methods
    2.1.5 Model-based and Model-free RL
    2.1.6 Deep Reinforcement Learning
    2.1.7 Exploration and Exploitation
  2.2 AlphaZero Overall Description
    2.2.1 Deep Neural Network in AlphaZero
    2.2.2 Monte Carlo Tree Search in AlphaZero
    2.2.3 Playing
    2.2.4 Replay Buffer
  2.3 Input Convex Neural Networks
    2.3.1 Structure of ICNN
    2.3.2 Inference in ICNN
    2.3.3 Training in ICNN
    2.3.4 ICNN in AlphaZero
    2.3.5 Application of ICNNs in RL
  2.4 The Game of Connect Four
    2.4.1 Game Introduction
    2.4.2 Previous Solutions of the Game
3 Methods
  3.1 Input Convex Neural Network in AlphaZero
    3.1.1 Network Structure
    3.1.2 Inference
    3.1.3 Training
    3.1.4 Causality Reasoning
  3.2 Variance Reduction
    3.2.1 Uniform Policy
    3.2.2 Merging Matching States
  3.3 Other Details
    3.3.1 Game State Representation
    3.3.2 Extending Data Set
    3.3.3 Clipping And Normalizing Policies
4 Results
  4.1 Training Curves
  4.2 Player Strength Comparison
    4.2.1 Play with a Mini-max Agent
    4.2.2 AlphaZero vs Convex-AlphaZero
    4.2.3 Winning Rate under Different Decision Time
  4.3 Experiments about the ICNN
    4.3.1 Raw Network Performance
    4.3.2 Average Game Length
5 Discussion
  5.1 Comments on General Performance
  5.2 Comments on ICNN Performance
  5.3 Limitations
  5.4 Future Work
6 Conclusions
Bibliography
A Computing Platform Configuration

Chapter 1 Introduction

Human-computer competition on board games has been a hot topic in computer science and artificial intelligence for decades. As early as 1948, Alan Turing showed the possibility of letting intelligent machines play games such as chess, bridge, and poker like a human [1]. Then in 1956, the first chess program, Los Alamos [2], was developed by Paul Stein and Mark Wells for the MANIAC I computer, which opened the era of playing chess and other games with computers. Mini-max [3] is a basic algorithm used in game-playing programs. The most challenging task for such programs is searching game states efficiently.
Chess, as one of the most famous board games, has about 10^47 [4] possible game states. A complete search of such a large game space is impossible even with modern supercomputers. As a result, scientists have tried hard to reduce the scale of the search, either by limiting the depth of Mini-max search or by pruning [5]. The most successful implementation of a Mini-max based approach was Deep Blue [6], a Chess program that defeated the reigning human Chess world champion, Kasparov, in 1997. When it comes to Go, however, such algorithms no longer work, because Go has about 10^170 game states (10^123 times more than Chess), which makes the Mini-max based approach infeasible. Although the game of Go is extremely complex, this did not discourage scientists, who turned to more modern approaches such as reinforcement learning (RL) [7] and deep learning. The combination of RL and search algorithms finally gave birth to AlphaGo [8], which defeated the human world champion Lee Sedol in 2016.

1.1 Motivation

The direct motivation of this project is to propose and evaluate a new variant of AlphaZero, called Convex-AlphaZero. It treats the state value as a function of both the state and the policy using input convex neural networks. AlphaZero achieved state-of-the-art results when compared with other AI programs, but its network maps states directly to their values, ignoring the relationship between policies and values. Convex-AlphaZero may benefit from introducing an extra causal link between the current policy and the value using a two-input, one-output network. We would like to investigate the performance of the proposed method and see how the modification affects the agent's overall performance.

The motivation of this project goes beyond applying Convex-AlphaZero to board games. In real-life automatic control or planning, the amount of data can be extremely large, data pre-processing is expensive, and supervised learning may be infeasible. Reinforcement learning algorithms like AlphaZero, which do not need human effort to process the data and reduce the computing cost through sampling, may bring a new form of artificial intelligence to these practical areas.

1.2 Problem Definition

AlphaZero currently uses a feed-forward network to predict a given state's value and best policy. We change this network to an input convex neural network to model the state value in another way. Input convex neural networks have been proven effective in areas such as multi-label classification, image completion, and continuous-action reinforcement learning. In conclusion, the main problem we want to discuss can be boiled down to: How does replacing the convolutional neural network in AlphaZero with an input convex neural network affect the agent's overall performance?
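To make the proposed change concrete, here is a sketch of how inference could look once the value is modelled as a convex function of the policy: rather than reading the policy off a single forward pass, the policy itself is optimized against the fixed board state. It assumes the PartiallyConvexValueNet sketched earlier, a softmax parameterization to keep the policy on the probability simplex, and a plain gradient-based optimizer; the thesis's actual inference procedure (Section 3.1.2), including the sign convention and the solver, may differ.

```python
import torch

def infer_value_and_policy(net, state, n_actions, steps=50, lr=0.1):
    """Inference as optimization: search over the policy for the optimum of the
    convex value V(s, pi), instead of reading it off a single forward pass.
    The policy is parameterized as pi = softmax(logits) so it stays a distribution."""
    logits = torch.zeros(1, n_actions, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        pi = torch.softmax(logits, dim=-1)
        # Sign convention is an assumption: the network output is treated here
        # as a convex objective to be minimized over pi for the fixed state.
        objective = net(state, pi).mean()
        optimizer.zero_grad()
        objective.backward()
        optimizer.step()
    with torch.no_grad():
        pi = torch.softmax(logits, dim=-1)
        return net(state, pi).item(), pi.squeeze(0)

# Hypothetical usage (dimensions chosen for a 6x7 Connect Four board, 7 moves):
# net = PartiallyConvexValueNet(state_dim=42, policy_dim=7)
# value, policy = infer_value_and_policy(net, torch.randn(1, 42), n_actions=7)
```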
Recommended publications
  • Game Changer
    Matthew Sadler and Natasha Regan, Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI, New In Chess 2019. Contents: Explanation of symbols 6; Foreword by Garry Kasparov 7; Introduction by Demis Hassabis 11; Preface 16; Introduction 19; Part I: AlphaZero's history 23; Chapter 1: A quick tour of computer chess competition 24; Chapter 2: ZeroZeroZero 33; Chapter 3: Demis Hassabis, DeepMind and AI 54; Part II: Inside the box 67; Chapter 4: How AlphaZero thinks 68; Chapter 5: AlphaZero's style – meeting in the middle 87; Part III: Themes in AlphaZero's play 131; Chapter 6: Introduction to our selected AlphaZero themes 132; Chapter 7: Piece mobility: outposts 137; Chapter 8: Piece mobility: activity 168; Chapter 9: Attacking the king: the march of the rook's pawn 208; Chapter 10: Attacking the king: colour complexes 235; Chapter 11: Attacking the king: sacrifices for time, space and damage 276; Chapter 12: Attacking the king: opposite-side castling 299; Chapter 13: Attacking the king: defence 321; Part IV: AlphaZero's ...
    [Show full text]
  • Reinforcement Learning (1) Key Concepts & Algorithms
    Winter 2020, CSC 594 Topics in AI: Advanced Deep Learning, 5. Deep Reinforcement Learning (1): Key Concepts & Algorithms (most content adapted from OpenAI 'Spinning Up'), Noriko Tomuro. Reinforcement Learning (RL): RL is a type of machine learning where an agent learns to achieve a goal by interacting with the environment -- trial and error. RL is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. The purpose of RL is to learn an optimal policy that maximizes the return over the sequence of the agent's actions. RL has recently enjoyed a wide variety of successes, e.g. robot control in simulation as well as in the real world, and strategy games such as AlphaGo (by Google DeepMind) and Atari games (https://en.wikipedia.org/wiki/Reinforcement_learning). Deep Reinforcement Learning (DRL): a policy is essentially a function that maps each of the agent's actions to the expected return or reward; DRL uses deep neural networks for this function (and other components). Some key concepts and terminology. States and observations: a state s is a complete description of the state of the world; for now, we can think of states as belonging to the environment. An observation o is a partial description of a state, which may omit information. A state can be fully or partially observable to the agent; if partially, the agent forms an internal state (or state estimate). In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor (https://spinningup.openai.com/en/latest/spinningup/rl_intro.html).
    [Show full text]
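As a concrete companion to the definitions in the excerpt above, the following is a minimal, framework-free sketch of the agent-environment loop and of the discounted return that an optimal policy is meant to maximize. The env and policy objects are hypothetical placeholders with a gym-like reset/step interface, not part of any specific library.

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def run_episode(env, policy):
    """One episode of the agent-environment loop: observe, act, collect reward.
    `env.reset()` returns an initial observation; `env.step(a)` returns
    (next_observation, reward, done) -- a simplified, assumed interface."""
    obs = env.reset()
    rewards, done = [], False
    while not done:
        action = policy(obs)               # the policy maps observations to actions
        obs, reward, done = env.step(action)
        rewards.append(reward)
    return discounted_return(rewards)
```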
  • OLIVAW: Mastering Othello with Neither Humans Nor a Penny Antonio Norelli, Alessandro Panconesi Dept
    OLIVAW: Mastering Othello with neither Humans nor a Penny. Antonio Norelli, Alessandro Panconesi, Dept. of Computer Science, Università La Sapienza, Rome, Italy. Abstract: We introduce OLIVAW, an AI Othello player adopting the design principles of the famous AlphaGo series. The main motivation behind OLIVAW was to attain exceptional competence in a non-trivial board game at a tiny fraction of the cost of its illustrious predecessors. In this paper, we show how the AlphaGo Zero paradigm can be successfully applied to the popular game of Othello using only commodity hardware and free cloud services. While being simpler than Chess or Go, Othello maintains a considerable search space and difficulty in evaluating board positions. To achieve this result, OLIVAW implements some improvements inspired by recent works to accelerate the standard AlphaGo Zero learning process. The main modification implies doubling the positions collected per game during the training phase, by including also positions not played but largely explored by the agent. ... of companies. Another aspect of the same problem is the amount of training needed. AlphaGo Zero required 4.9 million games played during self-play, while to attain the level of grandmaster for games like Starcraft II and Dota 2 the training required 200 years and more than 10,000 years of gameplay, respectively [7], [8]. Thus one of the major problems to emerge in the wake of these breakthroughs is whether comparable results can be attained at a much lower cost, computational and financial, and with just commodity hardware. In this paper we take a small step in this direction, by showing that AlphaGo Zero's successful paradigm can be replicated for the game ...
    [Show full text]
  • Monte-Carlo Tree Search As Regularized Policy Optimization
    Monte-Carlo tree search as regularized policy optimization. Jean-Bastien Grill*, Florent Altché*, Yunhao Tang*, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos. Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains. AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016). Recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results in the Atari benchmarks (Bellemare et al., 2013). Our main contribution is connecting MCTS algorithms, in particular the highly-successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that comes with significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to ...
    [Show full text]
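The excerpt's central claim, that the empirical visit distribution of AlphaZero's search approximates the solution of a regularized policy-optimization problem, can be written (in the spirit of the paper; the exact notation and the direction of the KL term may differ) as an objective over the probability simplex $\mathcal{S}$:

$$\bar{\pi} \;=\; \arg\max_{y \in \mathcal{S}} \Big[\, q^{\top} y \;-\; \lambda_N \, \mathrm{KL}\big(\pi_\theta \,\|\, y\big) \Big],$$

where $q$ collects the Q-value estimates at the root, $\pi_\theta$ is the network's prior policy, and $\lambda_N$ is a regularization weight that decays as the total visit count $N$ grows. The proposed variant then acts on this exact solution instead of on the raw visit counts.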
  • Towards Incremental Agent Enhancement for Evolving Games
    Evaluating Reinforcement Learning Algorithms For Evolving Military Games. James Chao*, Jonathan Sato*, Crisrael Lucero, Doug S. Lange, Naval Information Warfare Center Pacific (*equal contribution) ffi[email protected]. Abstract: In this paper, we evaluate reinforcement learning algorithms for military board games. Currently, machine learning approaches to most games assume certain aspects of the game remain static. This methodology results in a lack of algorithm robustness and a drastic drop in performance upon changing in-game mechanics. To this end, we will evaluate general game playing (Diego Perez-Liebana 2018) AI algorithms on evolving military games. Introduction: AlphaZero (Silver et al. 2017a) described an approach that trained an AI agent through self-play to achieve super-human performance. ... games in 2013 (Mnih et al. 2013), Google DeepMind developed AlphaGo (Silver et al. 2016) that defeated world champion Lee Sedol in the game of Go using supervised learning and reinforcement learning. One year later, AlphaGo Zero (Silver et al. 2017b) was able to defeat AlphaGo with no human knowledge and pure reinforcement learning. Soon after, AlphaZero (Silver et al. 2017a) generalized AlphaGo Zero to be able to play more games including Chess, Shogi, and Go, creating a more generalized AI to apply to different problems. In 2018, OpenAI Five used five Long Short-term Memory (Hochreiter and Schmidhuber 1997) neural networks and a Proximal Policy Optimization (Schulman et al. 2017) method to defeat a professional DotA team, each LSTM acting as a player in a team to collaborate and achieve a common goal. AlphaStar used a transformer (Vaswani et al. ...
    [Show full text]
  • Understanding & Generalizing Alphago Zero
    Under review as a conference paper at ICLR 2019. UNDERSTANDING & GENERALIZING ALPHAGO ZERO. Anonymous authors, paper under double-blind review. ABSTRACT: AlphaGo Zero (AGZ) (Silver et al., 2017b) introduced a new tabula rasa reinforcement learning algorithm that has achieved superhuman performance in the games of Go, Chess, and Shogi with no prior knowledge other than the rules of the game. This success naturally begs the question whether it is possible to develop similar high-performance reinforcement learning algorithms for generic sequential decision-making problems (beyond two-player games), using only the constraints of the environment as the "rules." To address this challenge, we start by taking steps towards developing a formal understanding of AGZ. AGZ includes two key innovations: (1) it learns a policy (represented as a neural network) using supervised learning with cross-entropy loss from samples generated via Monte-Carlo Tree Search (MCTS); (2) it uses self-play to learn without training data. We argue that the self-play in AGZ corresponds to learning a Nash equilibrium for the two-player game; and the supervised learning with MCTS is attempting to learn the policy corresponding to the Nash equilibrium, by establishing a novel bound on the difference between the expected return achieved by two policies in terms of the expected KL divergence (cross-entropy) of their induced distributions. To extend AGZ to generic sequential decision-making problems, we introduce a robust MDP framework, in which the agent and nature effectively play a zero-sum game: the agent aims to take actions to maximize reward while nature seeks state transitions, subject to the constraints of that environment, that minimize the agent's reward.
    [Show full text]
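For context on the supervised-learning step mentioned in the excerpt, AlphaGo Zero's per-position training loss (Silver et al., 2017) combines a squared value error, a cross-entropy term between the MCTS visit distribution $\pi$ and the network policy $p_\theta$, and L2 regularization:

$$\ell(\theta) \;=\; \big(z - v_\theta(s)\big)^2 \;-\; \pi^{\top} \log p_\theta(s) \;+\; c\,\lVert \theta \rVert^2,$$

where $z$ is the final game outcome as seen from the sampled position $s$ and $c$ is a small regularization constant.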
  • Efficiently Mastering the Game of Nogo with Deep Reinforcement
    electronics Article: Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge. Yifan Gao 1,*,† and Lezhou Wu 2,†. 1 College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China; 2 College of Information Science and Engineering, Northeastern University, Liaoning 110819, China; [email protected]; * Correspondence: [email protected]; † These authors contributed equally to this work. Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for the AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained by 20,000 self-play games' data (small in quantity compared with 120,000 self-play games' data consumed by the original AlphaZero). The NoGo game program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs.
    [Show full text]
  • AI Chips: What They Are and Why They Matter
    APRIL 2020. AI Chips: What They Are and Why They Matter (An AI Chips Reference). Authors: Saif M. Khan, Alexander Mann. Table of Contents: Introduction and Summary 3; The Laws of Chip Innovation 7; Transistor Shrinkage: Moore's Law 7; Efficiency and Speed Improvements 8; Increasing Transistor Density Unlocks Improved Designs for Efficiency and Speed 9; Transistor Design is Reaching Fundamental Size Limits 10; The Slowing of Moore's Law and the Decline of General-Purpose Chips 10; The Economies of Scale of General-Purpose Chips 10; Costs are Increasing Faster than the Semiconductor Market 11; The Semiconductor Industry's Growth Rate is Unlikely to Increase 14; Chip Improvements as Moore's Law Slows 15; Transistor Improvements Continue, but are Slowing 16; Improved Transistor Density Enables Specialization 18; The AI Chip Zoo 19; AI Chip Types 20; AI Chip Benchmarks 22; The Value of State-of-the-Art AI Chips 23; The Efficiency of State-of-the-Art AI Chips Translates into Cost-Effectiveness 23; Compute-Intensive AI Algorithms are Bottlenecked by Chip Costs and Speed 26; U.S. and Chinese AI Chips and Implications for National Competitiveness 27; Appendix A: Basics of Semiconductors and Chips 31; Appendix B: How AI Chips Work 33; Parallel Computing 33; Low-Precision Computing 34; Memory Optimization 35; Domain-Specific Languages 36; Appendix C: AI Chip Benchmarking Studies 37; Appendix D: Chip Economics Model 39; Chip Transistor Density, Design Costs, and Energy Costs 40; Foundry, Assembly, Test and Packaging Costs 41; Acknowledgments 44. Introduction and Summary: Artificial intelligence will play an important role in national and international security in the years to come.
    [Show full text]
  • ELF Opengo: an Analysis and Open Reimplementation of Alphazero
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. Yuandong Tian, Jerry Ma*, Qucheng Gong*, Shubho Sengupta*, Zhuoyuan Chen, James Pinkerton, C. Lawrence Zitnick. Abstract: The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training ... However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend. In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions. First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018).
    [Show full text]
  • Applying Deep Double Q-Learning and Monte Carlo Tree Search to Playing Go
    CS221 FINAL PAPER: Applying Deep Double Q-Learning and Monte Carlo Tree Search to Playing Go. Booher, Jonathan [email protected]; De Alba, Enrique [email protected]; Kannan, Nithin [email protected]. I. INTRODUCTION: For our project we replicate many of the methods used in AlphaGo Zero to make an optimal Go player; however, we modified the learning paradigm to a version of Deep Q-Learning which we believe would result in better generalization of the network to novel positions. The modification of Deep Q-Learning that we use is called Deep Double Q-Learning and will be described later. The evaluation metric for the success of our agent is the percentage of games that are won against our Oracle, a Go-playing bot available in the OpenAI Gym. Since we are implementing a version of reinforcement learning, there is no data that we will need other than the simulator. By training on the games that are generated from self-play, our player will output a policy that is learned at the end of training by ... the current position. By sampling states from the self-play games along with their respective rewards, the researchers were able to train a binary classifier to predict the outcome of a game with a certain confidence. Then based on the confidence measures, the optimal move was taken. We depart from this method of training and use Deep Double Q-Learning instead. We use the same concept of sampling states and their rewards from the games of self play, but instead of feeding this information to a binary classifier, we feed the information to a modified Q-Learning formula, which we present shortly. III. CHALLENGES: The main challenge we faced was the computational complexity of the game of Go.
    [Show full text]
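The excerpt is cut off before the authors' exact formula. For reference, the standard deep double Q-learning target (van Hasselt et al.), which their modification presumably builds on, decouples action selection (online parameters $\theta$) from action evaluation (target parameters $\theta^{-}$):

$$y_t \;=\; r_t \;+\; \gamma\, Q_{\theta^{-}}\!\Big(s_{t+1},\; \arg\max_{a'} Q_{\theta}\big(s_{t+1}, a'\big)\Big),$$

with the online parameters updated by gradient descent on $\big(Q_\theta(s_t, a_t) - y_t\big)^2$.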
  • Minigo: a Case Study in Reproducing Reinforcement Learning Research
    Minigo: A Case Study in Reproducing Reinforcement Learning Research Brian Lee, Andrew Jackson, Tom Madams, Seth Troisi, Derek Jones Google, Inc. {brianklee, jacksona, tmadams, sethtroisi, dtj}@google.com Abstract The reproducibility of reinforcement-learning research has been highlighted as a key challenge area in the field. In this paper, we present a case study in reproducing the results of one groundbreaking algorithm, AlphaZero, a reinforcement learning system that learns how to play Go at a superhuman level given only the rules of the game. We describe Minigo, a reproduction of the AlphaZero system using publicly available Google Cloud Platform infrastructure and Google Cloud TPUs. The Minigo system includes both the central reinforcement learning loop as well as auxiliary monitoring and evaluation infrastructure. With ten days of training from scratch on 800 Cloud TPUs, Minigo can play evenly against LeelaZero and ELF OpenGo, two of the strongest publicly available Go AIs. We discuss the difficulties of scaling a reinforcement learning system and the monitoring systems required to understand the complex interplay of hyperparameter configurations. 1 Introduction In March 2016, Google DeepMind’s AlphaGo [1] defeated world champion Lee Sedol by using two deep neural networks (a policy and a value network) and Monte Carlo Tree Search (MCTS) to synthesize the output of these two neural networks. The policy network was trained via supervised learning from human games, and the value network was trained from a much larger corpus of synthetic games generated by sampling game trajectories from the policy network. AlphaGo Zero[2], published in October 2017, described a continuous pipeline, which when initialized with random weights, could train itself to defeat the original AlphaGo system.
    [Show full text]
  • Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. David Silver,1∗ Thomas Hubert,1∗ Julian Schrittwieser,1∗ Ioannis Antonoglou,1 Matthew Lai,1 Arthur Guez,1 Marc Lanctot,1 Laurent Sifre,1 Dharshan Kumaran,1 Thore Graepel,1 Timothy Lillicrap,1 Karen Simonyan,1 Demis Hassabis1. 1DeepMind, 6 Pancras Square, London N1C 4AG. ∗These authors contributed equally to this work. Abstract: The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging games. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case. The study of computer chess is as old as computer science itself. Babbage, Turing, Shannon, and von Neumann devised hardware, algorithms and theory to analyse and play the game of chess. Chess subsequently became the grand challenge task for a generation of artificial intelligence researchers, culminating in high-performance computer chess programs that perform at superhuman level (9, 14).
    [Show full text]