DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

AlphaZero to Alpha Hero
A pre-study on Additional Tree Sampling within Self-Play

FREDRIK CARLSSON

JOEY ÖHMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Bachelor in Computer Science
Date: June 5, 2019
Supervisor: Jörg Conradt
Examiner: Örjan Ekeberg
Swedish title: Från AlphaZero till alfahjälte - En förstudie om inklusion av additionella trädobservationer i straffinlärning

Abstract

In self-play reinforcement learning an agent plays games against itself and, with the help of hindsight and retrospection, improves its policy over time. Using this premise, AlphaZero famously managed to become the strongest known Go, Chess, and Shogi entity by training a deep neural network from data collected solely from self-play. AlphaZero couples this deep neural network with a Monte Carlo Tree Search that drastically improves the network's initial policy and state evaluation. When training, AlphaZero relies on the final outcome of the game for the generation of training labels. By altering the learning target to instead make use of the improved state evaluation acquired after the tree search, the creation of training labels for states exclusively visited by tree search becomes possible. We propose the extension Additional Tree Sampling, which exploits this change of learning target, and provide theoretical arguments and counter-arguments for the validity of the approach. Further, an empirical analysis is performed on the game Connect Four, with results that justify the change in learning target. The altered learning target seems to have no negative impact on the final player strength nor on the behavior of the learning algorithm over time. Based on these positive results we encourage further research on Additional Tree Sampling in order to validate or reject the usefulness of this method.

Sammanfattning

In self-play reinforcement learning, an agent plays against itself. With the help of sophisticated algorithms and retrospection, the agent can learn a good policy over time. This method has made AlphaZero the world's strongest player in Go, Shogi, and Chess, by training a deep neural network with data collected solely from self-play. AlphaZero combines this deep neural network with a Monte Carlo Tree Search algorithm that greatly strengthens the network's evaluation of a board. The original version of AlphaZero generates training data with the final outcome of a game as the learning target. By instead changing this learning target to the result of the tree search, the creation of training data from boards discovered only through tree search becomes possible. We propose an extension, Additional Tree Sampling, that exploits this change of learning target. This is followed by theoretical arguments for and against this extension of AlphaZero. Further, an empirical analysis is performed on the game Connect Four, which supports that the modification of the learning target is reasonable. The altered learning target shows no signs of degrading the final player's strength or the behavior of the learning algorithm during training. Based on these positive results, we encourage further research on Additional Tree Sampling to determine whether this method would improve AlphaZero.

Acknowledgements

We both feel a strong need to personally thank all of the people who have supported, helped, and guided us throughout the work on this thesis. We are truly appreciative of all the good folks who have aided us along the way. Special thanks are in place for senior AI researcher Lars Rasmusson, who showed great interest in our problem and was of great help during many discussions, culminating in many hours spent in front of a whiteboard. Jörg Conradt, our supervisor, aided us greatly by spending a lot of his time and energy to supply us with much of the needed hardware. His devotion and constant availability were equally commendable; rarely has a humanoid been witnessed to respond so quickly to emails. Finally, a warm thank you is reserved for Sverker Jansson and RISE SICS, for giving us access to their offices and supplying us with many much-needed GPUs.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope
  1.3 Purpose

2 Background
  2.1 Reinforcement Learning
    2.1.1 Exploration versus Exploitation
    2.1.2 Model-based Reinforcement Learning
    2.1.3 State Evaluation
  2.2 Deep Learning
    2.2.1 Artificial Neural Networks
    2.2.2 Training Neural Networks
    2.2.3 Skewed Datasets & Data Augmentation
    2.2.4 Summary
  2.3 Monte Carlo Tree Search
    2.3.1 Selection
    2.3.2 Evaluation
    2.3.3 Expansion
    2.3.4 Backpropagation
    2.3.5 Monte Carlo Tree Search As A Function
  2.4 AlphaGo & AlphaZero Overview
    2.4.1 AlphaZero Algorithm
    2.4.2 Self-Play
    2.4.3 Generating Training Samples
    2.4.4 Replay Buffer
    2.4.5 Supervised Training
    2.4.6 Network Architecture
  2.5 Environment
    2.5.1 Connect Four

3 Hypothesis
  3.1 The Learning Target
  3.2 Additional Tree Sampling
  3.3 Potential Issues
  3.4 Quality Of Data
  3.5 Skewing The dataset

4 Method
  4.1 Self-Play
  4.2 Replay Buffer
  4.3 Pre-processing Data
  4.4 Supervised Training
  4.5 Network Architecture
  4.6 State Representation
  4.7 Experiment Setup
  4.8 Evaluating Agents
    4.8.1 Generalization Performance
    4.8.2 Player Strength
    4.8.3 Agent Behavior

5 Results
  5.1 General Domain Performance
  5.2 Player Strength Comparison
  5.3 Behavior Over Generations

6 Discussion
  6.1 General Domain Performance
  6.2 Player Strength Comparison
  6.3 Behavior Over Generations
  6.4 Limitations
  6.5 Future Research

7 Conclusions

References

Appendix A The network used in the empirical test provided for Connect Four

Appendix B Played Games Against Optimal

Appendix C GitHub: Implementation & Pre-trained Agents

1 Introduction

Thanks to their predefined rules and limited state spaces, games have throughout history served as a testbed for Artificial Intelligence. The bounded environmental complexity and clear goal often give a good indication of how well a particular agent is performing and act as a metric of success. Additionally, the artificial nature of most games has the benefit of being easy to simulate, removing the need for hardware agents to interact with the real world. Historically, most research within the field of game-related AI has been devoted to solving classical 2-player board games, such as Chess, Backgammon, and Go.

One recent milestone for AI in such domains is AlphaZero[1], which achieves superhuman strength at Go, Chess, and Shogi by only playing against itself. The predecessor AlphaGo[2] was produced and optimized for the game Go and famously managed to defeat one of the world's best Go players, Lee Sedol. Although AlphaGo, unlike AlphaZero, bootstrapped from human knowledge, AlphaZero decisively outperformed all versions of AlphaGo. Perhaps even more impressively, AlphaZero also managed to defeat the previously strongest chess AI, Stockfish[3], after only 4 hours of self-play, where Stockfish had been developed and hand-tuned by AI researchers and chess grandmasters for several years.

During self-play, AlphaZero performs a tree search, exploring possible future states before deciding what move to play. After making a move, it switches sides and repeats the procedure, playing as the opponent. As a game finishes, a policy label and a value label are created for every visited state, where the value label is taken from the final outcome of the game. By doing this, AlphaZero learns to predict the final outcome of a game from any given state. Since the game outcome is only applicable to states actually visited, this limits the number of training labels that can be created from a single game, excluding states only seen during the intrinsic tree search.

Changing the evaluation to instead target information collected during tree search would break the dependency posed on label creation and would theoretically allow for the creation of additional tree labels. This thesis proposes the extension "Additional Tree Sampling" and provides theoretical arguments for the benefits and viability of this extension. Further, an empirical analysis is performed on the viability of using evaluation labels that do not utilize the final outcome of the game, as this is needed for the introduction of additional tree samples.

To our knowledge, the extension of additional tree sampling and its effects are yet to be proposed and properly analyzed, and as of today, there exists very little documentation regarding the effects of altering the evaluation labels.

1.1 Problem Statement

AlphaZero currently relies on the final outcome of a game for the creation of evaluation labels. As this outcome is only present for states visited during that specific game, this enforces a strong dependency on which states are possible candidates for label creation. This, among other factors, creates a need to explore a vast number of games, as only so much information can be gathered from a single game. We propose an extension to the AlphaZero algorithm, where altering the evaluation label allows for the creation of additional training samples.

Theoretical arguments and counter-arguments are provided for utilizing additional tree samples. An empirical analysis is performed, providing insights into the effects caused by alternative evaluation labels. The non-theoretical analysis is realized by implementing the original algorithm as well as the modification, allowing for comparison in player strength and overall domain generalization. Using this information we try to validate or reject the possibility of breaking the dependency set by creating labels from the final game outcome. This task can be boiled down to answering the following question.

“Using the AlphaZero algorithm on the game of Connect Four, how does altering the evaluation label from the final game outcome, to post-search state evaluation, affect player strength and general domain knowledge?”

1.2 Scope

This thesis does not provide proof nor sufficient data to completely validate or reject the proposed alterations. Rather, it theoretically proposes modifications to AlphaZero and then provides data and conclusions that validate the first premise of these modifications. This data is gathered from several runs of AlphaZero in a single domain, using different evaluation labels, and acts as the main evidence for the final conclusions.

AlphaGo and AlphaZero were developed by DeepMind, which, unlike this thesis, does not suffer from severe constraints in time or hardware. This issue is addressed by limiting the research project to the game Connect Four, thus making the assumption that similar conclusions can be transferred to similar environments. Furthermore, these constraints bound to what extent certain hyperparameters can be tuned and limit the analysis regarding the possibly complex effects our modifications might have on these hyperparameters.

1.3 Purpose

The motivation for mastering domains via self-play using deep reinforcement learning goes beyond board games. In domains where datasets are absent or too expensive, supervised learning is not feasible. Moreover, the information contained in a dataset defines an upper limit for the behavior of the agent. By allowing an agent to learn from its own experience without prior data and human intervention, superhuman performance can be achieved, allowing for new discoveries to be made.

Although AlphaZero achieves state-of-the-art results and decisively outperforms previous approaches, a vast amount of computational resources is needed to benefit from this algorithm. Therefore, if there is a possibility to increase the amount of information extracted per game, this could help lower the number of games needed during self-play. Hopefully, our proposed extension could thereby help make AlphaZero a more viable approach on commodity hardware.

2 Background

A wide variety of different approaches has been deployed throughout history in pursuit of creating AI agents for games. A distinction to be made is between approaches utilizing some form of machine learning, and those only executing predefined sets of instructions. Throughout history, there has been a clear bias towards non-learning algorithms, as these have been able to tackle bigger domains and have generally outperformed their learning counterparts. Recent events have however turned the tides, and a clear preference for machine learning based approaches can now be noted.

Two examples of the different approaches are TD-Gammon[4] and Deep Blue[5], where the former was set to play Backgammon and Deep Blue played Chess. TD-Gammon relied on self-play reinforcement learning, in juxtaposition to Deep Blue, which used a pure Minimax[6] based approach deprived of any form of machine learning. Both of these managed to achieve top human-level play in their respective domains. Deep Blue even transcended human capabilities by famously managing to defeat Garry Kasparov, the Chess World Champion at the time.

Chess has a game-state complexity of 10^47[7] and, comparing this to the 10^20 possible states in Backgammon[4], it can be argued that Chess is the harder game. The validity of this argument is however questionable, as Backgammon incorporates stochastic elements, unlike Chess. Nonetheless, many people took the success of Deep Blue as a clear indication that the brute-force method was the more viable approach compared to machine learning.

The complexity of Chess is however dwarfed by the game of Go, which boasts an average branching factor of 250 and a state-space complexity of 10^170[8]. This intricate domain complexity effectively eliminates any possibility of mapping states to actions in a tabular manner, making the Minimax approach infeasible. Using intuition and human ingenuity, top human Go players have therefore historically always decisively outperformed the best AI. Considering this vast complexity, many AI researchers argued that it would take many years until AI could outperform humans at the game of Go.

In March 2016, AlphaGo made a breakthrough for not only Go programs but also deep reinforcement learning, defeating one of the world's best Go players, Lee Sedol. Instead of using a brute-force method, AlphaGo learns to play Go by training a neural network, first by supervised learning and then through further improvements from self-play. This neural network is used to guide a tree search, allowing AlphaGo to intuitively disregard moves it deems unfit. As the quality of the network predictions increases, the tree search explores states with higher intelligence, gradually overcoming the boundaries set by the initial dataset used for supervised learning.

The successor AlphaGo-Zero[9] skips the supervised stage and starts by learning from self-play, boldly stating "Mastering the game of Go without human knowledge". As AlphaGo-Zero decisively defeated AlphaGo 100-0, the question was raised whether human knowledge runs the risk of plateauing progress rather than boosting it. AlphaZero then further polished the ideas set by AlphaGo-Zero, managing to not only defeat all previous versions of AlphaGo but also to master the games Chess and Shogi.

Before diving into a detailed description of AlphaZero, the reader should be acquainted with crucial concepts regarding this algorithm. The following sections aim to briefly cover these concepts, as well as why and how they are utilized. Finally, a detailed description of the original implementation and the set of hyperparameters used is given.

2.1 Reinforcement Learning

The machine learning subfield that is reinforcement learning can loosely be defined as having an agent act within an environment in order to learn an internal decision policy that maximizes the future sum of rewards. Depending on the task, these rewards can be either external from the environment or intrinsic, as shown by Deepak Pathak et al.[10]. Either way, these rewards should in some way capture the essence of the desired problem, so that an optimal policy is also a solution to the given problem. Unlike in supervised and unsupervised learning, there often does not exist any predefined dataset for the given problem. Rather, the agent is tasked to explore the environment whilst adapting its policy in accordance with its new discoveries.

2.1.1 Exploration versus Exploitation

Reinforcement learning algorithms are often faced with the task of finding an optimal policy whilst dealing with the problem of incomplete information. In these cases, where the agent only has access to partial environmental information, the agent is continuously forced to make decisions regarding exploration and exploitation[11]. Exploitation refers to executing actions that are known to yield good results, whereas exploration is when the agent instead prefers actions that maximize the gain of new information.

Finding a good balance between exploration and exploitation is essential, as time spent exploring non-optimal paths is by definition non-optimal. The same argument also holds for time spent exploiting suboptimal actions. It is therefore up to the algorithm to reinforce strategies yielding high rewards, while balancing this against the potential benefit of exploring new strategies.

The exploration-exploitation tradeoff is present in AlphaZero on several levels. AlphaZero tackles this when deciding what moves to play, but also internally each time it performs its guided tree search. Finding the right balance of exploration-exploitation is critical for AlphaZero's learning rate and requires the tuning of several parameters.

2.1.2 Model-based Reinforcement Learning

AlphaZero belongs to the sub-category known as Model-Based reinforcement learning algorithms[12]. In model-based reinforcement learning the agent has access to an internal world model that it can use for intrinsic foresight. This world model can either be given to the agent at the start or learned through experience and exploration of the environment.

AlphaZero is given a perfect world model in the form of hardcoded game rules. This is used to check if a state is terminal and allows for an explicit move simulation, where AlphaZero internally takes a state-action tuple (s, a) and outputs the resulting state s′. The inclusion of this world model allows AlphaZero to perform a tree search that improves the decision policy of what moves to play.

2.1.3 State Evaluation

Figure 2.1: The difference between updating a tabular Value function using Temporal Difference and Monte Carlo, using learning rate α = 0.1 and discount factor γ = 1.

Many reinforcement learning methods incorporate a value function V(s). This function, when given a state and a policy, outputs the expected sum of future rewards collected from s when following policy π. This is used to guide the agent, as actions correlated with states having a higher sum of rewards are by definition more desirable. Further, the possession of a perfect value function allows for perfect play by greedily performing actions that maximize the value function.

There exist different ways of learning the value function, but an often used method is to sample data from the domain and gradually tune V so that it approximates the distribution of the sampled data. This can be achieved by iteratively decreasing the prediction error between the value function and the collected data. Different reinforcement learning methods perform sampling in different ways, and this can impact the convergence rate. Two well-known approaches to sampling are the Monte Carlo and Temporal Difference methods[13], as displayed in figure 2.1.

The straightforward Monte Carlo method is performed by sampling the final sum of rewards acquired at the end of a session. Although the law of large numbers shows that this method converges to the true evaluation function given sufficient samples, it introduces some apparent restrictions. Since samples are only obtained at the end, we only extract one sample from every full session and are therefore bound to only update our model after completing a full session. Additionally, the further away our collected reward is from the initial state, the lower the accuracy of the collected sample.

TD-Learning, in contrast, relies on the concept of bootstrapping and can be summarized as "estimating from an estimation". Instead of waiting until the end of a session, a sample is acquired as soon as we get to the next state and have access to a new value estimation. The difference between the value estimations acquired from V(s_t) and V(s_t+1) is referred to as the TD-error. By minimizing the TD-error we essentially update our previous value estimation to approximate our newer estimation. This approach allows us to gather samples after every state transition and update our model more frequently. However, using our new estimation as a learning target entails that the quality of our learning is directly coupled to the quality of our current estimation. Thus, to learn the true value function, information from the final outcome is still needed, and time is required for it to properly propagate into earlier states.

Since both Monte Carlo and Temporal Difference methods can be proven to converge to the true value function[13], the choice of sampling method is domain dependent. The original implementation of AlphaZero utilizes the Monte Carlo method for updating its value function. Additional Tree Sampling requires an alternative learning target, and the target proposed is more reminiscent of the ideas behind Temporal Difference learning.
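To make the distinction concrete, the sketch below shows the two tabular update rules side by side, using the same hyperparameters as figure 2.1: learning rate α = 0.1 and discount factor γ = 1. This is a minimal illustration, not code from the thesis implementation; all names are our own.

alpha, gamma = 0.1, 1.0
V = {}  # tabular value function, state -> estimated value


def monte_carlo_update(episode, final_return):
    """Update every state visited in a session towards the final outcome."""
    for state in episode:
        v = V.get(state, 0.0)
        V[state] = v + alpha * (final_return - v)


def td0_update(state, reward, next_state):
    """Update one state towards the bootstrapped estimate r + gamma * V(s')."""
    v, v_next = V.get(state, 0.0), V.get(next_state, 0.0)
    td_error = reward + gamma * v_next - v
    V[state] = v + alpha * td_error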

2.2 Deep Learning

2.2.1 Artificial Neural Networks

The Artificial Neural Network is a machine learning model known for its ability to learn complex patterns and its strong generalization abilities[14]. These networks are formed by neurons connected via weighted edges, which process data by scalar multiplication. Every neuron sums its incoming data and applies a nonlinear activation function, e.g. ReLU, Sigmoid, or TanH. These nonlinear activation functions are a necessity, as otherwise, no matter how many neurons are used, the approximated function is bound to be linear.

Most networks are formed by an input layer followed by several stacked layers referred to as hidden layers, which sequentially perform computations, propagating the result to the next layer. This is continued throughout the network until arriving at the last layer, named the output layer, where the computational result is interpreted as the model's output. Figure 2.2 demonstrates a rudimentary example of a network with two hidden layers consisting of 9 and 4 neurons respectively. In mathematical terms, the neural network can be described as a nonlinear function, mapping inputs to outputs, processing data of arbitrary shapes.

Figure 2.2: A simple neural network taking an image as input and propagating the computation through its two hidden layers, until reaching the output layer. The hidden layers use the ReLU activation function whilst the output layer uses a Sigmoid activation function.

There exists an infinite number of ways a neural network can be formed by stacking and connecting layers. Although the universal function approximation theorem[15] states that a neural network utilizing a single hidden layer can approximate any possible continuous function, it is very common to stack a multitude of layers, creating a deep neural network. Empirically this has been shown to work much better[16], and its success has been theorized to be partially caused by the hierarchical structure giving the layers the ability to share common subexpressions. Unsurprisingly, similar architectures are used in AlphaZero.

As deep neural networks are often used for complex function approximation, it is not uncommon for them to contain several million learnable parameters in the form of weights and bias values. The expressive power of a machine learning model is strongly related to its number of parameters, causing deep neural networks to inherit a significant capability for overfitting. However, Zihang et al. have shown that this vast capability by itself does not imply that the model will overfit[17]. This contradicts conventional beliefs, and they conclude that we have yet to discover exactly why deep neural networks tend to generalize so well. Whilst certain aspects of neural networks remain undiscovered, their ability to approximate complex functions is evident.

AlphaZero strives to train a deep neural network that captures the intricate game intuition needed for complex games like Go, Shogi, and Chess. A game state representation s is given as input and the network outputs both the move probabilities vector p and a scalar value v representing the estimated win probability for the current player.

2.2.2 Training Neural Networks

In supervised learning there exists a dataset that contains a label output for every input sample. The most common way to train a network when having access to such a dataset is to use a gradient-based approach[18]. This is achieved by calculating the gradient of the output error with respect to the weights; this can be performed on individual samples, batches of data, or the complete dataset. Once the gradient for one or several samples has been calculated, a step in weight space is taken in the direction minimizing the error. This way, the network weights are gradually altered to increase the accuracy of the network.

Most often, gradient descent methods iteratively calculate the derivative of the average prediction error acquired from batches of the dataset, with respect to all the parameters in the model. This is known as batch gradient descent and implies that every time the parameters in the network are tuned to fit a certain target, previous adjustments are also affected. Intuitively, this can be seen as an unnecessary approximation, and one might be tempted to instead include the whole dataset in every gradient calculation. Recent research, however, indicates that using too large batches decreases generalization[17], as this increases the chance of converging into a sharp optimum. Thus, using a small batch size is one of the best regularization methods currently available.

2.2.3 Skewed Datasets & Data Augmentation

As deep learning models require a vast number of training samples, data augmentation is often used to generate additional synthetic data[19]. Strong parallels can be drawn between Additional Tree Sampling and data augmentation. The reader should, therefore, be introduced to the original concept of data augmentation and its usual applications and problems.

Independent of the method used for training the network, it is important to remember that the true goal is to find a set of parameters that not only performs well on the training data but also generalizes well to samples not yet encountered. Whilst there are many hyperparameters which affect the learning performance in different ways, the data itself is what carries the nuances of the domain and defines what the network will learn. The data thus needs to provide a balanced representation of the domain in order for the network to be able to generalize.

Skewed datasets occur when there is an imbalance between samples correlated to different labels. This can be very troublesome for gradient-based optimizers, and too large an imbalance might lead to underrepresented labels being neglected completely. Skewed datasets are best addressed by balancing the number of samples available for each label. In cases where collecting new data is not possible, one can sometimes make use of different forms of data augmentation. New data is then synthetically generated from previous data in the pursuit of providing a more complete and robust representation of the domain.

A simple example would be that of image classification, where one might synthesize additional data by rotating, flipping or applying other transformations to the already collected images, as done in figure 2.3. As the network tries to learn a function from feature space to label space, increasing the number of known mappings can stabilize the learning process. These newly generated data points might be of lesser quality than the originally collected data, but nonetheless, increasing the dataset with augmented samples has been shown to improve generalization[20].

Figure 2.3: Synthesized data might be generated by applying simple transformations to originally collected data.

The creation of synthesized data requires domain knowledge since there exist no valid data transformations that are applicable in all domains. Earlier versions of AlphaGo made use of the symmetrical nature of Go by creating additional rotated and flipped training samples. AlphaZero later disregarded this domain knowledge as games like Chess and Shogi are non-symmetrical, thus abandoning the use of synthesized data.
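As an illustration of the kind of augmentation earlier AlphaGo versions used, the sketch below generates the 8 symmetric variants of a square, Go-like board with NumPy. It is a hedged example of the general idea, not DeepMind's code; note that in a real pipeline the policy label would have to be transformed in the same way as the board.

import numpy as np

def dihedral_symmetries(board):
    """Return the 8 equivalent boards (4 rotations, each optionally mirrored)
    of a square board, as used for data augmentation."""
    variants = []
    for k in range(4):
        rotated = np.rot90(board, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants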

2.2.4 Summary

In conclusion, neural networks are models capable of learning an accurate approximation of any continuous function f, given sufficient data. The neural network can be formulated as a function fθ with parameters θ. In AlphaZero this function is a mapping from states s to policy-value tuples (p, v), as shown in equation 1 below.

(p, v) = fθ(s) (1)

Using conventional neural network methods, the AlphaZero network predicts the utility of states and the corresponding possible moves.

2.3 Monte Carlo Tree Search

The field of tree search has historically been tightly coupled with AI and has been used on numerous occasions to successfully compete with humans in games. Tree search in well-known deterministic environments allows for simulations of actions with full certainty, enabling an informed evaluation of each possible action. This further allows computational power to be leveraged, and tree search should be seen as a powerful knowledge improvement operator.

Unlike the Minimax approach used by Deep Blue, which explores every possible outcome to a certain depth, Monte Carlo Tree Search (MCTS)[21] treats the search problem as an exploration-exploitation problem. MCTS balances the allocation of resources between promising paths and the exploration of paths potentially yielding needed information. By keeping track of every node's visit count and mean state value, the algorithm decides what path to visit each iteration based on previous information.

MCTS is performed in an iterative manner, where every iteration starts from the root node and traverses the tree until encountering a leaf node, upon which it expands the tree if the leaf node represents a non-terminal state. Every iteration can be divided into four distinct stages: Selection, Evaluation, Expansion and Backpropagation. Following is a detailed description of these four stages that together form the MCTS algorithm, as shown in algorithm 1.

Algorithm 1 Bootstrapped Monte Carlo Tree Search

1:  for number of iterations do
2:      node ← root
        (Selection)
3:      while node.visits > 0 and ¬node.terminal do
4:          node ← argmaxPuct(node.children)
5:
        (Evaluation and Expansion)
6:      if node.visits = 0 then
7:          (node.terminal, z) ← terminalEvaluation(node)
8:          if node.terminal then
9:              node.value ← z
10:         else
11:             (p, v) = fθ(s)
12:             node.value ← v
13:             node.children ← expand(node, p)
14:     value ← node.value
15:
        (Backpropagation)
16:     node.visits ← node.visits + 1
17:     while node ≠ root do
18:         node ← node.parent
19:         node.visits ← node.visits + 1
20:         node.value ← node.value + value
21:
22: π ← normalizedPolicy(root.children)
23: v′ ← root.score/root.visits

2.3.1 Selection

Starting from the root, the tree is iteratively traversed until a leaf node is encountered (row 3 in algorithm 1). A leaf node is defined by either containing a terminal game state or having a visit count of zero. At every stage of the traversal, the node coupled with the action a_t is selected for further traversal, where a_t is calculated as shown in equation 2.

a_t = argmax_a (Q(s, a) + U(s, a))    (2)

Q(s, a) corresponds to the mean state value v′ for the node reached by performing action a in state s. U(s, a) is derived from the Polynomial Upper Confidence Trees (PUCT) formula:

U(s, a) = c_puct · P(s, a) · √N(s) / (N(s′) + 1)    (3)
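For concreteness, a minimal sketch of the selection rule in equations 2 and 3 is given below, where P(s, a) is the prior stored from the network prediction, N(s) the visit count of the parent and N(s′) the visit count of the child reached by action a. The node attributes (children, visits, value_sum, prior) are assumptions about the tree representation, not names taken from the thesis implementation.

import math

def puct_select(node, c_puct):
    """Pick the child maximizing Q(s, a) + U(s, a), as in equations 2 and 3."""
    def score(child):
        # Q(s, a): mean value of the child, zero if it has never been visited.
        q = child.value_sum / child.visits if child.visits > 0 else 0.0
        # U(s, a): exploration bonus from the PUCT formula.
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return q + u
    return max(node.children, key=score)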

2.3.2 Evaluation

Upon encountering a leaf node, its state value is acquired by either network evaluation or the predefined game rules. If the selected node holds a terminal state, the state value is determined by the game rules (row 9 in algorithm 1). If the state is non-terminal, the network is used to predict p and v (row 11), where v is propagated as the state value and p is stored to further guide selection in accordance with the PUCT formula.

2.3.3 Expansion

If the selected node is non-terminal, the tree is expanded by generating new nodes for all attainable children of the selected node (row 13). This is achieved by simulating all legal moves available from the selected node, where these newly generated nodes are all initialized with default values and properties.

2.3.4 Backpropagation

When the state value of the selected node has been acquired, it is propagated back up the tree until it reaches the root, thereby updating the mean state value of every node that was passed during the selection phase, including the root, and incrementing their visit counts. This allows the root node to aggregate the information of the entire tree, resulting in π and v′, which approximate the true action policy and game outcome respectively, with increased accuracy for each search iteration.

2.3.5 Monte Carlo Tree Search As A Function

(π_t, v′_t) = α_θ_{t−1}(s_t)    (4)

Equation 4 shows algorithm 1 as a function. The MCTS function α_θ_{t−1}(s_t) uses the network parameters θ_{t−1} from the previous time step. It takes the current state s_t as input and returns the policy-value tuple (π_t, v′_t).

Monte Carlo Tree Search can thus be seen as an improvement upon the network evaluation function shown in equation 1. This policy-value improvement operator proves extremely powerful not only for self-play learning but for boosting the final performance as well. Consequently, MCTS allows AlphaZero's performance to scale with computational resources.

2.4 AlphaGo & AlphaZero Overview

Instead of using a brute-force method, AlphaGo relies on a neural network-driven MCTS for its move selection. First, the network is trained to imitate human top players by supervised learning on recorded pro games. Upon converging to the level attainable from the dataset, further improvements are achieved by self-play. When playing against itself, AlphaGo generates training samples by using the final outcome of a game and retrospectively deduces how to improve the policy and evaluation of the network. As learning via self-play progresses and the predictions of the network improve, the MCTS explores and evaluates states with higher intelligence, causing the moves performed to gradually get stronger. This cyclic relationship between the neural network and the MCTS is what in essence drives the self-play learning of AlphaGo.

AlphaZero differs from AlphaGo mainly by skipping the supervised stage, starting from zero with self-play and learning solely from its own experience. In a head-to-head matchup, AlphaZero was shown to decisively outperform all previous versions of AlphaGo. While AlphaGo was created for the sole purpose of playing Go, AlphaZero is a more general algorithm, removing the use of any domain-specific knowledge. This was demonstrated as AlphaZero mastered Chess and Shogi using only information about the game rules. In 2017, AlphaZero managed to defeat the previously strongest chess AI, Stockfish, after only 4 hours of training.

Other implementation details also differ between AlphaGo and AlphaZero, such as the network architecture and the number of MCTS iterations during training. The impressive results of AlphaZero seem to indicate that the incorporation of human priors can hinder AI in the long run. In 2019, Richard Sutton argued that history shows that, regardless of the field of AI, utilizing human knowledge is a short-term solution that plateaus long-term progress[22]. This remains a controversial topic, as far from all researchers agree with this statement; Gary Marcus argues that the innate ability to perform tree search disqualifies AlphaZero from the claim of "learning from tabula rasa"[23].

2.4.1 AlphaZero Algorithm

Following is a detailed description of the AlphaZero algorithm; unless stated otherwise, the described concepts apply to all previous versions of AlphaGo. To give the reader an estimation of what range certain hyperparameters fall into, the provided settings and parameters are taken from the official AlphaZero paper[1]. Slight variations in specific parameters can be noted between different versions, but these are negligible for getting an intuitive understanding of the algorithm.

AlphaZero utilizes an asynchronous approach where Self-Play, Supervised Learning and a Replay Buffer all work in parallel. At the start, the network is randomly initialized and passed to several workers that proceed to perform self-play and start collecting game data. As a game is finished, the states visited are analyzed and new training labels are created and sent to the replay buffer. The replay buffer stores data from the 500,000 most recent games and essentially works as a FIFO queue. Simultaneously, a new version of the network is created by performing gradient descent on mini-batches uniformly sampled from the replay buffer.

Figure 2.4: The Self-play Reinforcement Learning cycle used in AlphaZero.

After every 1,000 steps of gradient descent, AlphaGo and its variations perform an evaluation of the newly trained network. This is achieved by having the updated network play against the current best network and comparing the win ratio. The new network is sent to the self-play workers only if it manages to achieve a 55% win ratio. AlphaZero removes this evaluation phase completely, continuously replacing the old network after every training step. Figure 2.4 depicts an overview of the learning cycle.

2.4.2 Self-Play

Using the current version of the network, AlphaZero continuously plays games against itself by guiding its move selection with its network-driven MCTS. Upon encountering a new game state, 800 MCTS iterations are performed and a move is played by randomly sampling from the resulting visit vector. AlphaZero then switches sides and performs the same procedure, but playing as the opponent. This is repeated until the game terminates and the game history is passed to the replay buffer, upon which a new game is started.

In order to encourage exploration and diversity of the games played, the following forms of randomness are incorporated within the self-play phase.

• Moves played are randomly selected by sampling from the resulting visit vector.

• Before a new search is started, Dirichlet noise is added to the predicted policy prior for that state.

• Previous versions of AlphaGo randomly picked one of 8 possible equivalent Go states, generated by rotating or mirroring the state, as it was passed to be evaluated by the network. However, as not all games can make use of such transformations, this was removed in AlphaZero.

These random elements allow for the exploration of new states whilst still being heavily biased towards moves favored by the network. Without this, every game played would be identical, causing self-improvement to stagnate and network predictions to rapidly deteriorate.
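As an illustration, the two randomized steps can be sketched as follows. Only the use of Dirichlet noise and visit-count sampling is described above; the Dirichlet parameter, the mixing weight epsilon and the function names are assumptions made for illustration, not values taken from the papers.

import numpy as np

def add_dirichlet_noise(priors, alpha, epsilon=0.25):
    """Mix Dirichlet noise into the root prior before a new search is started."""
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise

def sample_move(visit_counts):
    """Sample a move proportionally to the visit counts produced by the search."""
    probs = np.asarray(visit_counts, dtype=float)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))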

2.4.3 Generating Training Samples

As a game is terminated, new training samples are generated from all states visited during gameplay. For every state, an evaluation label and a policy label are created. Policy labels are given by normalizing the visit count vector yielded after the 800 search iterations, and the evaluation labels are set to the final reward of the game.

Training on the policy label is an attempt to learn how to approximate the complete result acquired by the MCTS. This approximation helps guide future tree search by acting as a prior for which states to explore, implicitly trying to build upon previous knowledge. The policy is what allows AlphaZero to take on the role of its own teacher, preventing the converged performance from being restricted by the quality and amount of data available from external sources.

The effect of this evaluation label is that AlphaZero converges towards being able to predict the final outcome of a game, given any state. This enhances the accuracy with which AlphaZero evaluates states and helps guide future tree search to exploit truly good states. Balanced by equation 3, this allows AlphaZero to steer the MCTS towards states yielding high state values and high policy predictions.
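Put together, label generation from one finished game can be sketched as follows. The representation of the game history and the sign convention for the outcome are assumptions made for illustration; they are not taken from the thesis implementation.

def make_training_samples(game_history, final_outcome):
    """Turn one finished self-play game into (state, policy label, value label) samples.

    game_history is assumed to hold, for every state visited in the game, the root
    visit counts after the 800 search iterations and the player to move (+1 or -1).
    The value label z is the final outcome seen from the player to move."""
    samples = []
    for state, visit_counts, player_to_move in game_history:
        total = sum(visit_counts)
        policy_label = [n / total for n in visit_counts]
        z = final_outcome if player_to_move == 1 else -final_outcome
        samples.append((state, policy_label, z))
    return samples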

2.4.4 Replay Buffer

A replay buffer is applied in order not to overfit to the most recent training samples. The replay buffer stores training labels generated from the 500,000 most recent games and forces AlphaZero to gradually improve its performance, as encountered discoveries remain relevant for training until completely phased out.

The size of the replay buffer determines how gradual and robust the learning of AlphaZero will be and is, in essence, a speed versus robustness tradeoff. A replay buffer that is too small results in unstable training, whilst one that is too large slows training down, as old training labels linger too long. No exact explanation has been given for the size used in the papers released by the authors.

2.4.5 Supervised Training

Mini-batch gradient descent with a momentum of 0.9 is applied to the network, where every batch includes 2048 labels uniformly sampled from the replay buffer. The loss function incorporates L2 regularization and weights the evaluation and policy outputs equally. Learning rate annealing was applied through a learning rate schedule that decreased the learning rate over time.
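The resulting objective can be sketched as below, operating on a batch of labels. The L2 weighting constant c and the batched NumPy formulation are assumptions made for illustration, not values or code from the paper.

import numpy as np

def alphazero_loss(z, v, pi, p, weights, c=1e-4):
    """Combined training loss: mean-squared error on the value output,
    cross-entropy on the policy output and an L2 penalty on the parameters.

    z, v are 1D arrays of value labels and predictions; pi, p are 2D arrays
    (batch x moves) of policy labels and predictions; weights is a list of
    parameter arrays."""
    value_loss = np.mean((z - v) ** 2)
    policy_loss = -np.mean(np.sum(pi * np.log(p + 1e-8), axis=1))
    l2_penalty = c * sum(np.sum(w ** 2) for w in weights)
    return value_loss + policy_loss + l2_penalty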

2.4.6 Network Architecture

The network architecture does not only define the complexity limit of what the network is able to learn, but also affects its learning efficiency. With the deep neural networks commonly used today, the problem of vanishing and exploding gradients[24] has become evident. To address this issue, AlphaZero uses a residual neural network architecture[25] to form a residual tower with skip-connections allowing the gradient to bypass layers. Consequently, the original gradient can propagate from the output layers to the early layers without losing important information. This is crucial, as the AlphaGo Zero network consists of 19 residual blocks, together accumulating 40 parameterized layers and resulting in 24,000,000 learnable parameters[9].

2.5 Environment

When exploring AlphaZero, an appropriate environment must be considered. The games Go, Chess, and Shogi have previously been explored by AlphaZero and would, therefore, be appropriate environments for introducing modifications. However, these are vastly convoluted domains and thus suboptimal for projects suffering from hardware constraints.

2.5.1 Connect Four

The game of Connect Four has a suitable complexity, with a branching factor of 4 and a state-space complexity of 10^13[8]. Despite being less complex than Go, Connect Four is still too complex to solve with a naive brute-force approach if one expects a reasonable play time, making it an interesting and feasible target for AlphaZero executed on commodity hardware.

Connect Four has however been solved thoroughly by John Tromp in 1995[26], using an enhanced Minimax approach that computed for 5 years. This implementation utilized a list of enhancements such as Alpha-Beta Pruning, Transposition Tables, and Iterative Deepening. The definitive outcome of this calculation was that the starting player can force a win within 22 moves.

Although resource intensive, the possibility of solving states completely entails that a true performance metric can be created. Moves performed can be compared to the optimal moves, and a model's generalization ability can be estimated. The combination of a sufficiently complex domain and having a solver available therefore motivates our choice of Connect Four as a testing environment.

3 Hypothesis

Our hypothesis states that an increase in the number of samples generated per game should lower the number of games required during training. Further, we propose that states encountered exclusively through tree search still hold viable information and are therefore candidates for the generation of training samples. This is fundamentally different from acquiring more data by playing additional games and should be thought of as enhancing the data utilization of independent games. We refer to this as Additional Tree Sampling.

3.1 The Learning Target

As described earlier, AlphaZero utilizes the Monte Carlo method and uses the final outcome of the game as the learning target for the value function. The final outcome is only representative of the states visited during that game, and the learning target thus needs to be altered if one wishes to create samples from states not on the direct path of the played game. Taking inspiration from Temporal Difference learning, an alternative learning signal would be the improved node value v′ obtained from the MCTS.

Every MCTS iteration can be thought of as a value and policy improvement operator, adding more information to the estimation of the outcome in accordance with policy π. In trying to minimize the error between the initially predicted v and v′, the network essentially learns to approximate its final estimation after the search. This allows for more accurate estimations in future evaluations. The viability of v′ is likely domain dependent, as an increase in intermediate rewards would give the MCTS access to more true information.

Using this approach, both policy and evaluation labels are generated via previous estimations, putting a bigger emphasis on the initial estimation. This could be a cause for concern and could potentially destabilize the learning progress, halting convergence. Therefore, in order to validate this modification, we provide results from an empirical study comparing z and v′ as the evaluation target.
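The difference between the two targets can be made explicit with a small sketch; the node attributes and the flag name are hypothetical, chosen for illustration only.

def make_value_label(root, final_outcome, use_tree_value):
    """Choose the evaluation label for one visited state.

    With use_tree_value=False this is the original AlphaZero target, the final
    game outcome z. With use_tree_value=True it is the improved post-search
    estimate v', i.e. the accumulated value at the root divided by its visits."""
    if use_tree_value:
        return root.value_sum / root.visits  # v', the post-search estimate
    return final_outcome                     # z, the Monte Carlo target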

3.2 Additional Tree Sampling

By changing the target for which AlphaZero trains its value function, it becomes possible to generate additional training samples from states exclusively explored through tree search. This section provides several arguments for why the method of Additional Tree Sampling would work.

It is to be noted that information found in the nodes exclusively visited by tree search is not wasted by AlphaZero. Rather, this information is propagated to the root node of the search and guides the MCTS, which in turn shapes the policy label. However, thinking that the information found in non-root nodes is strictly a subset of the information contained within the root node is incorrect. The root node will, by the nature of MCTS, have the most visits and is therefore a less noisy sample of the true value and policy functions. But the root node only maps its information to one state, expressing nothing about the value or policy of other states, with the possible exception that the move most favored by the policy should lead to a state harboring a value similar to the value attained at the root.

Comparing the method of Additional Tree Sampling to the approach of data augmentation, it would seem that the additional data can help stabilize the learning progress. Unlike synthesized data, the additional tree samples are not only unique mappings between inputs and outputs, but also carry unique domain information. The same picture of a bird, but rotated, is a new mapping between input and output, but no essential domain information is added. This is not the case with additional tree samples, as these describe a distinctly new state, albeit with less reliable information. Additional Tree Sampling does not require any domain-specific knowledge, except the ability to perform tree search.

Since every node in the expanded search tree uniquely couples its collected information to its corresponding state, a perfect learning model should be able to utilize these mappings of states to data. Considering a simple case, where a tabular learning model was applied instead of a neural network, there would exist no reason to exclude this information. New information would then always be stored in the corresponding table cell if it was collected with a higher iteration number or updated sufficiently recently. Unsurprisingly, a tabular approach is not perfect, since it neither generalizes nor scales sufficiently for complex domains. Instead, AlphaZero uses a deep neural network that generalizes far better but requires vast amounts of data.

3.3 Potential Issues

Deep Learning is a young field and there are yet many things unexplored or unanswered, making it hard to predict any outcome with certainty. A multitude of reasons and phenomena might counter the viability of this hypothesis. Following are some proposed counter-arguments to the method of Additional Tree Sampling, and a brief discussion of each of these concerns.

3.4 Quality Of Data

As a result of changing the evaluation target, the quality of a sample is directly proportional to the number of search iterations computed from that node. The average iteration number per sample, therefore, drops when including non-root nodes. This inclusion of lower-quality samples might make the overall learning signal too noisy, aggravating the learning process.

Deep neural networks have been shown to be very resilient to vast amounts of noise[27], and some studies even show that the introduction of noise can enhance learning[28]. The additional tree samples are also different from noise, as the information is still valid. Considering the noise resilience of deep neural networks, it seems unlikely that the introduction of lower-quality samples should halt progress.

As the iteration number associated with each sample is available to the algorithm, tradeoffs can be incorporated to further utilize the indicated quality of each sample, allowing conscious quality-quantity decisions to be made by implementing a threshold that excludes samples of insufficient quality. The tuning of such a threshold would, therefore, increase the control over the learning procedure. Further, one could extend the algorithm by making the gradient descent impact of every training sample proportional to its iteration number.
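A sketch of how such a threshold could be applied when harvesting samples from a finished search tree is given below. The tree representation (state, visits, value_sum, children) is an assumption, and the visit count is returned with each sample so that it could also be used to weight its gradient descent impact, as suggested above.

def collect_tree_samples(root, min_visits):
    """Emit one (state, policy, value, visits) sample for every expanded node
    whose visit count reaches the quality threshold."""
    samples, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.visits >= min_visits and node.children:
            total = sum(child.visits for child in node.children)
            policy = [child.visits / total for child in node.children]
            value = node.value_sum / node.visits
            samples.append((node.state, policy, value, node.visits))
        stack.extend(node.children)
    return samples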

3.5 Skewing The dataset

The inclusion of additional tree samples from a single game might steer the dataset to become homogeneous, causing the network to overfit to certain paths and thus reducing generalization. This would counter the potential gain of adding more samples, as more games would have to be played to gain full domain knowledge. The relationship between hyperparameters such as replay buffer size, domain complexity, and the number of MCTS iterations is yet to be properly documented, making it hard to make any qualitative assessments on this matter.

4 Method

In order to remove the innate randomness that comes with asynchronous parallel algorithms, the original algorithm was reworked to follow a sequential pattern. The implementation still divides the learning cycle into three distinct steps: self-play, replay buffer, and training (figure 2.4), but sequentially switches between these indefinitely. This allows for reproducibility and helps segregate the results for proper analysis.

After the network has been randomly initialized, it is passed to the self-play workers and the self-play phase is started. When training data from 3000 games has been collected and added to the replay buffer, the algorithm switches to perform supervised learning. Upon finishing, the updated network is broadcast to the self-play workers and the algorithm starts 3000 new games of self-play. This back-and-forth between collecting new training data using the latest network and updating the network accordingly is continued until the algorithm is manually terminated.

4.1 Self-Play

During self-play, 800 iterations of MCTS are performed with an exploration constant c_puct set to 4, and Dirichlet noise is added to the root prior at iteration 0, with the parameter α = 1. As a state is passed to the network for evaluation, the network is given one of the two possible equivalent mirror states.

Game data is collected by 10 self-play workers, each with its own dedicated GPU, each playing 300 games in parallel. In order to improve GPU utilization, each worker batches the states to be predicted from all its games and sends them to the GPU for evaluation. To further improve performance, each worker keeps a prediction cache, storing previously seen states and their predicted values.

4.2 Replay Buffer

The replay buffer holds the 2,300,000 most recently collected data points. Every data point consists of a state s, a value z (or v′) and a policy π. As a new training session begins, these samples are pre-processed and converted into supervised training samples before being passed to the trainer.

Figure 4.1: The size of the replay buffer increases over time during the first 35 generations.

Oracle Developers suggests that it is beneficial to phase out early training samples since these contain less valuable information[29]. These early samples run the risk of holding the learning procedure back, forcing the network to remember early state estimates and policies. To counter this, the buffer size is linearly increased for the first 35 model generations (see figure 4.1), until reaching its maximum size.
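A possible form of this schedule is sketched below; the text above only states that the buffer grows linearly over the first 35 generations up to 2,300,000 data points, so the exact ramp is an assumption.

def replay_buffer_capacity(generation, max_size=2_300_000, warmup_generations=35):
    """Linearly grow the replay buffer during the first generations (figure 4.1)
    so that early, low-quality samples are phased out faster."""
    if generation >= warmup_generations:
        return max_size
    return int(max_size * (generation + 1) / warmup_generations)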

4.3 Pre-processing Data

To further improve the learning rate, pre-processing is performed by grouping data points with matching states. A single training sample is created for every group of states by calculating the average policy and evaluation label. This speeds up learning as the information contained in the replay buffer is condensed into fewer training samples.
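A minimal sketch of this grouping step is shown below, assuming states are hashable (for instance, byte strings of the board); the data layout is an assumption made for illustration.

from collections import defaultdict
import numpy as np

def merge_duplicate_states(data_points):
    """Group data points with identical states and average their policy and
    value labels, condensing the buffer into fewer training samples."""
    groups = defaultdict(list)
    for state, policy, value in data_points:
        groups[state].append((np.asarray(policy, dtype=float), value))

    merged = []
    for state, labels in groups.items():
        mean_policy = np.mean([p for p, _ in labels], axis=0)
        mean_value = float(np.mean([v for _, v in labels]))
        merged.append((state, mean_policy, mean_value))
    return merged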

4.4 Supervised Training

A new version of the network is created by applying stochastic gradient descent to the current network. Two full epochs are performed on every training sample generated from the preprocessing phase, with the momentum parameter fixed to 0.9. The learning rate was fixed to 0.005 throughout the whole learning process.

4.5 Network Architecture

The network uses an almost identical architecture to that described in the original AlphaGo Zero paper, although a bit smaller (see Appendix A). This decrease is applied in order to lower the computation time and to scale the model size to the less complex domain. To further improve performance, half-precision inference optimization was applied to the network, compressing the model into working with half of the original float32 precision. This decrease in computational precision entails a speedup when performing predictions but can result in lower-quality outputs. In order to counter this potential decrease in prediction accuracy, the number of filters in the respective policy and evaluation heads was increased to 32.

The main body of the network consists of 12 residually stacked blocks, each containing 2 convolutional transformations with 128 filters. The main body then connects to a policy head and an evaluation head, resulting in a model containing 5,928,264 trainable parameters.

4.6 State Representation

The features of a board state are represented as a 7x6x5 image, where every layer of the image helps describe the existence of a particular feature (figure 4.2). The first three slices together denote the current state of every individual cell in the game grid and should be interpreted as together forming a one-hot encoding along the depth axis. The last two slices are used to indicate which player is next to make a move, by setting all the values in a layer to one only if it is the corresponding player's turn.

Figure 4.2: The 5 feature slices of a board where it is Red's turn to play.
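A sketch of how such a representation can be built is given below. The cell values (0 = empty, 1 = red, 2 = yellow) and the channel ordering are assumptions made for illustration; the text above only specifies the 7x6x5 shape and the meaning of the slices.

import numpy as np

def encode_state(board, player_to_move):
    """Build a 7x6x5 feature tensor from a 7x6 board array: channels 0-2
    one-hot encode each cell, channels 3-4 are filled with ones for the
    player whose turn it is."""
    planes = np.zeros((7, 6, 5), dtype=np.float32)
    planes[:, :, 0] = (board == 1)
    planes[:, :, 1] = (board == 2)
    planes[:, :, 2] = (board == 0)
    planes[:, :, 3 if player_to_move == 1 else 4] = 1.0
    return planes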

4.7 Experiment Setup

The majority of the code is written in Python 3.7 using TensorFlow. Elements of the self-play are further optimized by compiling them in C++, where the 5-slice state representation is replaced with a compact bitmap representation, allowing bitwise operations to be used for efficient state computations such as terminal evaluation and move simulation.

Figure 4.2: The 5 feature slices of a board where it is Red’s turn to play.
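While the optimized version is written in C++, the idea behind the bitwise terminal evaluation can be illustrated in Python. The sketch below assumes the common Connect Four bitboard layout with 7 bits per column (6 playable cells plus one guard bit), which is not necessarily the exact layout used in the C++ code.

```python
def has_four_in_a_row(player_bitboard):
    """Check whether a player's bitboard contains four connected stones."""
    # With 7 bits per column, neighbouring cells differ by a shift of
    # 1 (vertical), 7 (horizontal), and 6 or 8 (the two diagonals).
    for shift in (1, 7, 6, 8):
        pairs = player_bitboard & (player_bitboard >> shift)  # marks two-in-a-row
        if pairs & (pairs >> (2 * shift)):                    # two overlapping pairs -> four in a row
            return True
    return False
```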

Additionally, the self-play phase scales well with additional hardware, allowing the computations to be distributed across distinct machines. A total of 48 CPU cores powered the 10 self-play workers, which each had their own dedicated GPU (7 GTX 1080 Ti, 3 GTX 1080). The supervised training phase shared 40 of these CPUs but was given its own dedicated GTX 1080 Ti.

4.8 Evaluating Agents

During and after learning, it is beneficial to be able to evaluate how well the model is performing. In games that are not yet solved, such as Go or Chess, providing such a metric is difficult; evaluation instead relies on previous work, measuring how well the trained model performs against existing programs. Connect Four, on the other hand, is already solved, so an accurate evaluation of a decision can be formed. Thanks to the existence of a solver, agents can be evaluated using data created by the solver.

4.8.1 Generalization Performance

Using the solver, a dataset has been generated consisting of 10,000 unique states, each sampled from random play. For each state, the solver provides labels in the form of the theoretical winner under optimal play, an action policy representing the theoretical outcome of each move available in that state, and a measurement of how many game-tree nodes the solver had to visit in order to obtain the result.

If no pruning is done, the dataset will mostly contain states that are one optimal action away from a terminal state. This skews the dataset towards artificially easy states that rarely occur in natural gameplay. To achieve a more balanced set, 75% of these states are pruned. The resulting set is still not ideal, as no further effort is made to make it closely resemble natural gameplay, nor are states in which no action is incorrect excluded. Nevertheless, it provides a high-accuracy estimate of the performance level of an agent.

The agents are evaluated by using their prediction of the outcome and optimal action of each state, then calculating the prediction accuracy against the dataset labels. These evaluations serve as an analysis of learning performance as well as a measurement of how well a final program performs on random states. However, this metric does not fully reflect player strength as a model can be capable of optimal play against an optimal opponent, whilst making mistakes when given states never reached in professional play. Consequently, this evaluation is considered a metric of domain generalization.
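A sketch of this evaluation is shown below. It assumes each dataset entry holds the set of solver-optimal moves and the theoretical winner (draws are ignored for brevity), and that the agent exposes a hypothetical predict method returning a policy and a value where values above 0.5 favour player one; these interfaces are assumptions.

```python
import numpy as np

def generalization_accuracy(agent, dataset):
    """Move- and outcome-prediction accuracy against the solver-labelled dataset."""
    correct_moves = 0
    correct_outcomes = 0
    for state, optimal_moves, winner in dataset:
        policy, value = agent.predict(state)
        if int(np.argmax(policy)) in optimal_moves:
            correct_moves += 1
        predicted_winner = 1 if value > 0.5 else 2
        if predicted_winner == winner:
            correct_outcomes += 1
    n = len(dataset)
    return correct_moves / n, correct_outcomes / n
```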

4.8.2 Player Strength

As the generalization strength does not necessarily reflect an agent's capability to win games, another measurement is required to complement the generalization performance. This metric is named Player Strength and is measured by having agents play each other a number of times. One severe problem that occurs when agents face each other in multiple games is that the same game will be played repeatedly, since the agents are deterministic by nature. To avoid this, randomness is incorporated into all agents' decision making during these matches. This way, agents are evaluated both in their general understanding of the domain and in their ability to win games.

4.8.3 Agent Behavior

To further analyze the effect of modifying the learning target, the agents' behavior during self-play is measured throughout the training. The behavior measurements of interest are the average game length and the average number of terminal and non-terminal states encountered within search. This is important, as fundamental behavior differences could imply a reduced average performance. By measuring both performance and behavior, a detailed analysis of the viability of v0 is possible.

5 Results

This chapter presents the results of our research, coupled with brief explanations.

5.1 General Domain Performance

First, figure 5.1 illustrates the low variance observed between independent learning sessions, spanning generations 0 to 30. Each generation corresponds to 3000 games of self-play, where the actions are informed by 800 MCTS iterations. To put the numbers into perspective, the accuracies of Minimax agents of depth five and ten are included.

Figure 5.1: The action prediction accuracy of the deep neural networks trained using the two different state value learning targets, evaluated against the solved dataset each generation. a, the learning target z was used in two independent learning sessions. b, the learning target v0 was used in two independent learning sessions.


The following figures only show results from the top-performing candidates for z and v0. Figure 5.2 compares the learning performance of the deep neural networks trained using z and v0. Minimax Depth-10 reaches an accuracy of 94.28%, whilst the v0 network achieves an accuracy of 94.84%.

Figure 5.2: The action prediction accuracy of the deep neural networks trained using the two different state value learning targets, evaluated against the solved dataset.

Figure 5.3 depicts how MCTS steadily increases the accuracy throughout generations. Furthermore, it illustrates the strength of MCTS not only via the high accuracy numbers but also by showing that it surpasses the Minimax agents in the early stages of training. The v0 variant with MCTS reaches an accuracy of 97.47%.

In contrast, figure 5.4 illustrates the error rates of the final models only, set side by side with their corresponding variants enhanced by MCTS. In addition, the second plot compares the performance of v0 against Minimax searching ten actions ahead, boosted by a heuristic. The solved dataset has been divided into subsets representing states of increasing complexity. The first subset consists of states that are one optimal action away from a win, with rapid difficulty increments over the following subsets.

Figure 5.3: The action prediction accuracy of the deep neural networks coupled with MCTS, trained using the two different state value learning targets, evaluated against the solved dataset.

Figure 5.4: Both charts show the error rates for datasets of increasing difficulty. The x-axis labels are the average number of nodes the solver had to explore to solve the states. a, The error rates of both models (generation 30) along with their MCTS variants, evaluated against the solved dataset divided into four subsets of increasing difficulty. b, The error rates of the z model have been removed and replaced by those of Minimax Depth-10.

5.2 Player Strength Comparison

Figure 5.5 shows how the agents Minimax Depth-10, z, and v0 perform in different matchups. The Minimax algorithm was set to uniformly sample from the moves that it found to have equal action values. The network actions were sampled from a distribution over legal moves based on the policy predicted by the network. To avoid excessively crippling the network predictions, policy certainty was increased by squaring the policy (p̂²) and renormalizing, which further suppresses the probability of moves that were already unlikely.
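The sampling procedure can be sketched as follows; the renormalization step and the names are illustrative details rather than exact implementation facts.

```python
import numpy as np

def sample_move(policy, legal_moves):
    """Sample a move from the squared and renormalized network policy over legal moves."""
    sharpened = np.zeros_like(policy)
    sharpened[legal_moves] = policy[legal_moves] ** 2   # squaring suppresses unlikely moves
    sharpened /= sharpened.sum()
    return int(np.random.choice(len(policy), p=sharpened))
```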

Figure 5.5: The result of playing 100 games for each matchup, depicted in the order (P1 Win - Draw - P2 Win). The Minimax agent with search depth 10 uses a handcrafted heuristic for move decisions, while z and v0 sample moves from a distribution based on the raw network policy.

5.3 Behavior Over Generations

Figure 5.6 indicates that games get longer as training progresses. Moreover, it shows that there is no fundamental difference between z and v0 regarding game length.

Figure 5.6: The average game length over 1000 games of self-play for each generation.

Figure 5.7 depicts the number of terminal states encountered on average in an MCTS search with 800 iterations. An increase can be seen for the first generations, then after around three generations it steadily decreases toward zero.

Figure 5.7: The average number of unique terminal states encountered in an MCTS search of 800 iterations, throughout the generations. For each generation, the average value is calculated over 1000 games played with one 800-iteration search per action.

Lastly, figure 5.8 shows that the number of unique states encountered in each search decreases through the generations, diving from an average of roughly 350 to around 100 unique states. Furthermore, v0 seems to encounter a slightly wider variety of states in the later stages of training.

Figure 5.8: The average number of unique non-terminal states encountered in an MCTS search of 800 iterations, throughout the generations.

6 Discussion

Before diving into the discussion and conclusions, we emphasize that the elegance of the AlphaZero algorithm can give the impression that implementation and computation are equally simple. However, considering the computational burden that the algorithm entails, a parallel implementation is necessary. This introduces further complications, as the program becomes prone to concurrency bugs and other communication errors.

To help researchers willing to follow up on the concepts proposed here, we release all of our code and results, see Appendix C. Great efforts went into replicating the original algorithm as closely as possible, with the help of not only the original papers but also the blog post written by Oracle Developers [29]. Although reasonable results have been achieved, we cannot be fully certain that our implementation is free of bugs.

6.1 General Domain Performance

The variance between the different learning sessions is very small and the overall pattern is close to identical. Considering the cost and time needed to complete 30 full learning generations, we therefore base our analyses and conclusions on the best v0 and z candidates. Further quantitative measurements are needed to fully validate this choice, but since no significant deviation was found during this study, we still reason from the results obtained.

In figure 5.2 we can see that the learning performance of z and v0 is similar, with v0 performing ever so slightly better. This can be due to variance in training and should be tested over a number of training sessions before drawing any firm conclusions. However, what can be concluded is that v0 is not to be rejected as a training target.

From figure 5.4, it is evident that v0 can be used as a feasible training target, showing great resemblance to z. We see that increasing difficulty implies increasing error rates, except for the easiest subset.

The easiest subset is speculated to contain states rarely encountered during self-play, which are included in the dataset only because the dataset states were generated from random gameplay. This would likely decrease AlphaZero's performance, as it is then tested on states it has not been trained on. One could consider this unfair, since AlphaZero would probably not encounter these states during actual gameplay either. A possible way to create a more representative dataset could be to give the random agents that generate it the ability to win the game whenever they are one action from victory. However, we see that MCTS compensates for this flaw effectively. Additionally, figure 5.3 shows that the search acts as a general improvement operator throughout the generations.

6.2 Player Strength Comparison

Looking at figure 5.5, it is notable that although the Minimax algorithm was set to select randomly only among its top moves, it was decisively outperformed. It is unclear how much the added randomness handicaps the network predictions, but unlike for the Minimax algorithm, this randomness allows suboptimal moves to be selected.

Since Minimax achieves perfect predictions on the easier states (figure 5.4), it obtains a quite high generalization score. However, considering that v0 clearly outperforms Minimax despite scoring far worse on the easier states, it is clear that overall generalization does not fully reflect player strength.

6.3 Behavior Over Generations

To gain a deeper understanding of how the agents with different learning targets learn, we analyzed how their behavior changes throughout the generations. This is particularly important to investigate before concluding whether v0 is viable. Not only is it interesting to see whether they vary in behavior, but also whether one would prove to be more stable.

Figure 5.6 shows that the average game length is similar for both v0 and z over the generations. Moreover, figures 5.7 and 5.8 also seem to indicate that the behaviors of the two variants are close to equivalent. We see that the number of unique terminal states visited is roughly the same. The same seems to be true for the number of non-terminal states as well, with v0 exploring a slightly wider variety of states. This could be correlated with the performance, as we saw a similar advantage for v0 in the learning curves (figures 5.2 and 5.3). However, it is difficult to draw any major conclusions from the fact that v0 achieved slightly higher performance, since we have not performed a statistically significant number of experiments. Nevertheless, what can once again be concluded is that nothing points toward v0 not being a viable learning target.

6.4 Limitations

The biggest limitations of this study are that it only explores a single domain and provides a small number of tests for that domain. The lack of quantitative data makes any definitive conclusion doubtful. Although the few collected tests displayed little deviation regarding the convergence rate, more tests are needed for assurance.

Limitations in hardware and time made exploration of additional domains impossible, as training an agent for 30 generations took roughly 16 hours using 10 GPU workers. A majority of these GPUs were only available for a limited amount of time, forcing us to execute only a handful of qualitative experiments. The vast number of possible hyperparameter settings forced us to devote large portions of the granted GPU time to an initial hyperparameter search.

As we are yet to run tests on different hyperparameter combinations, much is left unanswered. Although the used parameters were selected because they seemed to yield good results for the original method, there is no assurance that this is the best combination. The fact that v0 outperformed z with these parameters does therefore not necessarily generalize. It is possible that the selected parameters happened to suit v0 better and that another set of parameters would reverse the outcome.

6.5 Future Research

Since we have only explored the game Connect Four, it would be interesting to see whether similar results would be achieved in other games. It would be particularly interesting to test v0 in Go and Chess, to verify our reasoning about how the sparsity of rewards and the use of v0 interact. In addition, it would be interesting to see how v0 performs in a domain with a complexity similar to Connect Four, but with sparse rewards, e.g. Othello.

Moreover, as both z and v0 seem to generate strong players, it could be beneficial to combine the two in different domains, for instance by taking the mean of the two, or by using a falloff weighting as done in the blog post by Oracle Developers [29]. It intuitively makes sense that a combination of the two could very well be the optimal evaluation label for some domains.

We strongly encourage exploration of the proposed AlphaZero extension, Additional Tree Sampling. This could potentially greatly enhance data utilization, making the elegant AlphaZero approach more viable.

7 Conclusions

This paper has proposed an extension to the AlphaZero algorithm named Additional Tree Sampling and provided theoretical arguments for the validity of this approach. The first premise for Additional Tree Sampling is the possibility of changing the method used for generating state evaluation labels. Therefore, an empirical study was conducted to investigate the effects of altering the evaluation target. This study used the game Connect Four to analyze the effects of altering AlphaZero's evaluation target to instead use the augmented state value estimation given by MCTS. Our results showed very little deviation from the original learning target, with similar behavior in all analyzed aspects. Although this study is based on a single environment and lacks a statistically significant number of experiments, the results strongly indicate that this alteration is a fully viable approach for certain domains. Given that changing the learning target does not cripple the convergence rate, this clears the path for the proposed extension, Additional Tree Sampling, for which we encourage further research.

References

[1] D. Silver et al. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”. In: (2017).
[2] D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016), pp. 484–489.
[3] Stockfish, Strong open source chess engine. 2008. url: https://stockfishchess.org/ (visited on 05/11/2019).
[4] G. Tesauro. “Temporal Difference Learning and TD-Gammon”. In: Communications of the ACM 38(3) (1995), pp. 58–68.
[5] M. Campbell et al. “Deep Blue”. In: Artificial Intelligence 134(1-2) (2002), pp. 57–83.
[6] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1994.
[7] C. Shannon. “Programming a Computer for Playing Chess”. In: Philosophical Magazine, 7th ser., 41(314) (1950), pp. 57–83.
[8] L. V. Allis. “Searching for Solutions in Games and Artificial Intelligence”. PhD thesis. 1994.
[9] D. Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550 (2017), pp. 354–359.
[10] D. Pathak et al. “Curiosity-driven Exploration by Self-supervised Prediction”. In: (2017).
[11] S. Gelly and Y. Wang. “Exploration exploitation in Go: UCT for Monte-Carlo Go”. In: (2006).
[12] C. Atkeson and J. Santamaria. “A Comparison of Direct and Model-Based Reinforcement Learning”. In: International Conference on Robotics and Automation. IEEE Press, 1997, pp. 3557–3564.


[13] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 1998.
[14] K. Kawaguchi. “Generalization in Deep Learning”. In: (2017).
[15] K. Hornik. “Approximation capabilities of multilayer feedforward networks”. In: Neural Networks 4(2) (1991), pp. 251–257.
[16] H. Mhaskar et al. “When and Why Are Deep Networks Better than Shallow Ones?” In: (2017).
[17] C. Zhang et al. “Understanding deep learning requires rethinking generalization”. In: (2017).
[18] G. Hinton et al. “Learning representations by back-propagating errors”. In: Nature 323 (1986), pp. 533–536.
[19] M. Fadaee et al. “Data Augmentation for Low-Resource Neural Machine Translation”. In: (2017).
[20] L. Taylor and G. Nitschke. “Improving Deep Learning using Generic Data Augmentation”. In: (2017).
[21] R. Coulom. “Efficient selectivity and backup operators in Monte-Carlo tree search”. In: Proceedings Computers and Games 2006. Springer-Verlag, 2006.
[22] R. Sutton. The Bitter Lesson. 2019. url: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (visited on 05/12/2019).
[23] G. Marcus. “Innateness, AlphaZero, and Artificial Intelligence”. In: (2018).
[24] D. Balduzzi et al. “The Shattered Gradients Problem: If resnets are the answer, then what is the question?” In: (2018).
[25] K. He et al. “Deep Residual Learning for Image Recognition”. In: (2015).
[26] J. Tromp. John’s Connect Four Playground. url: https://tromp.github.io/c4/c4.html (visited on 05/12/2019).
[27] D. Rolnick et al. “Deep Learning is Robust to Massive Label Noise”. In: (2017).
[28] A. Neelakantan et al. “Adding Gradient Noise Improves Learning for Very Deep Networks”. In: (2015).
[29] Oracle Developers. Lessons From Implementing AlphaZero. 2018. url: https://medium.com/oracledevs/lessons-from-implementing-alphazero-7e36e9054191 (visited on 05/13/2019).

Appendix A

The network used in the empirical test provided for Connect Four

The state representation is passed into a convolutional layer followed by a residual tower consisting of 12 blocks and finally splits into a policy head and an evaluation head. The following operations are performed in the first convolutional layer:

• A convolution of 128 filters, kernel size 4x4 and a stride of 1

• Rectified Nonlinearity

Each residual block thereafter performs the following operations:

• A convolution of 128 filters, kernel size 4x4 and a stride of 1

• Batch Normalization

• Rectified Nonlinearity

• A convolution of 128 filters, kernel size 4x4 and a stride of 1

• Batch Normalization

• Skip connection adding the input from previous blocks.

• Rectified Nonlinearity

The policy head thereafter outputs 7 normalized values, representing the move probabilities for the given state, after the following operations:

• A convolution of 32 filters, kernel size 4x4 and a stride of 1


• Batch Normalization

• Rectified Nonlinearity

• Fully connected layer with a size of 7 using the softmax activation function.

The evaluation head thereafter outputs a single scalar value describing the certainty of the first player winning, after the following operations:

• A convolution of 32 filters, kernel size 4x4 and a stride of 1

• Batch Normalization

• Rectified Nonlinearity

• Fully connected layer with a size of 1 using the sigmoid activation function.

Appendix B

Played Games Against an Optimal Player

Figure B.1: Game recordings of AlphaZero playing against an optimal player as player one, using only raw network evaluations.

Figure B.2: Game recordings of AlphaZero playing against an optimal player as player one, using 800 MCTS iterations for each move.

Appendix C

GitHub: Implementation & Pre-trained Agents https://github.com/FreddeFrallan/AlphaHero
