Improving AlphaZero Using Monte-Carlo Graph Search

Total Pages: 16

File Type: PDF, Size: 1020 KB

Proceedings of the Thirty-First International Conference on Automated Planning and Scheduling (ICAPS 2021)

Improving AlphaZero Using Monte-Carlo Graph Search

Johannes Czech¹, Patrick Korus¹, Kristian Kersting¹,²,³
¹ Department of Computer Science, TU Darmstadt, Germany
² Centre for Cognitive Science, TU Darmstadt, Germany
³ Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
[email protected], [email protected], [email protected]

Abstract

The AlphaZero algorithm has been successfully applied in a range of discrete domains, most notably board games. It utilizes a neural network that learns a value and policy function to guide the exploration in a Monte-Carlo Tree Search. Although many search improvements such as graph search have been proposed for Monte-Carlo Tree Search in the past, most of them refer to an older variant of the Upper Confidence bounds for Trees algorithm that does not use a policy for planning. We improve the search algorithm for AlphaZero by generalizing the search tree to a directed acyclic graph. This enables information flow across different subtrees and greatly reduces memory consumption. Along with Monte-Carlo Graph Search, we propose a number of further extensions, such as the inclusion of ε-greedy exploration, a revised terminal solver and the integration of domain knowledge as constraints. In our empirical evaluations, we use the CrazyAra engine on chess and crazyhouse as examples to show that these changes bring significant improvements to AlphaZero.

[Figure 1: It is possible to obtain the King's Knight Opening (A) with different move sequences (B, C). Trajectories in bold are the most common move order to reach the final position. As one can see, graphs are a much more concise representation.]

Introduction

The planning process of most humans for discrete domains is said to resemble the AlphaZero Monte-Carlo Tree Search (MCTS) variant (Silver et al. 2017), which uses a guidance policy as its prior and an evaluation function to distinguish between good and bad positions (Hassabis et al. 2017). The human reasoning process, however, may not be as stoic and structured as a search algorithm running on a computer, and humans are typically not able to process as many chess board positions in a given time. Nevertheless, humans are quite good at making valid connections between different subproblems and at reusing intermediate results for other subproblems. In other words, humans not only look at a problem step by step but are also able to jump between sequences of moves, so-called trajectories; they seem to have some kind of global memory buffer to store relevant information. This gives a crucial advantage over a traditional tree search, which does not share information between different trajectories, although an identical state may occur.

Triggered by this intuition, the search tree is generalized to a Directed Acyclic Graph (DAG), yielding Monte-Carlo Graph Search (MCGS) (Saffidine, Cazenave, and Méhat 2012). The search in the DAG follows the scheme of the Upper Confidence bounds for Trees (UCT) algorithm (Auer, Cesa-Bianchi, and Fischer 2002), but employs a modified forward and backpropagation procedure to cope with the graph structure. Figure 1 illustrates how nodes with more than one parent, so-called transposition nodes, allow information to be shared between different subtrees and vastly improve memory efficiency.
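The transposition idea above is easiest to see as a node store keyed by position rather than by move path. The following is a minimal sketch of such a data structure, assuming a hashable position key (e.g. a Zobrist hash); it is an illustration, not the paper's implementation, and the class and key names are my own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Statistics shared by every trajectory that reaches this position."""
    value_sum: float = 0.0
    visits: int = 0
    children: dict = field(default_factory=dict)  # move -> key of the child position


class TranspositionDAG:
    """Positions are stored once, keyed by a transposition key, and reused
    across move orders, which turns the search tree into a DAG."""

    def __init__(self):
        self.nodes = {}  # position key (e.g. Zobrist hash or FEN) -> Node

    def get_or_create(self, position_key):
        # Move sequences that transpose into the same position return the
        # identical Node instance, so their statistics are shared.
        if position_key not in self.nodes:
            self.nodes[position_key] = Node()
        return self.nodes[position_key]


dag = TranspositionDAG()
# 1.e4 e5 2.Nf3 Nc6 and 1.Nf3 Nc6 2.e4 e5 reach the same position (cf. Figure 1),
# so both trajectories map to one node; "kings_knight_opening" stands in for a
# real position key such as a Zobrist hash.
assert dag.get_or_create("kings_knight_opening") is dag.get_or_create("kings_knight_opening")
```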
More importantly, a graph search reduces the number of neural network evaluations that normally take place when the search reaches a leaf node. Together, MCGS can result in a substantial performance boost, both for a fixed time constraint and for a fixed amount of evaluations. This is demonstrated in an empirical evaluation on the example of chess and its variant crazyhouse.

Please note that we intentionally keep our focus here on the planning aspect of AlphaZero. We provide a novel planning algorithm that is based on DAGs and comes with a number of additional enhancements. Specifically, our contributions for improving the search of AlphaZero are as follows:

1. Transforming the search tree of the AlphaZero framework into a DAG and providing a backpropagation algorithm which is stable both for low and high simulation counts.
2. Introducing a terminal solver to make optimal choices in the tree or search graph, in situations where outcomes can be computed exactly.
3. Combining the UCT algorithm with ε-greedy search, which helps UCT to escape local optima.
4. Using Q-value information for move selection, which helps to switch to the second best candidate move faster if there is a gap in their Q-values.
5. Adding constraints to narrow the search to events that are important with respect to the domain. In the case of chess these are trajectories that include checks, captures and threats.

We proceed as follows. We start off by discussing related work. Then, we give a quick recap of the PUCT algorithm (Rosin 2011), the variant of UCT used in AlphaZero, and explain our realisation of MCGS and each enhancement individually. Before concluding, we touch upon our empirical results. We explicitly exclude Reinforcement Learning (RL) and use neural network models with pre-trained weights instead. However, we give insight into the stability of the search modifications by performing experiments under different simulation counts and time controls.

Related Work

There exists a lot of prior work on improving UCT search, such as the Rapid Action Value Estimation (RAVE) heuristic (Gelly and Silver 2011), move ordering and parallelization regimes. A comprehensive overview of these techniques is given by Browne et al. (2012). However, these earlier techniques focus on a version of UCT which only relies on a value without a policy approximator. Consequently, some of these extensions became obsolete in practice, providing an insignificant improvement or even deteriorating the performance. Each of the proposed enhancements also increases complexity, and most require a full re-evaluation when they are used for PUCT combined with a deep neural network.

Saffidine, Cazenave, and Méhat (2012) proposed to use the UCT algorithm with a DAG and suggested an adapted UCT move selection formula (1). It selects the next move a_t, with additional hyperparameters (d1, d2, d3) ∈ N³, by

    a_t = argmax_a [ Q_{d1}(s_t, a) + U_{d2,d3}(s_t, a) ],    (1)

where

    U_{d2,d3}(s_t, a) = c_puct · sqrt( log Σ_b N_{d2}(s_t, b) / N_{d3}(s_t, a) ).    (2)

The values for d1, d2 and d3 relate to the respective depth and were chosen either to be 0, 1, 2 or ∞. Their algorithm was tested on several environments, including a toy environment called LEFTRIGHT, the board game HEX and a 6 × 6 GO board using 100, 1000 and 10 000 simulations, respectively. Their results were mixed. Depending on the game, a different hyperparameter constellation performed best, and sometimes even the default UCT formula, corresponding to (d1 = 0, d2 = 0, d3 = 0), achieved the highest score.
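As a rough illustration of the selection rule in equations (1) and (2), the sketch below assumes the depth-dependent statistics Q_{d1}, N_{d2} and N_{d3} have already been aggregated into per-action dictionaries; the handling of unvisited actions and the example numbers are my own assumptions, not those of Saffidine, Cazenave, and Méhat.

```python
import math


def select_move(q_d1, n_d2, n_d3, c_puct=1.0):
    """Adapted UCT selection (eqs. 1-2): a_t = argmax_a Q_d1(s,a) + U_d2,d3(s,a).

    q_d1, n_d2, n_d3: dicts mapping each legal action to the Q-value and visit
    counts aggregated at depths d1, d2 and d3 respectively (assumed precomputed).
    """
    total_d2 = sum(n_d2.values())

    def exploration(a):
        if n_d3[a] == 0:          # unvisited action: give it an infinite bonus
            return math.inf
        return c_puct * math.sqrt(math.log(total_d2) / n_d3[a])

    return max(q_d1, key=lambda a: q_d1[a] + exploration(a))


# Toy usage with two actions; the statistics are made up for illustration.
print(select_move(q_d1={"e4": 0.55, "Nf3": 0.52},
                  n_d2={"e4": 30, "Nf3": 20},
                  n_d3={"e4": 30, "Nf3": 20}))
```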
Choe and Kim (2019) and Bauer, Patten, and Vincze (2019) build upon the approach of Saffidine, Cazenave, and Méhat (2012). More specifically, Choe and Kim (2019) modified the selection formula (2) while reducing the number of hyperparameters to two and applied the algorithm to the card game Hearthstone. Similarly, Bauer, Patten, and Vincze (2019) employed a tuned selection policy formula of (2) which considers the empirical variance of rewards and used it for the task of object pose verification.

Graphs were also studied as an alternative representation in a trial-based heuristic tree search framework by Keller and Helmert (2013) and further elaborated by Keller (2015). Moreover, Childs, Brodeur, and Kocsis (2008) explored different variants for exploiting transpositions in UCT. Finally, Pélissier, Nakamura, and Tabata (2019) made use of transpositions for the task of feature selection. All other directed acyclic graph search approaches have in common that they were developed and evaluated in a UCT-like framework which does not include an explicit policy distribution.

The other enhancement suggested in the present paper, namely the terminal solver, also extends an already existing concept. Chen, Chen, and Lin (2018) build upon the work by Winands, Björnsson, and Saito (2008) and presented a terminal solver for MCTS which is able to deal with drawing nodes and allows losing nodes to be pruned.

Finally, challenged by the Montezuma's Revenge environment, which is a hard exploration problem with sparse rewards, Ecoffet et al. (2021) described an algorithm called Go-Explore which remembers promising states and their respective trajectories. Subsequently, they are able both to return to these states and to robustify and optimize the corresponding trajectory. Their work only indirectly influenced this paper, but gave motivation to abandon the search tree structure and to keep a memory of trajectories in our proposed methods.

The PUCT Algorithm

The essential idea behind the UCT algorithm is to iteratively build a search tree in order to find a good choice within a given time limit (Auer, Cesa-Bianchi, and Fischer 2002). Nodes represent states s_t and edges denote actions a_t for each time step t. Consider e.g. chess: here, nodes represent board positions and edges denote legal moves that transition to a new board position. Now, each iteration of the tree search consists of three steps: first, selecting a leaf node; then expanding and evaluating that leaf node; and, finally, updating the values V(s_t) and visit counts N(s_t) on all nodes in the trajectory, from the selected leaf node up to the root node.
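The three steps just described (selection, expansion and evaluation, backpropagation of V(s_t) and N(s_t)) can be sketched as a compact loop. This is a generic illustration rather than CrazyAra's or AlphaZero's implementation; the environment callbacks and the plain UCT-style selection term (no policy prior) are assumptions made for brevity.

```python
import math


class TreeNode:
    def __init__(self):
        self.children = {}       # action -> TreeNode
        self.visits = 0          # N(s)
        self.value_sum = 0.0     # V(s) = value_sum / visits


def uct_score(parent, child, c=1.4):
    if child.visits == 0:
        return math.inf
    q = child.value_sum / child.visits
    return q + c * math.sqrt(math.log(parent.visits) / child.visits)


def run_iteration(root, root_state, legal_actions, step, evaluate):
    """One search iteration: select a leaf, expand and evaluate it, backpropagate."""
    node, state, path = root, root_state, [root]
    # 1. Selection: descend as long as the current node has expanded children.
    while node.children:
        parent = node
        action, node = max(node.children.items(),
                           key=lambda kv: uct_score(parent, kv[1]))
        state = step(state, action)
        path.append(node)
    # 2. Expansion and evaluation of the selected leaf node.
    for action in legal_actions(state):
        node.children[action] = TreeNode()
    value = evaluate(state)      # e.g. the value head of a neural network
    # 3. Backpropagation: update visit counts and values from the leaf up to
    #    the root, flipping the sign for the two-player zero-sum setting.
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value
```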
Recommended publications
  • Monte-Carlo Tree Search Solver
    Monte-Carlo Tree Search Solver

    Mark H.M. Winands¹, Yngvi Björnsson², and Jahn-Takeshi Saito¹
    ¹ Games and AI Group, MICC, Universiteit Maastricht, Maastricht, The Netherlands, {m.winands,[email protected]}
    ² School of Computer Science, Reykjavík University, Reykjavík, Iceland, [email protected]

    Abstract. Recently, Monte-Carlo Tree Search (MCTS) has advanced the field of computer Go substantially. In this article we investigate the application of MCTS for the game Lines of Action (LOA). A new MCTS variant, called MCTS-Solver, has been designed to play narrow tactical lines better in sudden-death games such as LOA. The variant differs from the traditional MCTS in respect to backpropagation and selection strategy. It is able to prove the game-theoretical value of a position given sufficient time. Experiments show that a Monte-Carlo LOA program using MCTS-Solver defeats a program using MCTS by a winning score of 65%. Moreover, MCTS-Solver performs much better than a program using MCTS against several different versions of the world-class αβ program MIA. Thus, MCTS-Solver constitutes genuine progress in using simulation-based search approaches in sudden-death games, significantly improving upon MCTS-based programs.

    1 Introduction

    For decades αβ search has been the standard approach used by programs for playing two-person zero-sum games such as chess and checkers (and many others). Over the years many search enhancements have been proposed for this framework. However, in some games where it is difficult to construct an accurate positional evaluation function (e.g., Go) the αβ approach was hardly successful.
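The solver's distinguishing change is in backpropagation: a node becomes a proven win as soon as one child is a proven loss for the opponent, and a proven loss only once every child is a proven win for the opponent. A minimal sketch of that rule in the negamax view, with field names of my own choosing rather than the authors':

```python
WIN, LOSS, UNKNOWN = 1, -1, 0


class SolverNode:
    def __init__(self, children=None):
        self.children = children or []   # positions reachable in one move
        self.proven = UNKNOWN            # proven game-theoretic value, if any


def update_proven(node):
    """MCTS-Solver style backpropagation step for one node: a child that is a
    proven LOSS (for the side to move there) makes this node a proven WIN;
    only if every child is a proven WIN is this node a proven LOSS."""
    if any(child.proven == LOSS for child in node.children):
        node.proven = WIN
    elif node.children and all(child.proven == WIN for child in node.children):
        node.proven = LOSS
    return node.proven
```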
  • αβ-Based Play-Outs in Monte-Carlo Tree Search
    αβ-based Play-outs in Monte-Carlo Tree Search

    Mark H.M. Winands, Yngvi Björnsson

    Abstract— Monte-Carlo Tree Search (MCTS) is a recent paradigm for game-tree search, which gradually builds a game-tree in a best-first fashion based on the results of randomized simulation play-outs. The performance of such an approach is highly dependent on both the total number of simulation play-outs and their quality. The two metrics are, however, typically inversely correlated — improving the quality of the play-outs generally involves adding knowledge that requires extra computation, thus allowing fewer play-outs to be performed per time unit. The general practice in MCTS seems to be more towards using relatively knowledge-light play-out strategies for the benefit of getting additional simulations done. In this paper we show, for the game Lines of Action (LOA), that this is not necessarily the best strategy. The newest version of our simulation-based LOA program, MC-LOAαβ, uses a

    [9]. In the context of game playing, Monte-Carlo simulations were first used as a mechanism for dynamically evaluating the merits of leaf nodes of a traditional αβ-based search [10], [11], [12], but under the new paradigm MCTS has evolved into a full-fledged best-first search procedure that replaces traditional αβ-based search altogether. MCTS has in the past couple of years substantially advanced the state-of-the-art in several game domains where αβ-based search has had difficulties, in particular computer Go, but other domains include General Game Playing [13], Amazons [14] and Hex [15]. The right tradeoff between search and knowledge equally applies to MCTS.
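The trade-off discussed above is between many random play-outs and fewer, better-informed ones. One way to make play-outs knowledge-heavy, in the spirit of MC-LOAαβ, is to pick each play-out move with a small fixed-depth αβ (negamax) search; the sketch below assumes game callbacks (legal_moves, apply, evaluate, is_terminal) and is my illustration, not the authors' program.

```python
def alphabeta(state, depth, alpha, beta, legal_moves, apply, evaluate, is_terminal):
    """Plain negamax αβ to a small fixed depth; evaluate() is side-to-move relative."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    best = float("-inf")
    for move in legal_moves(state):
        score = -alphabeta(apply(state, move), depth - 1, -beta, -alpha,
                           legal_moves, apply, evaluate, is_terminal)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                 # αβ cutoff
    return best


def playout_move(state, legal_moves, apply, evaluate, is_terminal, depth=2):
    """Pick the play-out move with the best shallow αβ score instead of at random."""
    return max(legal_moves(state),
               key=lambda m: -alphabeta(apply(state, m), depth - 1,
                                        float("-inf"), float("inf"),
                                        legal_moves, apply, evaluate, is_terminal))
```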
  • SmartK: Efficient, Scalable and Winning Parallel MCTS
    SmartK: Efficient, Scalable and Winning Parallel MCTS

    Michael S. Davinroy, Shawn Pan, Bryce Wiedenbeck, and Tia Newhall
    Swarthmore College, Swarthmore, PA 19081

    ACM Reference Format: Michael S. Davinroy, Shawn Pan, Bryce Wiedenbeck, and Tia Newhall. 2019. SmartK: Efficient, Scalable and Winning Parallel MCTS. In SC '19, November 17–22, 2019, Denver, CO. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

    1 INTRODUCTION

    Monte Carlo Tree Search (MCTS) is a popular algorithm for game playing that has also been applied to problems as diverse as auction bidding [5], program analysis [9], neural network architecture search [16], and radiation therapy design [6]. All of these domains exhibit combinatorial search spaces where even moderate-sized problems can easily exhaust computational resources, motivating the need for parallel approaches. SmartK is our novel parallel MCTS algorithm that achieves high performance on large clusters, by adaptively partitioning the search space to efficiently utilize

    processes. Each process constructs a distinct search tree in parallel, communicating only its best solutions at the end of a number of rollouts to pick a next action. This approach has low communication overhead, but significant duplication of effort across processes.

    Like root parallelization, our SmartK algorithm achieves low communication overhead by assigning independent MCTS tasks to processes. However, SmartK dramatically reduces search duplication by starting the parallel searches from many different nodes of the search tree. Its key idea is it uses nodes' weights in the selection phase to determine the number of processes (the amount of computation) to allocate to exploring a node, directing more computational effort to more promising searches.

    Our experiments with an MPI implementation of SmartK for playing Hex show that SmartK has a better win rate and scales better than other parallelization approaches.
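SmartK's key idea, as described above, is to hand out processes to nodes in proportion to their selection-phase weights. A small sketch of such an allocation step; the weight definition, tie-breaking and rounding scheme are assumptions made for illustration, not taken from the paper.

```python
def allocate_processes(weights, num_processes):
    """Split num_processes across child nodes proportionally to their weights.

    weights: dict mapping child id -> non-negative selection weight.
    Returns a dict child id -> number of processes (summing to num_processes).
    """
    total = sum(weights.values())
    if total == 0:
        # No information yet: spread processes round-robin over the children.
        ids = list(weights)
        return {cid: num_processes // len(ids) + (i < num_processes % len(ids))
                for i, cid in enumerate(ids)}
    shares = {cid: num_processes * w / total for cid, w in weights.items()}
    alloc = {cid: int(s) for cid, s in shares.items()}
    # Hand out the remaining processes to the largest fractional remainders.
    leftover = num_processes - sum(alloc.values())
    for cid in sorted(shares, key=lambda c: shares[c] - alloc[c], reverse=True)[:leftover]:
        alloc[cid] += 1
    return alloc


print(allocate_processes({"a": 3.0, "b": 1.0, "c": 1.0}, 8))  # e.g. {'a': 5, 'b': 2, 'c': 1}
```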
  • Monte-Carlo Tree Search
    Monte-Carlo Tree Search

    DISSERTATION to obtain the degree of Doctor at Maastricht University, on the authority of the Rector Magnificus, Prof. mr. G.P.M.F. Mols, in accordance with the decision of the Board of Deans, to be defended in public on Thursday, 30 September 2010, at 14:00 hours, by Guillaume Maurice Jean-Bernard Chaslot.

    Promotor: Prof. dr. G. Weiss. Copromotor: Dr. M.H.M. Winands, Dr. B. Bouzy (Université Paris Descartes), Dr. ir. J.W.H.M. Uiterwijk. Members of the assessment committee: Prof. dr. ir. R.L.M. Peeters (chair), Prof. dr. M. Müller (University of Alberta), Prof. dr. H.J.M. Peters, Prof. dr. ir. J.A. La Poutré (Universiteit Utrecht), Prof. dr. ir. J.C. Scholtes.

    The research has been funded by the Netherlands Organisation for Scientific Research (NWO), in the framework of the project Go for Go, grant number 612.066.409. Dissertation Series No. 2010-41. The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. ISBN 978-90-8559-099-6. Printed by Optima Grafische Communicatie, Rotterdam. Cover design and layout finalization by Steven de Jong, Maastricht. © 2010 G.M.J-B. Chaslot. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronically, mechanically, photocopying, recording or otherwise, without prior permission of the author.

    Preface. When I learnt the game of Go, I was intrigued by the contrast between the simplicity of its rules and the difficulty of building a strong Go program.
  • Grid Computing for Artificial Intelligence
    Grid Computing for Artificial Intelligence

    Yuya Dan, Matsuyama University, Japan

    1. Introduction

    This chapter is concerned with grid computing for artificial intelligence systems. In general, grid computing enables us to process a huge amount of data in any field of discipline. Scientific computing, calculation of the accurate value of π, the search for Mersenne prime numbers, analysis of protein molecular structure, weather dynamics, data analysis, business data mining and simulation are examples of applications of grid computing. As is well known, supercomputers have very high performance in calculation; however, it is quite expensive to use them for a long time. On the other hand, grid computing using idle resources on the Internet may run at a reasonable cost.

    Shogi is a traditional game involving two players in Japan, similar to chess, but more complicated than chess in its rules of play. Figure 1 shows the Shogi set and the initial state of the 40 pieces on a 9 x 9 board. According to game theory, both chess and Shogi are two-person zero-sum games with perfect information. They are not only popular games but also a target of research in artificial intelligence.

    [Fig. 1. Picture of a Shogi set including 40 pieces on a 9 x 9 board. There are the corresponding king, rook, knight, pawn, and other pieces. Six kinds of pieces can change their role, like pawns in chess, when they reach the opposite territory.]

    The information systems for Shogi need to process astronomical combinations of possible positions to move. It is described by Dan (4) that the grid systems for Shogi are effective in reducing computational complexity with collaboration from computational resources on the Internet.
  • CS234 Notes - Lecture 14 Model Based RL, Monte-Carlo Tree Search
    CS234 Notes - Lecture 14: Model Based RL, Monte-Carlo Tree Search

    Anchit Gupta, Emma Brunskill
    June 14, 2018

    1 Introduction

    In this lecture we will learn about model based RL and simulation based tree search methods. So far we have seen methods which attempt to learn either a value function or a policy from experience. In contrast, model based approaches first learn a model of the world from experience and then use this for planning and acting. Model-based approaches have been shown to have better sample efficiency and faster convergence in certain settings. We will also have a look at MCTS and its variants which can be used for planning given a model. MCTS was one of the main ideas behind the success of AlphaGo.

    [Figure 1: Relationships among learning, planning, and acting]

    2 Model Learning

    By model we mean a representation of an MDP <S, A, R, T, γ> parametrized by η. In the model learning regime we assume that the state and action space S, A are known and typically we also assume conditional independence between state transitions and rewards, i.e.

        P[s_{t+1}, r_{t+1} | s_t, a_t] = P[s_{t+1} | s_t, a_t] P[r_{t+1} | s_t, a_t].

    Hence learning a model consists of two main parts: the reward function R(·|s, a) and the transition distribution P(·|s, a). Given a set of real trajectories {S_t^k, A_t^k, R_t^k, ..., S_T^k}_{k=1}^K, model learning can be posed as a supervised learning problem. Learning the reward function R(s, a) is a regression problem whereas learning the transition function P(s'|s, a) is a density estimation problem.
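A count-based tabular model is the simplest instance of the supervised view described above: regression for R(s, a) reduces to a sample mean, and density estimation for P(s'|s, a) to normalized transition counts. The sketch below assumes discrete states and actions; the names are illustrative, not from the lecture notes.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimates of P(s'|s,a) and R(s,a) from observed transitions."""

    def __init__(self):
        self.transition_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> s' -> count
        self.reward_sums = defaultdict(float)                           # (s,a) -> total reward
        self.visits = defaultdict(int)                                  # (s,a) -> visit count

    def update(self, s, a, r, s_next):
        self.transition_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r
        self.visits[(s, a)] += 1

    def reward(self, s, a):
        """Maximum-likelihood estimate of R(s,a): the sample mean of observed rewards."""
        return self.reward_sums[(s, a)] / max(self.visits[(s, a)], 1)

    def transition_probs(self, s, a):
        """Maximum-likelihood estimate of P(.|s,a) from normalized counts."""
        counts = self.transition_counts[(s, a)]
        total = sum(counts.values())
        return {s_next: c / total for s_next, c in counts.items()}


model = TabularModel()
for (s, a, r, s_next) in [(0, "right", 0.0, 1), (0, "right", 0.0, 1), (0, "right", 1.0, 2)]:
    model.update(s, a, r, s_next)
print(model.reward(0, "right"), model.transition_probs(0, "right"))
```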
  • An Analysis of Monte Carlo Tree Search
    Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

    An Analysis of Monte Carlo Tree Search

    Steven James,∗ George Konidaris,† Benjamin Rosman∗‡
    ∗ University of the Witwatersrand, Johannesburg, South Africa
    † Brown University, Providence RI 02912, USA
    ‡ Council for Scientific and Industrial Research, Pretoria, South Africa
    [email protected], [email protected], [email protected]

    Abstract

    Monte Carlo Tree Search (MCTS) is a family of directed search algorithms that has gained widespread attention in recent years. Despite the vast amount of research into MCTS, the effect of modifications on the algorithm, as well as the manner in which it performs in various domains, is still not yet fully known. In particular, the effect of using knowledge-heavy rollouts in MCTS still remains poorly understood, with surprising results demonstrating that better-informed rollouts often result in worse-performing agents. We present experimental evidence suggesting that, under certain smoothness conditions, uniformly random simulation policies preserve

    UCT calls for rollouts to be performed by randomly selecting actions until a terminal state is reached. The outcome of the simulation is then propagated to the root of the tree. Averaging these results over many iterations performs remarkably well, despite the fact that actions during the simulation are executed completely at random. As the outcome of the simulations directly affects the entire algorithm, one might expect that the manner in which they are performed has a major effect on the overall strength of the algorithm. A natural assumption to make is that completely random simulations are not ideal, since they do not map to realistic actions.
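The baseline simulation policy the paper analyses is the uniformly random rollout prescribed by UCT. A minimal sketch of such a rollout, assuming game callbacks; it is an illustration, not the authors' experimental code.

```python
import random


def random_rollout(state, legal_moves, apply, is_terminal, outcome, max_steps=200):
    """Play uniformly random moves until a terminal state (or a step limit) is
    reached and return the outcome, which is then averaged into the values of
    the nodes visited on the way down."""
    steps = 0
    while not is_terminal(state) and steps < max_steps:
        state = apply(state, random.choice(legal_moves(state)))
        steps += 1
    return outcome(state)   # e.g. +1 / 0 / -1 from the root player's perspective
```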
  • CS 188: Artificial Intelligence Game Playing State-Of-The-Art
    CS 188: Artificial Intelligence
    Adversarial Search
    Instructors: Pieter Abbeel & Dan Klein, University of California, Berkeley
    [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley (ai.berkeley.edu).]

    Game Playing State-of-the-Art
    § Checkers: 1950: First computer player. 1994: First computer champion: Chinook ended 40-year-reign of human champion Marion Tinsley using complete 8-piece endgame. 2007: Checkers solved!
    § Chess: 1997: Deep Blue defeats human champion Gary Kasparov in a six-game match. Deep Blue examined 200M positions per second, used very sophisticated evaluation and undisclosed methods for extending some lines of search up to 40 ply. Current programs are even better, if less historic.
    § Go: Human champions are now starting to be challenged by machines. In go, b > 300! Classic programs use pattern knowledge bases, but big recent advances use Monte Carlo (randomized) expansion methods.

    Game Playing State-of-the-Art / Behavior from Computation
    § Checkers: 1950: First computer player. 1994: First computer champion: Chinook ended 40-year-reign of human champion Marion Tinsley using complete 8-piece endgame. 2007: Checkers solved!
    § Chess: 1997: Deep Blue defeats human champion Gary Kasparov in a six-game match. Deep Blue examined 200M positions per second, used very sophisticated evaluation and undisclosed methods for extending some lines of search up to 40 ply. Current programs are even better, if less historic.
    § Go: 2016: Alpha GO defeats human champion. Uses Monte Carlo Tree Search, learned
  • A Survey of Monte Carlo Tree Search Methods
    A Survey of Monte Carlo Tree Search Methods

    IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012

    Cameron Browne, Member, IEEE, Edward Powley, Member, IEEE, Daniel Whitehouse, Member, IEEE, Simon Lucas, Senior Member, IEEE, Peter I. Cowling, Member, IEEE, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton

    Abstract—Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

    Index Terms—Monte Carlo Tree Search (MCTS), Upper Confidence Bounds (UCB), Upper Confidence Bounds for Trees (UCT), Bandit-based methods, Artificial Intelligence (AI), Game search, Computer Go.

    1 INTRODUCTION

    Monte Carlo Tree Search (MCTS) is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results. It has already had a profound impact on Artificial Intelligence (AI) approaches for domains that can be represented as trees of sequential decisions, particularly games and planning problems.
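The bandit rule referenced in the index terms, UCB1, is what UCT applies at every tree node during selection. A one-function sketch of that rule (variable names are mine, not the survey's):

```python
import math


def ucb1(parent_visits, child_visits, child_value_mean, c=math.sqrt(2)):
    """UCB1: exploitation term plus an exploration bonus that shrinks with visits."""
    if child_visits == 0:
        return math.inf          # always try an unvisited child first
    return child_value_mean + c * math.sqrt(math.log(parent_visits) / child_visits)
```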
  • Monte Carlo *-Minimax Search
    Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

    Monte Carlo *-Minimax Search

    Marc Lanctot, Department of Knowledge Engineering, Maastricht University, Netherlands, [email protected]
    Abdallah Saffidine, LAMSADE, Université Paris-Dauphine, France, abdallah.saffi[email protected]
    Joel Veness and Christopher Archibald, Department of Computing Science, University of Alberta, Canada, {veness@cs., [email protected]}
    Mark H. M. Winands, Department of Knowledge Engineering, Maastricht University, Netherlands, [email protected]

    Abstract

    This paper introduces Monte Carlo *-Minimax Search (MCMS), a Monte Carlo search algorithm for turn-based, stochastic, two-player, zero-sum games of perfect information. The algorithm is designed for the class of densely stochastic games; that is, games where one would rarely expect to sample the same successor state multiple times at any particular chance node. Our approach combines sparse sampling techniques from MDP planning with classic pruning techniques developed for adversarial expectimax planning. We compare and

    the search enhancements from the classic αβ literature cannot be easily adapted to MCTS. The classic algorithms for stochastic games, EXPECTIMAX and *-Minimax (Star1 and Star2), perform look-ahead searches to a limited depth. However, the running time of these algorithms scales exponentially in the branching factor at chance nodes as the search horizon is increased. Hence, their performance in large games often depends heavily on the quality of the heuristic evaluation function, as only shallow searches are possible. One way to handle the uncertainty at chance nodes would be forward pruning [Smith and Nau, 1993], but the performance gain until now has been small [Schadd et al., 2009].
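The sparse sampling idea mentioned in the abstract replaces full enumeration of a chance node's outcomes with a small number of weighted samples whose values are averaged. A rough sketch under assumed inputs; this is not the MCMS implementation itself.

```python
import random


def chance_node_value(successors, value_of, num_samples=10):
    """Sparse sampling at a chance node: draw a few successors according to their
    probabilities and average their (recursively computed) values, instead of
    expanding every stochastic outcome as plain EXPECTIMAX would.

    successors: list of (probability, state) pairs whose probabilities sum to 1.
    value_of:   function computing the value of a sampled state, e.g. a
                depth-limited *-minimax search below the chance node.
    """
    probs = [p for p, _ in successors]
    states = [s for _, s in successors]
    sampled = random.choices(states, weights=probs, k=num_samples)
    return sum(value_of(s) for s in sampled) / num_samples
```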
  • Shogi Yearbook 2015
    Shogi Yearbook 2015 SHOGI24.COM SHOGI YEARBOOK 2015 Title match games, Challenger’s tournaments, interviews with WATANABE Akira and HIROSE Akihito, tournament reports, photos, Micro Shogi, statistics, … This yearbook is a free PDF document Shogi Yearbook 2015 Content Content Content .................................................................................................................................................... 2 Just a few words ... .................................................................................................................................. 5 64. Osho .................................................................................................................................................. 6 64. Osho league ................................................................................................................................... 6 64th Osho title match ........................................................................................................................... 9 Game 1 ............................................................................................................................................. 9 Game 2 ........................................................................................................................................... 12 Game 3 ........................................................................................................................................... 15 Game 4 ..........................................................................................................................................
  • Using Monte Carlo Tree Search As a Demonstrator Within Asynchronous Deep RL
    Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL

    Bilal Kartal∗, Pablo Hernandez-Leal∗ and Matthew E. Taylor
    {bilal.kartal, pablo.hernandez, [email protected]}
    Borealis AI, Edmonton, Canada

    Abstract

    Deep reinforcement learning (DRL) has achieved great successes in recent years with the help of novel methods and higher compute power. However, there are still several challenges to be addressed such as convergence to locally optimal policies and long training times. In this paper, firstly, we augment the Asynchronous Advantage Actor-Critic (A3C) method with a novel self-supervised auxiliary task, i.e. Terminal Prediction, measuring temporal closeness to terminal states, namely A3C-TP. Secondly, we propose a new framework where planning algorithms such as Monte Carlo tree search or other sources of (simulated) demonstrators can be integrated to asynchronous distributed DRL methods. Com-

    There are several ways to get the best of both DRL and search methods. For example, AlphaGo Zero (Silver et al. 2017) and Expert Iteration (Anthony, Tian, and Barber 2017) concurrently proposed the idea of combining DRL and MCTS in an imitation learning framework where both components improve each other. These works combine search and neural networks sequentially. First, search is used to generate an expert move dataset which is used to train a policy network (Guo et al. 2014). Second, this network is used to improve expert search quality (Anthony, Tian, and Barber 2017), and this is repeated in a loop. However, expert move data collection by vanilla search algorithms can be slow in a sequential framework (Guo et al.
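A3C-TP's auxiliary task predicts temporal closeness to the terminal state alongside the usual policy and value outputs. The sketch below shows one plausible way to form such targets and fold them into the loss; the target definition and the weighting factor are assumptions for illustration, not necessarily the authors' exact formulation.

```python
def terminal_prediction_targets(episode_length):
    """Temporal closeness to the terminal state: 0 at the first step, 1 at the last."""
    denom = max(episode_length - 1, 1)
    return [t / denom for t in range(episode_length)]


def combined_loss(policy_loss, value_loss, tp_predictions, tp_targets, beta=0.5):
    """A3C loss augmented with a mean-squared terminal-prediction term."""
    tp_loss = sum((p - y) ** 2 for p, y in zip(tp_predictions, tp_targets)) / len(tp_targets)
    return policy_loss + value_loss + beta * tp_loss
```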