An Introduction to Reinforcement Learning and the AlphaZero AI
James Frost, Data Platform Director, Quorum

About the speaker
• James Frost
• Data Platform Director at Quorum, an Edinburgh-based IT consultancy
• Recently completed an MSc in Data Science at Dundee University
• Final-year project was to build a backgammon AI using techniques from DeepMind's AlphaGo; it reached human Grandmaster level

Session agenda
• An introduction to reinforcement learning concepts
• Monte-Carlo learning
• Neural networks as function approximators
• Issues with reinforcement learning and deep neural networks
• DeepMind and AlphaGo

What is reinforcement learning?

Types of Machine Learning
• Supervised learning: starts with a dataset of known examples; the engine then trains "by example" ("A" is for Apple… that is not an apple!)
• Unsupervised learning: starts with a dataset where the categories might not be known, and looks for patterns, similarities or clusters that may be of interest – for example, customer segmentation or fraud investigation
• Reinforcement learning: based on an agent interacting with the environment and getting feedback in the form of a reward mechanism

How do dogs learn?
"All training should be reward based. Giving your dog something they really like such as food, toys or praise when they show a particular behaviour means that they are more likely to do it again." – RSPCA website

Principles of reinforcement learning
[Diagram: the agent–environment loop – the agent chooses an action At, and the environment returns the next state St+1 and a reward Rt.]

Rewards
• A reward Rt is a scalar feedback signal
• It indicates the value of carrying out step t
• Examples: Kill John Connor (+10,000); getting to the destination (+100); falling over (-50); taking a step (-1); money won or lost – win (+1) or loss (-1), e.g. poker or the stock market

Agent
The agent generally has the following components:
• Model – the agent's representation of the environment
• Policy – how the agent behaves
• Value function – an estimate of how good a state or action is

Model
• The environment state is the environment's private representation, and is often not visible to the agent
• The model is the agent's representation of the environment state, built up through observation

Value Functions
"Almost all reinforcement learning algorithms involve estimating value functions that estimate how good it is for the agent to be in a given state… the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values." (Sutton, 2017)

Value Functions / Policy – chess
A sample value function for chess might be the estimated chance of winning from that position.
Chess policy: from each state, calculate all legal moves and move to the successor state with the highest value function – or, from each state, pick the move with the highest action value. If the value function is accurate, this greedy policy is the optimal policy.

Principles of reinforcement learning
1. Accurately estimating the value function is critical for reinforcement learning.
But how do we do that?

Technique 1 – Monte-Carlo Learning
• Play a large number of games at random.
• Record how many times each state is seen, and how many games were won from that state (or action). This gives us the estimated value function.
• This is the evaluation step for the random policy (a minimal sketch of this idea follows below).
[Figure: tic-tac-toe action values and state values estimated from random play – roughly a 41% win rate from the illustrated position.]
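To make the Monte-Carlo idea concrete, here is a small, self-contained Python sketch (not from the talk – all function and variable names are invented for illustration). It plays tic-tac-toe entirely at random and estimates a state-value table as the average win/loss outcome, from X's point of view, of the games that passed through each state – first-visit Monte-Carlo evaluation of the random policy.

```python
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_random_game():
    """Both players move at random. Returns the states seen after X's moves
    and the final reward from X's point of view (+1 win, -1 loss, 0 draw)."""
    board, player, seen = [" "] * 9, "X", []
    while True:
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if not moves:
            return seen, 0.0                     # board full: draw
        board[random.choice(moves)] = player
        if player == "X":
            seen.append("".join(board))          # record states from X's side
        if winner(board):
            return seen, 1.0 if player == "X" else -1.0
        player = "O" if player == "X" else "X"

# First-visit Monte-Carlo evaluation of the random policy:
# value(s) is the average final reward of the random games that visited state s.
visits, totals = defaultdict(int), defaultdict(float)
for _ in range(50_000):
    states, reward = play_random_game()
    for s in set(states):
        visits[s] += 1
        totals[s] += reward

value = {s: totals[s] / visits[s] for s in visits}
print(f"{len(value)} states evaluated")
```

Acting greedily with respect to the resulting value table – always moving to the successor state with the highest estimated value – is exactly the policy-improvement step described on the next slides.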
Now use Monte-Carlo Control
• Now play another batch of games, but this time act "greedily" with respect to the value predicted by the previous policy.
• Again record how many times each state is seen, and how many games were won from that state.
[Figure: the tic-tac-toe values re-estimated under the greedy policy – now a 100% win rate from the illustrated position.]

Don't get too greedy…
• However, by only picking the best moves (greedy) we sometimes miss moves that might turn out to be better.
• So we need to introduce an element of exploration.
• ε-greedy learning works as follows:
  • With probability (1-ε), make a greedy move.
  • With probability ε, move at random.
• ε is often reduced as the number of episodes increases – this scheme is guaranteed to converge to the optimal policy.

Learning cycle
• Policy evaluation / policy improvement is the core concept of reinforcement learning.
• By acting greedily with respect to the value function we can create a new, improved policy.
• Iterating this process trends towards the optimal policy.
[Diagram: alternating policy evaluation and policy improvement, converging on the optimal policy.]

Problems with large state spaces
Unfortunately, for most useful problems we cannot store a value for every state:
• Chess has around 10^47 states
• Go has around 10^170 states
• How many states would it take to record every possible scenario for a driverless car or a Terminator robot?
So we need some form of function approximator.

Neural networks as value approximators
[Figure: a neural network that takes an encoded board position as input and outputs an estimated value.]

Monte-Carlo Learning and Neural Networks
• Same principles as tic-tac-toe.
• Play a number of games at random.
• Sample states (or state–action pairs) from those games, together with the reward they eventually led to, discounted by the number of steps taken to reach it.
• Feed these samples into the neural network for training.
• Now repeat the process, but instead of random play, use the neural network to predict the best moves, still picking some moves at random (ε-greedy).

Deep Neural Networks as Value Functions
Deep neural networks (DNNs) would appear to be a great candidate for a value function approximator. However, they can suffer from the following causes of instability:
• correlations present in the sequence of observations
• small updates to action-value estimates (Q) may significantly change the policy

Atari
• DeepMind paper from 2015.
• Taught a deep neural network to play Atari games, such as Breakout and Kung-Fu Master.
• Achieved above-human level in over half of the games – in some cases superhuman performance (e.g. Breakout).
• The network was trained using a technique based on Q-learning, but introduced two important concepts: experience replay and a target Q network.
Source: Human-level control through deep reinforcement learning. Nature, 518. https://doi.org/10.1038/nature14236

Experience Replay and the target Q network
[Diagram: the agent interacts with the environment and stores each transition (St, At, Rt, St+1) in a replay buffer; a second copy of the network learns the new policy from random samples drawn from the buffer.]
A minimal sketch of both mechanisms follows below.
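The two stabilising ideas from the Atari work – experience replay and a separate target network – can be sketched in a few lines of Python. This is an illustrative outline only (the class and function names are invented here, and the neural network itself is abstracted away as a callable), not DeepMind's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state, done) transitions and hand back
    random mini-batches, which breaks the correlation between consecutive
    observations that destabilises training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def q_targets(batch, target_q, gamma=0.99):
    """Compute DQN-style regression targets r + gamma * max_a Q_target(s', a).
    Here `target_q` stands in for the frozen target network: any callable
    mapping a state to a dict of action values."""
    targets = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state).values())
        targets.append((state, action, reward + bootstrap))
    return targets
```

In a full DQN, the online network is fitted by gradient descent to mini-batches of these targets sampled from the buffer, and its weights are only copied into the target network every N steps, so the agent is not chasing a target that moves with every update.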
Effect of Target Q and Experience Replay

AlphaGo
• Initially trained on a database of expert human games, then by self-play.
• After months of training it beat Lee Sedol.
• AlphaGo Zero was then trained from completely random play, with no human data.
• After 36 hours of training, AlphaGo Zero had surpassed AlphaGo Lee, and it went on to beat that version 100-0.
"Each time you put knowledge into a system you are actually handicapping it." – David Silver (Silver, 2015)

AlphaGo Zero
DeepMind's AlphaGo Zero project applied the following principles:
• only the raw board position is used as input features
• a simple Monte-Carlo Tree Search (MCTS) is used to evaluate positions and sample moves
• a residual neural network architecture
• a dual-headed network (a shared body with separate policy and value heads)

Monte-Carlo tree search
[Figure: the MCTS cycle of selection, expansion/evaluation and back-up used during self-play. Source: Mastering the game of Go without human knowledge. https://doi.org/10.1038/nature24270]

AlphaZero Success
• Within 4 hours it had mastered chess from first principles, beating the world's then-strongest chess engine.
• Within 2 hours it had beaten Elmo at shogi.
• (Running on 5,000 TPUs.)

What is most important? Data OR Algorithm

Reinforcement learning concepts summary
1. Value estimation is critical to the success of a reinforcement learning algorithm.
2. Monte-Carlo learning is a relatively simple technique to get started with and can be applied to a wide range of problems.
3. Balancing exploration against exploitation is critical.
4. Deep neural networks can be unstable with reinforcement learning; deep Q-networks with experience replay can help stabilise this.

Where next
• The reinforcement learning community is now focusing on games with imperfect information and much deeper strategies:
  • No-limit Texas Hold'em poker – DeepStack
  • StarCraft II – AlphaStar
  • Dota 2 – OpenAI Five
  • OpenSpiel
• Ultimately the aim is to apply deep reinforcement learning to real-world scenarios:
  • Energy efficiency
  • Self-driving cars
  • Protein folding
  • Medical diagnosis
  • Materials research
  • Artificial General Intelligence

References / Useful links
• Netflix AlphaGo documentary
• Silver, D. (2015) UCL course on RL
• Sutton, R. S. and Barto, A. G. (2017) Reinforcement Learning: An Introduction
• OpenSpiel – https://github.com/deepmind/open_spiel
• DeepMind (2015) Human-level control through deep reinforcement learning
• DeepMind (2017) Mastering the game of Go without human knowledge
• DeepMind (2018) AlphaZero: Shedding new light on the grand games of chess, shogi and Go, deepmind.com

Thank You
Quorum Network Resources Ltd
18 Greenside Lane, Edinburgh, EH1 3AH
www.qnrl.com | [email protected]
Reg. No. SC 196645, Registered Office: 18 Greenside Lane, Edinburgh, EH1 3AH