PopArt Normalization, R2D2, MuZero


NPFL122, Lecture 11
Milan Straka, December 14, 2020
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

PopArt Normalization

An improvement of IMPALA from September 2018, which performs normalization of task rewards instead of just reward clipping. PopArt stands for Preserving Outputs Precisely, while Adaptively Rescaling Targets.

Assume the value estimate $v(s; \theta, \sigma, \mu)$ is computed using a normalized value predictor $n(s; \theta)$,

$$v(s; \theta, \sigma, \mu) \stackrel{\text{def}}{=} \sigma\, n(s; \theta) + \mu,$$

and further assume that $n(s; \theta)$ is an output of a linear function,

$$n(s; \theta) \stackrel{\text{def}}{=} \omega^T f(s; \theta - \{\omega, b\}) + b.$$

We can update $\sigma$ and $\mu$ using an exponentially moving average with decay rate $\beta$ (in the paper, the first moment $\mu$ and the second moment $\upsilon$ are tracked, and the standard deviation is computed as $\sigma = \sqrt{\upsilon - \mu^2}$; a decay rate of $\beta = 3 \cdot 10^{-4}$ is employed).

Utilizing the parameters $\mu$ and $\sigma$, we can normalize the observed (unnormalized) returns as $(G - \mu)/\sigma$ and use an actor-critic algorithm with advantage $(G - \mu)/\sigma - n(S; \theta)$.

However, in order to make sure the value function estimate does not change when the normalization parameters change, the parameters $\omega, b$ used to compute the value estimate

$$v(s; \theta, \sigma, \mu) \stackrel{\text{def}}{=} \sigma \big( \omega^T f(s; \theta - \{\omega, b\}) + b \big) + \mu$$

are updated under any change $\mu \to \mu'$ and $\sigma \to \sigma'$ as

$$\omega' \leftarrow \frac{\sigma}{\sigma'} \omega, \qquad b' \leftarrow \frac{\sigma b + \mu - \mu'}{\sigma'}.$$

In multi-task settings, we train a task-agnostic policy and task-specific value functions (therefore, $\mu$, $\sigma$ and $n(s; \theta)$ are vectors).

PopArt Results

[Table 1 and Figures 1, 2 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.]
[Figure 3 of the same paper: normalization statistics on chosen environments.]
[Figures 4, 5 of the same paper.]

Transformed Rewards

So far, we have clipped the rewards in DQN on Atari environments.

Consider a Bellman operator $\mathcal{T}$,

$$(\mathcal{T} q)(s, a) \stackrel{\text{def}}{=} \mathbb{E}_{s', r \sim p}\Big[ r + \gamma \max_{a'} q(s', a') \Big].$$

Instead of reducing the magnitude of rewards, we might use a function $h : \mathbb{R} \to \mathbb{R}$ to reduce their scale. We define a transformed Bellman operator $\mathcal{T}_h$ as

$$(\mathcal{T}_h q)(s, a) \stackrel{\text{def}}{=} \mathbb{E}_{s', r \sim p}\Big[ h\big( r + \gamma \max_{a'} h^{-1}(q(s', a')) \big) \Big].$$

It is easy to prove the following two propositions from the 2018 paper Observe and Look Further: Achieving Consistent Performance on Atari by Tobias Pohlen et al.

1. If $h(z) = \alpha z$ for $\alpha > 0$, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_* = \alpha q_*$.
   The statement follows from the fact that it is equivalent to scaling the rewards by a constant $\alpha$.

2. When $h$ is strictly monotonically increasing and the MDP is deterministic, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_*$.
   This second proposition follows from
   $$h \circ q_* = h \circ \mathcal{T} q_* = h \circ \mathcal{T}\big(h^{-1} \circ h \circ q_*\big) = \mathcal{T}_h\big(h \circ q_*\big),$$
   where the last equality only holds if the MDP is deterministic.
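As an illustration, the following is a minimal sketch (not part of the lecture) of computing this transformed target for a single transition, assuming a generic strictly increasing transform h with inverse h_inv; all names are hypothetical, and the concrete h used on Atari is given below.

```python
import numpy as np

def transformed_target(reward, next_q, gamma, h, h_inv):
    """One-sample transformed Bellman target h(r + gamma * max_a' h^{-1}(q(s', a'))).

    `next_q` is the vector of action values q(s', .) of the successor state;
    `h` and `h_inv` are the transform and its inverse (hypothetical names).
    """
    bootstrap = np.max(h_inv(next_q))     # maximize in the original, un-transformed value scale
    return h(reward + gamma * bootstrap)  # squash the target back into the transformed scale

# With the identity transform this reduces to the ordinary Bellman target
# (proposition 1 with alpha = 1).
identity = lambda x: x
target = transformed_target(1.0, np.array([0.3, 0.9]), gamma=0.99, h=identity, h_inv=identity)
```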
The authors use the following transformation for the Atari environments:

$$h(x) \stackrel{\text{def}}{=} \operatorname{sign}(x)\big(\sqrt{|x| + 1} - 1\big) + \varepsilon x$$

with $\varepsilon = 10^{-2}$. The additive regularization term ensures that $h^{-1}$ is Lipschitz continuous. It is straightforward to verify that

$$h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)} - 1}{2\varepsilon}\right)^2 - 1\right).$$

Recurrent Replay Distributed DQN (R2D2)

Proposed in 2019 to study the effects of recurrent state, experience replay and distributed training.

R2D2 utilizes prioritized replay, n-step double Q-learning with $n = 5$, convolutional layers followed by a 512-dimensional LSTM feeding a duelling architecture, experience generated by a large number of actors (256, each performing approximately 260 steps per second), and learning from batches by a single learner (achieving 5 updates per second using mini-batches of 64 sequences of length 80).

Instead of individual transitions, the replay buffer consists of fixed-length ($m = 80$) sequences of $(s, a, r)$, with adjacent sequences overlapping by 40 time steps.

[Figure 1, Tables 1, 2 and Figures 2, 9 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.]

Ablations compare reward clipping instead of value rescaling (Clipped), a smaller discount factor $\gamma = 0.99$ (Discount), and a feed-forward variant of R2D2. Furthermore, the life-loss reset ablation evaluates resetting an episode on life loss, with roll preventing value bootstrapping (but not LSTM unrolling).

[Figures 4, 7 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.]

Utilization of LSTM Memory During Inference

[Figure 5 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.]

MuZero

The MuZero algorithm extends AlphaZero with a trained model, alleviating the requirement for known MDP dynamics.
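Concretely, where AlphaZero's MCTS expands a search node by stepping a known simulator, MuZero expands it by stepping its learned model. Below is a minimal sketch of one node expansion under such a learned model; all names are hypothetical, and the actual model components are described next.

```python
def expand_node(internal_state, action, dynamics_fn, prediction_fn):
    """One MCTS node expansion using a learned model instead of an environment simulator.

    `dynamics_fn(state, action)` returns (predicted_reward, next_internal_state) and
    `prediction_fn(state)` returns (policy_prior, value_estimate); both are learned
    networks, so no real environment step is required (hypothetical names).
    """
    predicted_reward, next_state = dynamics_fn(internal_state, action)
    policy_prior, value_estimate = prediction_fn(next_state)
    return next_state, predicted_reward, policy_prior, value_estimate
```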
At each time-step $t$, for each of $1 \le k \le K$ steps, a model $\mu_\theta$ with parameters $\theta$, conditioned on past observations $o_1, \ldots, o_t$ and future actions $a_{t+1}, \ldots, a_{t+k}$, predicts three future quantities:

- the policy $p_t^k \approx \pi(a_{t+k+1} \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k})$,
- the value function $v_t^k \approx \mathbb{E}\big[u_{t+k+1} + \gamma u_{t+k+2} + \ldots \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k}\big]$,
- the immediate reward $r_t^k \approx u_{t+k}$,

where $u_i$ are the observed rewards and $\pi$ is the behaviour policy.

At each time-step $t$ (omitted from now on for simplicity), the model is composed of three components: a representation function, a dynamics function and a prediction function.

- The dynamics function, $r^k, s^k = g_\theta(s^{k-1}, a^k)$, simulates the MDP dynamics and predicts an immediate reward $r^k$ and an internal state $s^k$. The internal state has no explicit semantics; its only goal is to accurately predict rewards, values and policies.
- The prediction function, $p^k, v^k = f_\theta(s^k)$, computes the policy and the value function, similarly to AlphaZero.
- The representation function, $s^0 = h_\theta(o_1, \ldots, o_t)$, generates an internal state encoding the past observations.

[Figure 1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.]

To select a move, we employ an MCTS algorithm similar to AlphaZero's. It produces a policy $\pi_t$ and a value estimate $\nu_t$, and an action is then sampled from the policy, $a_{t+1} \sim \pi_t$.

During training, we utilize a sequence of $K$ moves. We estimate the return using bootstrapping, $z_t = u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n}$. The values $K = 5$ and $n = 10$ are used in the paper.

The loss is then composed of the following components:

$$L_t(\theta) = \sum_{k=0}^{K} \Big[ L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) \Big] + c \|\theta\|^2.$$

Note that in Atari, rewards are scaled by $\operatorname{sign}(x)\big(\sqrt{|x| + 1} - 1\big) + \varepsilon x$ for $\varepsilon = 10^{-3}$, and the authors utilize a cross-entropy loss with 601 categories for values $-300, \ldots, 300$, which they claim to be more stable.

MuZero Model

$$s^0 = h_\theta(o_1, \ldots, o_t),$$
$$r^k, s^k = g_\theta(s^{k-1}, a^k),$$
$$p^k, v^k = f_\theta(s^k),$$

so that overall $p^k, v^k, r^k = \mu_\theta(o_1, \ldots, o_t, a^1, \ldots, a^k)$.

Search:

$$\nu_t, \pi_t = \mathrm{MCTS}(s_t^0, \mu_\theta), \qquad a_t \sim \pi_t.$$

MuZero Learning Rule

$$p_t^k, v_t^k, r_t^k = \mu_\theta(o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k}),$$

$$z_t = \begin{cases} u_T & \text{for games,} \\ u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n} & \text{for general MDPs,} \end{cases}$$

$$L_t(\theta) = \sum_{k=0}^{K} \Big[ L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) \Big] + c \|\theta\|^2.$$

Losses:

$$L^r(u, r) = \begin{cases} 0 & \text{for games,} \\ \phi(u)^T \log r & \text{for general MDPs,} \end{cases}$$

$$L^v(z, q) = \begin{cases} (z - q)^2 & \text{for games,} \\ \phi(z)^T \log q & \text{for general MDPs,} \end{cases}$$

$$L^p(\pi, p) = \pi^T \log p.$$

MuZero – Evaluation

[Figure 2 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.]
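To make the learning rule concrete, here is a minimal sketch of computing the bootstrapped value targets $z_t$ used above; the indexing convention (rewards[i] holding $u_{i+1}$) and all names are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def n_step_value_targets(rewards, search_values, gamma, n):
    """Targets z_t = u_{t+1} + ... + gamma^{n-1} u_{t+n} + gamma^n nu_{t+n}.

    `rewards[i]` holds u_{i+1} (the reward observed after step i) and
    `search_values[i]` holds the MCTS value estimate nu_i (hypothetical layout).
    """
    T = len(search_values)
    targets = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)                    # truncate the reward sum near episode end
        discounts = gamma ** np.arange(horizon)
        targets[t] = np.dot(discounts, rewards[t:t + horizon])
        if t + n < T:                              # bootstrap with the search value when available
            targets[t] += gamma ** n * search_values[t + n]
    return targets
```

During training, these targets are paired with the unrolled model predictions $v_t^k$, together with the reward targets $u_{t+k}$ and the search policies $\pi_{t+k}$, to form the $K$-step loss $L_t(\theta)$ above.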