Approximate Universal Artificial Intelligence and Self-play Learning for Games


APPROXIMATE UNIVERSAL ARTIFICIAL INTELLIGENCE AND SELF-PLAY LEARNING FOR GAMES

Doctor of Philosophy Dissertation
School of Computer Science and Engineering

Joel Veness

Supervisors: Kee Siong Ng, Marcus Hutter, Alan Blair, William Uther, John Lloyd

January 2011

Joel Veness: Approximate Universal Artificial Intelligence and Self-play Learning for Games, Doctor of Philosophy Dissertation, © January 2011

When we write programs that learn, it turns out that we do and they don't.
— Alan Perlis

ABSTRACT

This thesis is split into two independent parts.

The first part is an investigation of some practical aspects of Marcus Hutter's Universal Artificial Intelligence theory [29]. The main contributions are to show how a very general agent can be built and analysed using the mathematical tools of this theory. Before the work presented in this thesis, it was an open question as to whether this theory was of any relevance to reinforcement learning practitioners. This work suggests that it is indeed relevant and worthy of future investigation.

The second part of this thesis looks at self-play learning in two-player, deterministic, adversarial turn-based games. The main contribution is the introduction of a new technique for training the weights of a heuristic evaluation function from data collected by classical game tree search algorithms. This method is shown to outperform previous self-play training routines based on Temporal Difference learning when applied to the game of Chess. In particular, the main highlight was using this technique to construct a Chess program that learnt to play master-level Chess by tuning a set of initially random weights from self-play games.

PUBLICATIONS

A significant portion of the technical content of this thesis has been previously published at leading international Artificial Intelligence conferences and peer-reviewed journals.
The relevant material for Part I includes:

• Reinforcement Learning via AIXI Approximation [82]
  Joel Veness, Kee Siong Ng, Marcus Hutter, David Silver
  Association for the Advancement of Artificial Intelligence (AAAI), 2010.

• A Monte-Carlo AIXI Approximation [83]
  Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, David Silver
  Journal of Artificial Intelligence Research (JAIR), 2010.

The relevant material for Part II includes:

• Bootstrapping from Game Tree Search [81]
  Joel Veness, David Silver, Will Uther, Alan Blair
  Neural Information Processing Systems (NIPS), 2009.

We should not only use the brains we have, but all that we can borrow.
— Woodrow Wilson

ACKNOWLEDGMENTS

Special thanks to Kee Siong for all his time, dedication and encouragement. This work would not have been possible without him. Thanks to Marcus for having the courage to both write his revolutionary book and take me on late as a PhD student. Thanks to Alan, John and Will for many helpful discussions and suggestions. Finally, a collective thanks to all of my supervisors for letting me pursue my own interests.

Thank you to UNSW and NICTA for the financial support that allowed me to write this thesis and attend a number of overseas conferences. Thank you to Peter Cheeseman for giving me my first artificial intelligence job. Thanks to my external collaborators, in particular David Silver and Shane Legg. Thank you to Michael Bowling and the University of Alberta for letting me finish my thesis on campus. Thank you to the external examiners for their constructive feedback. Thank you to the international cricket community for years of entertainment. Thank you to the few primary, secondary and university teachers who kept me interested. Thank you to the science fiction community for being a significant source of inspiration, in particular Philip K. Dick, Ursula Le Guin, Frederik Pohl, Robert Heinlein and Vernor Vinge.
Thank you to my family, and in particular my mother, who in spite of significant setbacks managed to ultimately do a good job. Thank you to my friends for all the good times. Thank you to my fiancée Felicity for her love and support. Finally, thanks to everyone who gave a word or two of encouragement along the way.

DECLARATION

I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception, or in style, presentation and linguistic expression, is acknowledged.
Joel Veness

CONTENTS

Part I: Approximate Universal Artificial Intelligence

1  Reinforcement Learning via AIXI Approximation
   1.1 Overview
   1.2 Introduction
       1.2.1 The General Reinforcement Learning Problem
       1.2.2 The AIXI Agent
       1.2.3 AIXI as a Principle
       1.2.4 Approximating AIXI
   1.3 The Agent Setting
       1.3.1 Agent Setting
       1.3.2 Reward, Policy and Value Functions
   1.4 Bayesian Agents
       1.4.1 Prediction with a Mixture Environment Model
       1.4.2 Theoretical Properties
       1.4.3 AIXI: The Universal Bayesian Agent
       1.4.4 Direct AIXI Approximation

2  Expectimax Approximation
   2.1 Background
   2.2 Overview
   2.3 Action Selection at Decision Nodes
   2.4 Chance Nodes
   2.5 Estimating Future Reward at Leaf Nodes
   2.6 Reward Backup
   2.7 Pseudocode
   2.8 Consistency of ρUCT
   2.9 Parallel Implementation of ρUCT

3  Model Class Approximation
   3.1 Context Tree Weighting
       3.1.1 Krichevsky-Trofimov Estimator
       3.1.2 Prediction Suffix Trees
       3.1.3 Action-conditional PST
       3.1.4 A Prior on Models of PSTs
       3.1.5 Context Trees
       3.1.6 Weighted Probabilities
       3.1.7 Action Conditional CTW as a Mixture Environment Model
   3.2 Incorporating Type Information
   3.3 Convergence to the True Environment
   3.4 Summary
   3.5 Relationship to AIXI

4  Putting It All Together
   4.1 Convergence of Value
   4.2 Convergence to Optimal Policy
   4.3 Computational Properties
   4.4 Efficient Combination of FAC-CTW with ρUCT
   4.5 Exploration/Exploitation in Practice
   4.6 Top-level Algorithm

5  Results
   5.1 Empirical Results
       5.1.1 Domains
       5.1.2 Experimental Setup
       5.1.3 Results
       5.1.4 Discussion
       5.1.5 Comparison to 1-ply Rollout Planning
       5.1.6 Performance on a Challenging Domain
   5.2 Discussion
       5.2.1 Related Work
       5.2.2 Limitations

6  Future Work
   6.1 Future Scalability
       6.1.1 Online Learning of Rollout Policies for ρUCT
       6.1.2 Combining Mixture Environment Models
       6.1.3 Richer Notions of Context for FAC-CTW
       6.1.4 Incorporating CTW Extensions
       6.1.5 Parallelization of ρUCT
       6.1.6 Predicting at Multiple Levels of Abstraction
   6.2 Conclusion
   6.3 Closing Remarks

Part II: Learning from Self-play Using Game Tree Search

7  Bootstrapping from Game Tree Search
   7.1 Overview
   7.2 Introduction
   7.3 Background
   7.4 Minimax Search Bootstrapping
   7.5 Alpha-Beta Search Bootstrapping
       7.5.1 Updating Parameters in TreeStrap(αβ)
       7.5.2 The TreeStrap(αβ) Algorithm
   7.6 Learning Chess Program
   7.7 Experimental Results
       7.7.1 Relative Performance Evaluation
       7.7.2 Evaluation by Internet Play
   7.8 Conclusion

Bibliography

LIST OF FIGURES

Figure 1   A ρUCT search tree
Figure 2   An example prediction suffix tree
Figure 3   A depth-2 context tree (left); resultant trees after processing one (middle) and two (right) bits respectively
Figure 4   The MC-AIXI agent loop
Figure 5   The cheese maze
Figure 6   A screenshot (converted to black and white) of the PacMan domain
Figure 7   Average reward per cycle vs experience
Figure 8   Performance versus ρUCT search effort
Figure 9   Online performance on a challenging domain
Figure 10  Scaling properties on a challenging domain
Figure 11  Online performance when using a learnt rollout policy on the Cheese Maze
Figure 12  Left: TD, TD-Root and TD-Leaf backups. Right: RootStrap(minimax) and TreeStrap(minimax)
Figure 13  Performance when trained via self-play starting from random initial weights; 95% confidence intervals are marked at each data point; the x-axis uses a logarithmic scale

LIST OF TABLES

Table 1  Domain characteristics
Table 2  Binary encoding of the domains
Table 3  MC-AIXI(fac-ctw) model learning configuration
Table 4  U-Tree model learning configuration
Table 5  Resources required for (near) optimal performance by MC-AIXI(fac-ctw)
Table 6  Average reward per cycle: ρUCT versus 1-ply rollout planning
Table 7  Backups for various learning algorithms
Table 8  Best performance when trained by self play; 95% confidence intervals given
Table 9  Blitz performance at the Internet Chess Club

Part I

APPROXIMATE UNIVERSAL ARTIFICIAL INTELLIGENCE

Beware the Turing tar-pit, where everything is possible but nothing of interest is easy.
— Alan Perlis

1  REINFORCEMENT LEARNING VIA AIXI APPROXIMATION

1.1 Overview

This part of the thesis introduces a principled approach for the design of a scalable general reinforcement learning agent.
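As context for what "general reinforcement learning agent" means here, the following is a minimal sketch of the generic perceive-plan-act cycle such an agent repeats, in the spirit of the MC-AIXI agent loop listed among the figures. The `CountingModel`, the `plan` interface and the `env_step` signature are hypothetical stand-ins for illustration only; they are not the thesis's actual FAC-CTW environment model or ρUCT planner.

```python
# Illustrative perceive-plan-act loop for a general reinforcement learning
# agent. The model and planner below are crude, hypothetical stand-ins.

class CountingModel:
    """A toy environment model: frequency counts of (history, action) -> percept."""

    def __init__(self):
        self.counts = {}

    def update(self, history, action, percept):
        # Record one observed percept for this (history, action) context.
        key = (tuple(history), action)
        dist = self.counts.setdefault(key, {})
        dist[percept] = dist.get(percept, 0) + 1

    def predict(self, history, action):
        """Return the most frequently seen percept for this context, or None."""
        dist = self.counts.get((tuple(history), action))
        if not dist:
            return None
        return max(dist, key=dist.get)

def run_agent(env_step, actions, cycles, plan):
    """Interact for `cycles` steps: plan an action, act, observe, learn."""
    model, history, total_reward = CountingModel(), [], 0.0
    for _ in range(cycles):
        action = plan(model, history, actions)          # plan with current model
        observation, reward = env_step(action)          # act and perceive
        model.update(history, action, (observation, reward))  # learn
        history.append((action, observation, reward))
        total_reward += reward
    return model, total_reward
```

The point of the sketch is only the shape of the loop: the agent's model is updated from every interaction, and planning is always performed with respect to the model learnt so far, which is the cycle the following chapters instantiate with far stronger components.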