Q-Learning Christopher Watkins and Peter Dayan


Noga Zaslavsky
The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679)
November 1, 2015

Q-Learning (Watkins & Dayan, 1992)

Outline

1 Introduction
    What is reinforcement learning?
    Modeling the problem
    Bellman optimality
2 Q-Learning
    Algorithm
    Convergence analysis
3 Assumptions and Limitations
4 Summary

Introduction: What is reinforcement learning?

Reinforcement Learning: Agent-Environment Interaction

  • Reinforcement learning is learning how to behave in order to maximize value or achieve some goal
  • It is not supervised learning, and it is not unsupervised learning
  • There is no expert to tell the learning agent right from wrong - it is forced to learn from its own experience
  • The learning agent receives feedback from its environment - it can learn by trying different actions at different situations
  • There is a tradeoff between exploration and exploitation

[Figure: agent-environment interaction loop, with nodes "Environment" and "Agent"]

Toy Problem

[Figure: toy problem, with labels $s_f$, $s_0$, $s$]

Introduction: Modeling the problem

Markov Decision Process (MDP)

  • $S \in \mathcal{S}$ - the state of the system
  • $A \in \mathcal{A}$ - the agent's action
  • $p(s' \mid s, a)$ - the dynamics of the system
  • $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ - the reward function (possibly stochastic)
  • Definition: MDP is the tuple $\langle \mathcal{S}, \mathcal{A}, p, r \rangle$
  • $\pi(a \mid s)$ - the agent's policy

[Figure: graphical model over time, with state nodes $S_{t-1}, S_t, S_{t+1}, S_{t+2}$, reward nodes $R_{t-1}, R_t, R_{t+1}$, and action nodes $A_{t-1}, A_t, A_{t+1}$]

The Value Function

  • The un-discounted value function for any given policy $\pi$:

    \[ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \,\middle|\, s_0 = s \right] \]

    $T$ is the horizon - can be finite, episodic, or infinite

  • The discounted value function for any given policy $\pi$:

    \[ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right] \]

    $0 < \gamma < 1$ is the discounting factor

  • The goal: find a policy $\pi^*$ that maximizes the value $\forall s \in \mathcal{S}$:

    \[ V^*(s) = V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s) \]

Introduction: Bellman optimality

The Bellman Equation

  • The value function can be written recursively:

    \[ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^{\pi}(s') \right] \]

  • The optimal value satisfies the Bellman equation:

    \[ V^*(s) = \max_{\pi} \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^*(s') \right] \]

  • $\pi^*$ is not necessarily unique, but $V^*$ is unique

The Q-Function

  • The value of taking action $a$ at state $s$:

    \[ Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ V(s') \right] \]

  • If we know $V^*$ then an optimal policy is to decide deterministically:

    \[ a^*(s) = \arg\max_{a} Q^*(s, a) \]

  • This means that

    \[ V^*(s) = \max_{a} Q^*(s, a) \]

  • Conclusion: learning $Q^*$ makes it easy to obtain an optimal policy
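The relations between $Q^*$, $V^*$, and the greedy policy can be made concrete with a small numerical sketch. The two-state MDP below is hypothetical and does not come from the slides; its states, actions, rewards, and the name `q_value_iteration` are all assumptions made for illustration. The code also assumes the dynamics $p$ and the reward $r$ are known, so it is a model-based fixed-point iteration on the Bellman optimality equation, not the model-free Q-learning update of Watkins and Dayan.

```python
# Hypothetical two-state MDP (not from the slides), for illustration only.
GAMMA = 0.5  # discounting factor, 0 < gamma < 1

STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = stay in the current state, 1 = move to the other state

# P[(s, a)] maps each next state s' to p(s' | s, a).
P = {
    (0, 0): {0: 1.0},
    (0, 1): {1: 1.0},
    (1, 0): {1: 1.0},
    (1, 1): {0: 1.0},
}

# R[(s, a)] = r(s, a); only staying in state 1 is rewarded.
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}


def q_value_iteration(tol=1e-10, max_iters=10_000):
    """Iterate Q(s,a) <- r(s,a) + gamma * E_{s'}[max_a' Q(s',a')] to a fixed point."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_iters):
        # V(s) = max_a Q(s, a), computed from the current estimate.
        v = {s: max(q[(s, a)] for a in ACTIONS) for s in STATES}
        new_q = {
            (s, a): R[(s, a)]
            + GAMMA * sum(prob * v[s2] for s2, prob in P[(s, a)].items())
            for s in STATES
            for a in ACTIONS
        }
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break
    return q


def greedy_policy(q):
    """Deterministic policy a*(s) = argmax_a Q(s, a)."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


q_star = q_value_iteration()
v_star = {s: max(q_star[(s, a)] for a in ACTIONS) for s in STATES}
pi_star = greedy_policy(q_star)
```

For this toy model the fixed point can be checked by hand: staying in state 1 forever gives $V^*(1) = 1 / (1 - \gamma) = 2$, and moving from state 0 to state 1 gives $V^*(0) = \gamma \cdot 2 = 1$, so the greedy policy moves in state 0 and stays in state 1.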