
Reinforcement Learning, Fast and Slow

Matthew Botvinick1,2, Sam Ritter1,3, Jane X. Wang1, Zeb Kurth-Nelson1,2, Charles Blundell1, Demis Hassabis1,2

1DeepMind, UK, 2University College London, UK, 3Princeton University, USA

March 22, 2019

Abstract

Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample inefficient – that is, it may simply be too slow – to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.

Powerful but slow: The first wave of Deep RL

Over just the past few years, revolutionary advances have occurred in artificial intelligence research, where a resurgence in neural network or ‘deep learning’ methods [1, 2] has fueled breakthroughs in image understanding [3, 4], natural language processing [5, 6], and many other areas. These developments have attracted growing interest from psychologists, psycholinguists and neuroscientists, curious about whether developments in AI might suggest new hypotheses concerning human cognition and brain function [7, 8, 9, 10, 11].

One area of AI research that appears particularly inviting from this perspective is deep reinforcement learning (Box 1). Deep RL marries neural network modeling with reinforcement learning, a set of methods for learning from rewards and punishments rather than from more explicit instruction [12]. After decades as an aspirational rather than practical idea, deep RL has within the last five years exploded into one of the most intense areas of AI research, generating super-human performance in tasks from video games [13] to poker [14] to multiplayer contests [15] to complex board games including go and chess [16, 17, 18, 19].

Beyond its inherent interest as an AI topic, deep RL would appear to hold special interest for psychology and neuroscience. The mechanisms that drive learning in deep RL were originally inspired by animal conditioning research [20], and are believed to relate closely to neural mechanisms for reward-based learning centering on dopamine [21]. At the same time, deep RL leverages neural networks to learn powerful representations that support generalization and transfer, key abilities of biological brains. Given these connections, deep RL would appear to offer a rich source of ideas and hypotheses for researchers interested in human and animal learning, both at the behavioral and neuroscientific levels. And indeed, researchers have started to take notice [7, 8].

At the same time, commentary on the first wave of deep RL research has also sounded a note of caution. At first blush, it appears that deep RL systems learn in a fashion quite different from humans. The hallmark of this difference, it has been argued, lies in the sample efficiency of human learning versus deep RL. Sample efficiency refers to the amount of data required for a learning system to attain any chosen target level of performance. On this measure, the initial wave of deep RL systems indeed appears drastically different from human learners. To attain expert human-level performance on tasks such as Atari video games or chess, deep RL systems have required many orders of magnitude more training data than human experts themselves [22]. In short, deep RL, at least in its initial incarnation, appears much too slow to offer a plausible model for human learning. Or so the argument has gone [23, 24].

The critique is indeed applicable to the first wave of deep RL methods, reported beginning around 2013 [e.g. 25]. However, even in the short time since then, important innovations have occurred in deep RL research, which show how the sample efficiency of deep RL can be dramatically increased. These methods mitigate the original demands made by deep RL for huge amounts of training data, effectively allowing deep RL to be fast. The emergence of these computational techniques revives deep RL as a candidate model of human learning and a source of insight for psychology and neuroscience. In the present review, we consider two key deep RL methods that mitigate the sample efficiency problem: episodic deep RL and meta-reinforcement learning. We examine how these techniques enable fast deep RL, and consider their potential implications for psychology and neuroscience.

Sources of slowness in deep RL

A key starting point for considering techniques for fast RL is to examine why initial methods for deep RL were in fact so slow. Here, we describe two of the primary sources of sample inefficiency. At the end of the paper, we will circle back to examine how the constellations of issues described by these two concepts are in fact connected.

The first source of slowness in deep RL is the requirement for incremental parameter adjustment. Initial deep RL methods (which are still very widely used in AI research) employed gradient descent to sculpt the connectivity of a deep neural network mapping from perceptual inputs to action outputs (see Box 1). As has been widely discussed not only in AI but also in psychology [26], the adjustments made during this form of learning must be small, in order to maximize generalization [27] and avoid overwriting the effects of earlier learning (an effect sometimes referred to as catastrophic interference). This demand for small step-sizes in learning is one source of slowness in the methods originally proposed for deep RL.

A second source is weak inductive bias. A basic lesson of learning theory is that any learning procedure necessarily faces a bias-variance trade-off: the stronger the initial assumptions the learning procedure makes about the patterns to be learned — that is, the stronger its initial inductive bias — the less data will be required for learning to be accomplished (assuming the initial inductive bias matches what is actually in the data). A learning procedure with weak inductive bias will be able to master a wider range of patterns (greater variance), but will in general be less sample efficient [28]. In effect, strong inductive bias is what allows fast learning. A learning system that considers only a narrow range of hypotheses when interpreting incoming data will, perforce, home in on the correct hypothesis more rapidly than a system with weaker inductive biases (again, assuming the correct hypothesis falls within that narrow initial range). Importantly, generic neural networks are extremely low-bias learning systems; they have many parameters (connection weights) and are capable of using these to fit a wide range of data. As dictated by the bias-variance trade-off, this means that neural networks, in the generic form employed in the first deep RL models (see Box 1), tend to be sample-inefficient, requiring large amounts of data to learn.

Together, these two factors — incremental parameter adjustment and weak inductive bias — explain the slowness of first-generation deep RL models. However, subsequent research has made clear that both factors can be mitigated, allowing deep RL to proceed in a much more sample-efficient manner. In what follows, we consider two specific techniques, one of which confronts the incremental parameter adjustment problem, and the other of which confronts the problem of weak inductive bias. In addition to their implications within the AI field, both of these techniques bear suggestive links with psychology and neuroscience, as we shall detail.
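To make the first source of slowness concrete, the short sketch below shows the kind of incremental, small-step-size parameter updating described above. It is a minimal illustration under our own assumptions (a linear value estimator, arbitrary feature and step sizes), not code from any published agent; the point is simply that each update moves the weights only slightly, so accurate estimates require many samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 8
alpha = 0.01                              # small step size, to limit catastrophic interference
w = np.zeros(n_features)                  # weights of a linear value estimator

def value(features, w):
    return features @ w

for step in range(10_000):                # many small updates are needed: sample inefficiency
    s = rng.normal(size=n_features)       # stand-in for a perceptual input embedding
    observed_return = 2.0 * s[0] + 1.0    # stand-in for the return actually experienced
    error = observed_return - value(s, w)
    w += alpha * error * s                # one tiny gradient step on the squared prediction error
```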

Episodic deep RL: Fast learning through episodic memory

If incremental parameter adjustment is one source of slowness in deep RL, then one way to learn faster might be to avoid such incremental updating. Naively increasing the learning rate governing gradient descent optimization leads to the problem of catastrophic interference. However, recent research shows that there is another way to accomplish the same goal, which is to keep an explicit record of past events, and use this record directly as a point of reference in making new decisions. This idea, referred to as episodic RL [29, 30], parallels ‘non-parametric’ approaches in machine learning [28] and resembles ‘instance-’ or ‘exemplar-based’ theories of learning in psychology [31, 32]. When a new situation is encountered and a decision must be made concerning what action to take, the procedure is to compare an internal representation of the current situation with stored representations of past situations. The action chosen is then the one associated with the highest value, based on the outcomes of the past situations that are most similar to the present. When the internal state representation is computed by a multi-layer neural network, we refer to the resulting algorithm as “episodic deep RL”. A more detailed explanation of the mechanics of episodic deep RL is presented in Box 2.

In episodic deep RL, unlike the standard incremental approach, the information gained through each experienced event can be leveraged immediately to guide behavior. However, whereas episodic deep RL is able to go ‘fast’ where earlier methods for deep RL went ‘slow,’ there is a twist to this story: episodic deep RL’s fast learning depends critically on slow incremental learning. This is the gradual learning of the connection weights that allows the system to form useful internal representations, or embeddings, of each new observation. The format of these representations is itself learned through experience, using the same kind of incremental parameter updating that forms the backbone of standard deep RL. Ultimately, the speed of episodic deep RL is enabled by this slower form of learning. That is, fast learning is enabled by slow learning.

This dependence of fast learning on slow learning is no coincidence. As we will argue below, it is a fundamental principle, applicable to psychology and neuroscience no less than AI. Before turning to a consideration of this general point, however, we examine its role in the second recently developed AI technique for rapid deep RL: meta-reinforcement learning.
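To make the decision procedure just described concrete, here is a bare-bones sketch of an episodic controller: it stores an embedding and an observed return for each past event, values each action by the returns of the k most similar stored situations, and picks the highest-valued action. The class name, Euclidean similarity measure and choice of k are our own illustrative assumptions; in episodic deep RL the embeddings would be produced by the slowly trained network discussed above.

```python
import numpy as np

class EpisodicController:
    def __init__(self, n_actions, k=5):
        self.memory = [[] for _ in range(n_actions)]   # per-action list of (embedding, return) pairs
        self.k = k

    def write(self, embedding, action, episodic_return):
        self.memory[action].append((np.asarray(embedding, dtype=float), float(episodic_return)))

    def q_estimate(self, embedding, action):
        entries = self.memory[action]
        if not entries:
            return 0.0                                  # unvisited action: neutral default value
        dists = np.array([np.linalg.norm(embedding - e) for e, _ in entries])
        nearest = np.argsort(dists)[: self.k]           # indices of the k most similar stored states
        return float(np.mean([entries[i][1] for i in nearest]))

    def act(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        q_values = [self.q_estimate(embedding, a) for a in range(len(self.memory))]
        return int(np.argmax(q_values))

# Usage: the embeddings would come from the slowly trained network described above.
controller = EpisodicController(n_actions=3)
controller.write(embedding=[0.1, 0.9], action=2, episodic_return=1.0)
chosen = controller.act([0.12, 0.88])                   # -> 2, the action valued via similar memories
```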

Meta-reinforcement learning: Speeding up deep RL by learning to learn

As discussed earlier, a second key source of slowness in standard deep RL, alongside incremental updating, is weak inductive bias. As formalized in the idea of the bias-variance tradeoff, fast learning requires the learner to go in with a reasonably sized set of hypotheses concerning the structure of the patterns that it will face. The narrower the hypothesis set, the faster learning can be. However, as foreshadowed earlier, there is a catch: a narrow hypothesis set will only speed learning if it contains the correct hypothesis. While strong inductive biases can accelerate learning, they will only do so if the specific biases the learner adopts happen to fit the material to be learned. This raises a new learning problem: how can the learner know which inductive biases to adopt?

One natural answer to this question is to draw on past experience. This, of course, occurs all the time in daily life. Consider, for example, the task of learning to use a new smart-phone. One’s past experience with smart-phones and other related devices will inform one’s hypotheses concerning how the new phone should work, and will guide exploration of the phone’s operation. These initial hypotheses correspond to the ‘bias’ in the bias-variance tradeoff, and they are responsible for the ability to quickly learn how to use the new phone. A learner without these biases (i.e., with higher ‘variance’) would consider a wider range of hypotheses about the phone’s operation, at the expense of learning speed.

The leveraging of past experience to accelerate new learning is referred to in machine learning as meta-learning [33]. Not surprisingly, however, the idea originates from psychology, where it has been called ‘learning to learn.’ In the first paper to use this term, Harlow [34] presented an experiment that captures the principle neatly. Monkeys were presented with two unfamiliar objects and permitted to grab one of them; beneath lay either a food reward or an empty well. The objects were then placed before the animal again, possibly left-right reversed, and the procedure was repeated for a total of six rounds. Two new and unfamiliar objects were then substituted in, and another six trials ensued with these objects; then another pair of objects, and so forth. Across many object pairs, the animals were able to figure out that a simple rule always held: one object yielded food and the other did not, regardless of left-right position. When presented with a new pair of objects, Harlow’s monkeys were able to learn in one shot which object was preferable, a simple but vivid example of learning to learn (see Box 3).

Returning now to AI, recent work has shown how learning to learn can be leveraged to speed up learning in deep RL. This general idea has been implemented in a variety of ways [35, 36]. However, one approach with particular relevance to neuroscience and psychology was proposed simultaneously by Wang [37] and Duan [38] and their colleagues. Here, a recurrent neural network is trained on a series of interrelated RL tasks. The weights in the network are adjusted very slowly, so that they can absorb what is common across tasks, but cannot change fast enough to support the solution of any single task. In this setting, something rather remarkable occurs. The activity dynamics of the recurrent network come to implement their own separate RL algorithm, which ‘takes responsibility’ for quickly solving each new task, based on knowledge accrued from past tasks (Figure 1). Effectively, one RL algorithm gives birth to another, hence the moniker ‘meta-reinforcement learning’.

Figure 1: Schematic of meta-reinforcement learning, illustrating the inner and outer loops of training. The outer loop trains the parameter weights θ, which determine the inner-loop learner (“Agent”, instantiated by a recurrent neural network) that interacts with an environment for the duration of the episode. On every cycle of the outer loop, a new environment is sampled from a distribution of environments that share some common structure.

As with episodic deep RL, meta-reinforcement learning again involves an intimate connection between fast and slow learning. The connections in the recurrent network are updated slowly across tasks, allowing general principles that span tasks to be ‘built into’ the dynamics of the recurrent network. The resulting network dynamics implement a new learning algorithm, which can solve new problems quickly because it has been endowed with useful inductive biases by the underlying process of slow learning (see Box 3). Once again, fast learning arises from, and is enabled by, slow learning.
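The two-loop structure sketched in Figure 1 can be illustrated with a short program. The code below is our own reconstruction under stated assumptions (network size, learning rate and a plain REINFORCE objective are arbitrary choices), not the published implementation of Wang et al. [37] or Duan et al. [38]: a recurrent policy is trained across a distribution of two-armed bandit tasks, with the weights frozen inside each episode so that only the LSTM's activity adapts, and a slow weight update applied by the outer loop when the episode ends.

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    def __init__(self, n_actions=2, hidden=48):
        super().__init__()
        self.core = nn.LSTMCell(n_actions + 1, hidden)   # input: one-hot previous action + previous reward
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, x, state):
        h, c = self.core(x, state)
        return torch.distributions.Categorical(logits=self.policy(h)), (h, c)

n_actions, hidden, trials = 2, 48, 100
agent = RecurrentAgent(n_actions, hidden)
opt = torch.optim.Adam(agent.parameters(), lr=7e-4)      # slow outer-loop weight updates

for episode in range(2000):                              # OUTER loop: iterate over sampled tasks
    p = torch.rand(1).item()
    arm_probs = torch.tensor([p, 1.0 - p])               # correlated bandit: payoff probabilities sum to one
    state = (torch.zeros(1, hidden), torch.zeros(1, hidden))
    x = torch.zeros(1, n_actions + 1)                    # no previous action/reward on the first trial
    log_probs, rewards = [], []
    for t in range(trials):                              # INNER loop: weights frozen, only activity adapts
        dist, state = agent(x, state)
        action = dist.sample()
        reward = torch.bernoulli(arm_probs[action])
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        x = torch.cat([nn.functional.one_hot(action, n_actions).float(),
                       reward.view(1, 1)], dim=1)
    episode_return = torch.stack(rewards).sum()          # undiscounted return for the whole episode
    loss = -(torch.stack(log_probs).sum() * episode_return)  # simple REINFORCE objective, no baseline
    opt.zero_grad()
    loss.backward()
    opt.step()
```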

Episodic meta-reinforcement learning

Importantly, the two techniques we have discussed above are not mutually exclusive. Indeed, recent work has explored an approach to integrating meta-learning and episodic control, capitalizing on their complementary benefits [39, 40]. In episodic meta-RL, meta-learning occurs within a recurrent neural network, as described in the previous section and Box 3. However, superimposed on this is an episodic memory system, the role of which is to reinstate patterns of activity in the recurrent network. As in episodic deep RL, the episodic memory catalogues a set of past events, which can be queried based on the current context. However, rather than linking contexts with value estimates, episodic meta-RL links them with stored activity patterns from the recurrent network’s internal or hidden units. These patterns are important because, through meta-RL, they come to summarize what the agent has learned from interacting with individual tasks (see Box 3 for details). In episodic meta-RL, when the agent encounters a situation that appears similar to one encountered in the past, it reinstates the hidden activations from the previous encounter, allowing previously learned information to immediately influence the current policy. In effect, episodic memory allows the system to recognize previously encountered tasks and retrieve stored solutions.

Through simulation work in bandit and navigation tasks, Ritter et al. [39] showed that episodic meta-RL, just like ‘vanilla’ meta-RL, learns strong inductive biases that enable it to rapidly solve novel tasks. More importantly, when presented with a previously encountered task, episodic meta-RL immediately retrieves and reinstates the solution it previously discovered, avoiding the need to re-explore. On the first encounter with a new task, the system benefits from the rapidity of meta-RL; on the second and later encounters, it benefits from the one-shot learning ability conferred by episodic control.
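The reinstatement mechanism can be sketched as follows. This is an illustration under our own assumptions rather than the architecture of Ritter et al. [39] (which, for example, learns its reinstatement gate): an episodic store maps task contexts to saved recurrent hidden states, and when a sufficiently similar context reappears, the stored activations are blended back into the current hidden state.

```python
import numpy as np

class EpisodicReinstatement:
    def __init__(self, threshold=0.9):
        self.contexts, self.hiddens = [], []            # parallel lists: context key -> stored hidden state
        self.threshold = threshold

    def store(self, context, hidden):
        self.contexts.append(np.asarray(context, dtype=float))
        self.hiddens.append(np.asarray(hidden, dtype=float))

    def reinstate(self, context, current_hidden, gate=0.5):
        context = np.asarray(context, dtype=float)
        current_hidden = np.asarray(current_hidden, dtype=float)
        if not self.contexts:
            return current_hidden
        keys = np.stack(self.contexts)
        sims = keys @ context / (np.linalg.norm(keys, axis=1) * np.linalg.norm(context) + 1e-8)
        best = int(np.argmax(sims))
        if sims[best] < self.threshold:                 # unfamiliar task: leave the working state alone
            return current_hidden
        # familiar task: blend the stored activations back into the current hidden state
        return gate * self.hiddens[best] + (1.0 - gate) * current_hidden
```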

Implications for neuroscience and psychology

As we discussed at the outset, the problem of sample inefficiency has been adduced as a reason for questioning the relevance of deep RL to learning in humans and other animals [23, 24]. One important implication of episodic deep RL and meta-RL, from the point of view of psychology and neuroscience, is that they undermine this argument, by showing that deep RL in fact need not be slow. This demonstration goes some distance toward defending deep RL as a potential model of human and animal learning. Beyond this general point, however, the specifics of episodic deep RL and meta-RL also point to interesting new hypotheses in psychology and neuroscience.

To begin with episodic deep RL, we have already noted the interesting connection between this and classical instance-based models of human memory, in which cognition operates over stored information about specific previous observations [31, 32]. Episodic RL offers a possible account of how instance-based processing might subserve reward-driven learning. Intriguingly, recent work on reinforcement learning in animals and humans has increasingly emphasized the potential contribution of episodic memory, with evidence accruing that estimates of state and action value are based on retrieved memories for specific past action-outcome observations [41, 42, 43, 44]. Episodic deep RL provides a forum for exploring how this general principle might scale to rich, high-dimensional sequential learning problems. Perhaps more fundamentally, it highlights the important role that representation learning and metric learning might play in RL based on episodic memory. Episodic deep RL thus suggests that it may be fruitful to investigate how fast episodic RL in humans and animals interacts with, and depends upon, slower learning processes. While this link between fast and slow learning has been engaged in work on memory per se, including work on multiple memory systems [26, 45, 46, 47], its role in reward-based learning has not been deeply explored [although see 48].

Moving to meta-reinforcement learning, this framework also has interesting potential implications for psychology and neuroscience. Indeed, Wang and colleagues [49] have proposed a very direct mapping from the elements of meta-RL to neural structures and functions. Specifically, they propose that slow, dopamine-driven synaptic change may serve to tune the activity dynamics of prefrontal circuits, in such a way that the latter come to implement an independent set of learning procedures (Box 4). Through a set of computer simulations, Wang and colleagues [49] demonstrated how meta-RL, interpreted in this way, can account for a diverse range of empirical findings from the behavioral and neurophysiological literatures (see Box 4).

Equally direct links connect episodic meta-RL with psychology and neuroscience. Indeed, the reinstatement mechanism involved in episodic meta-RL was directly inspired by neuroscience data indicating that episodic memory circuits can serve to reinstate patterns of activation in cerebral cortex, including areas supporting working memory [see 40]. Ritter and colleagues [39] show how such a function could itself be configured through reinforcement learning, giving rise to a system that can strategically reinstate information about tasks encountered earlier [see also 50, 51, 52]. In addition to the initial inspiration it draws from neuroscience, this work links back to biology by providing a parsimonious explanation for recently reported interactions between episodic and model-based control in human learning [53]. On a broader level, the work reported by Ritter and colleagues [39] illustrates how meta-learning can operate over architectures involving multiple memory systems [26], slowly tuning those systems and their interactions in a way that allows them to collectively support rapid learning.

Fast and slow RL: Broader implications

In discussing both episodic RL and meta-RL, we have emphasized the role of ‘slow’ learning in enabling fast, sample-efficient learning. In meta-RL, as we have seen, the role of slow, weight-based learning is to establish inductive biases that can guide inference and thus support rapid adaptation to new tasks. The role of slow, incremental learning in episodic RL can be viewed in related terms. Episodic RL inherently depends on judgments concerning resemblances between situations or states. Slow learning shapes the way that states are internally represented, and thus puts in place a set of inductive biases concerning which states are most closely related. Looking at episodic RL more closely, one might also say that there is an inductive bias built into the learning architecture. In particular, episodic RL inherently assumes a kind of smoothness principle: Similar states will generally call for similar actions. Rather than being learned, this inductive bias is wired into the structure of the learning system that defines episodic RL. In current AI parlance, this is a case of ‘architectural’ or ‘algorithmic bias,’ in contradistinction to ‘learned bias,’ as seen in meta-RL.

A wide range of current AI research is focused on finding useful inductive biases to speed learning, whether this is accomplished via learning or through the direct hand-design of architectural or algorithmic biases. Indeed, the latter strategy has itself been largely responsible for the current resurgence of neural networks in AI. Convolutional neural networks, which provided the original impulse for this resurgence, build in a very specific architectural bias, connected with translational invariance in image recognition. Over the past few years, however, an increasingly large portion of AI research has focused, either explicitly or implicitly, on the problem of inductive bias [see, e.g., 54, 55, 56].

At a high level, these developments of course parallel some long-standing concerns in psychology. As we have already noted, the idea that inductive biases may be acquired through learning in fact originally derives from psychology [34], and has remained an intermittent topic of psychological research [see, e.g., 23]. However, meta-learning in neural networks may provide a new context in which to explore the mechanisms and dynamics of such learning-to-learn processes, especially in the reinforcement learning setting. Psychology, and in particular developmental psychology, has also long considered the possibility that certain inductive biases are ‘built in,’ that is, innately specified [57]. However, the more mechanistic notion of architectural bias, and of biases built into neural network learning algorithms, has been less widely considered [although see 58, 59, 60]. Here again, current methods for deep learning and deep RL provide a toolkit that may be helpful in further exploration, perhaps, for example, in considering the implications of growing information about brain connectivity [61, 62, 63].

It is worth noting that, whereas AI work draws a sharp distinction between inductive biases that are acquired through learning and biases that are ‘wired in’ by hand, in a biological context a more general, unifying perspective is available. Specifically, one can view architectural and algorithmic biases as arising through a different learning process, driven by evolution. Here, evolution is the ‘slow’ learning process, gradually sculpting architectural and algorithmic biases that allow faster lifetime learning. On this account, meta-learning plays out not only within a single lifespan, but also over evolutionary time. Interestingly, this view implies that evolution does not select for a truly ‘general-purpose’ learning algorithm, but for an algorithm that exploits the regularities of the particular environments in which brains have evolved.

In this context, recent developments in AI may once again prove handy in exploring implications for neuroscience and psychology. Recent AI work has delved increasingly into methods for building agent architectures, as well as reward functions, through evolutionary algorithms inspired by natural selection [15, 64, 65, 66, 67]. Whether it focuses on hand-engineering or evolution, AI work on architectural and algorithmic bias furnishes us with a new laboratory for investigating how evolution might sculpt the nervous system to support efficient learning. Possibilities suggested by AI research include constraints on the initial pattern of network connectivity [36]; shaping of synaptic learning rules [35, 68, 69]; and factors encouraging the emergence of disentangled or compositional representations [54, 70, 71] and internal predictive models [51].
Such work, when viewed through the lens of psychology, neuroscience, and evolutionary and developmental studies, yields a picture in which learning operates at many time-scales at once – literally running from millennia to milliseconds – with the slower time-scales yielding biases that enable faster learning at the levels above, and with all of this playing out across evolutionary, developmental and diurnal time-scales along a trajectory that is strongly influenced by the structure of the environment [see 72]. From this perspective, evolution shapes architecture and algorithm, which embed inductive biases; these then shape lifetime learning, which itself develops further inductive biases based on experience. Within this picture, evolution appears as the ‘slowest’ learning process in a cascade, with each layer supporting the next, ‘faster’ level by furnishing it with inductive biases.

As many readers will recognize, this meta-learning-based perspective is not entirely new to cognitive science. As pointed out earlier, the idea of learning-to-learn has held a place in psychology for decades [34, 73], and hierarchical Bayesian models of learning have long emphasized closely related ideas, with prior probabilities providing the medium for inductive biases [23, 74, 75, 76, 77]. Research on biological evolution has also touched on related ideas [66]. Despite these precedents, the recent surge of AI research investigating the roles of slow and fast learning in neural networks, and in reinforcement learning, provides a genuinely new set of perspectives, and a new set of opportunities.

Concluding remarks and future directions

The rapidly developing field of deep reinforcement learning holds great interest for psychology and neuroscience, given its integrated focus on representation learning and goal-directed behavior. In the present review, we have described recently emerging forms of deep RL that overcome the apparent problem of sample inefficiency, allowing deep RL to work ‘fast.’ These techniques not only reinforce the potential relevance of deep RL to psychology and neuroscience, countering some recent skeptical assessments; they also enrich the potential connections by establishing links to themes such as episodic memory and learning to learn. Furthermore, ideas arising from deep RL research increasingly suggest concrete and specific directions for new research in psychology and neuroscience.

As we have stressed, a key implication of recent work on sample-efficient deep RL is that where fast learning occurs, it necessarily relies on slow learning, which establishes the representations and inductive biases that enable fast learning. This computational dialectic provides a theoretical framework for studying multiple memory systems in the brain, as well as their evolutionary origins. However, human learning likely involves multiple interacting processes beyond those discussed in this review, and we expect that any deep RL model will need to integrate all of these in order to approach human-like learning. At a broader level, understanding the relationship between fast and slow in reinforcement learning provides a compelling, organizing challenge for psychology and neuroscience. Indeed, this may be a key area where AI, neuroscience and psychology can synergistically interact, as has always been the ideal in cognitive science.

Outstanding questions

• Can AI methods for sample-efficient deep RL scale to the kinds of rich task environments humans cope with? In particular, can these methods engender rich abstractions of the kind that underlie human intelligence? What kind of training environments might be necessary for this?

• Does flexible, sample-efficient human learning arise from mechanisms related to those currently being explored in AI? If so, what is their neural implementation? Does something like gradient descent learning – so important for current AI techniques – occur in the brain, or is the same role played by some different mechanism?

• What inductive biases are most important for learning in the environment that human learners inhabit? To what extent were these acquired through evolution and genetically or developmentally specified, and to what extent are they acquired through learning?

• One thing that makes human learners so efficient is that we are active, strategic information-seekers. What are the principles that structure and motivate human exploration, and how can we replicate those in AI systems?

Acknowledgements

The authors were funded by DeepMind.

Box 1: Deep reinforcement learning

Figure I: Representative examples of deep RL. (a) A network mapping a backgammon position to a predicted probability of winning (TD-gammon); panels (b) and (c) are described in the text below.

Reinforcement learning centers on the problem of learning a behavioral policy — a mapping from states or situations to actions — which maximizes cumulative long-term reward [12]. In simple settings, the policy can be represented as a look-up table, listing the appropriate action for any state. In richer environments, however, this kind of simple listing is infeasible, and the policy must instead be encoded implicitly as a parameterized function. Pioneering work in the 1990s showed how this function could be approximated using a multi-layer (or deep) neural network [78, 79], allowing gradient-descent learning to discover rich, non-linear mappings from perceptual inputs to actions (see panel a and below). However, technical challenges prevented the integration of deep neural networks with RL until 2015, when breakthrough work demonstrated how deep RL could be made to work in complex domains such as Atari video games [13] (see panel b and below). Since then, rapid progress has been made toward improving and scaling deep RL [80], allowing its application to complex task domains such as Go [16] and capture the flag [81]. In many cases, the later advances have involved integrating deep RL with architectural and algorithmic complements, such as tree search [16] or slot-based, episodic-like memory [52] (see panel c and below). Other developments have focused on the goal of learning speed, allowing deep RL to make progress based on just a few observations, as reviewed in the main text.

The figure illustrates the evolution of deep RL methods, starting in panel a with Tesauro’s groundbreaking backgammon-playing system ‘TD-gammon’ [78]. This centered on a neural network which took as input a representation of the board and learned to output an estimate of the ‘state value,’ defined as expected cumulative future rewards, here equal simply to the estimated probability of eventually winning a game from the current position. Panel b shows the Atari-playing DQN network reported by Mnih and colleagues [13]. Here, a convolutional neural network (see reference [3]) takes screen pixels as input and learns to output joystick actions. Panel c shows a schematic representation of a state-of-the-art deep RL system reported by Wayne and colleagues [51]. A full description of the detailed ‘wiring’ of this RL agent is beyond the scope of the present paper (but can be found in reference [51]). However, as the figure indicates, the architecture comprises multiple modules, including a neural network that leverages an episodic-like memory to predict upcoming events, which ‘speaks’ to a reinforcement-learning module that selects actions based on the predictor module’s current state. The system learns, among other tasks, to perform goal-directed navigation in maze-like environments, as shown in the figure.
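To show what ‘encoding the policy implicitly as a parameterized function’ looks like in practice, the sketch below builds a small DQN-style network that maps a stack of screen frames to one value per joystick action and then acts greedily. The layer sizes, frame shape and action count are illustrative assumptions, not the architecture reported by Mnih and colleagues [13].

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Maps a stack of preprocessed frames to one value estimate per action."""
    def __init__(self, n_actions, in_channels=4):        # e.g. four stacked grayscale frames
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)             # one Q-value per joystick action

    def forward(self, pixels):
        return self.head(self.encoder(pixels))

q_net = PixelQNetwork(n_actions=6)
screens = torch.rand(1, 4, 84, 84)                       # a batch containing one preprocessed observation
greedy_action = q_net(screens).argmax(dim=1)             # act greedily with respect to the learned values
```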

Box 2: Episodic deep RL

Figure I: Illustration of an example episodic RL algorithm. The memory store pairs each encountered state with the discounted sum of rewards that followed it; the value of a new state is estimated from the stored reward sums, weighted by the similarity (sim.) between each stored state and the new state.

Episodic reinforcement learning (RL) algorithms estimate the value of actions and states using episodic memories [30, 43, 44]. Consider, for example, the episodic valuation algorithm depicted in Figure I, wherein the agent stores each encountered state along with the discounted sum of rewards obtained during the next n time steps. These two stored items comprise an episodic memory of the encountered state and the reward that followed. To estimate the value of a new state, the agent computes a sum of the stored discounted rewards, weighted by the similarity (sim.) between each stored state and the new state. This algorithm can be extended to estimate action values by recording the actions taken along with the states and reward sums in the memory store, then querying the store for only those memories in which the to-be-evaluated action was taken. In fact, Blundell and colleagues [82] used such an episodic RL algorithm to achieve strong performance on Atari games. The success of episodic RL depends on the state representations used to compute state similarity. In a follow-up to that work, Pritzel and colleagues [29] showed that performance could be improved by gradually shaping these state representations using gradient-descent learning. These results demonstrated strong performance and state-of-the-art data efficiency on the 57 games in the Arcade Learning Environment [29], showcasing the benefits of combining slow (representation) learning and fast (value) learning.
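A hedged reconstruction of the valuation scheme described in this box follows: each memory pairs a state with the discounted sum of rewards over the next n steps, and a new state is valued as a similarity-weighted average of the stored returns. The Gaussian similarity kernel and its bandwidth are our own illustrative choices rather than details given in the box.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted sum of the rewards obtained over the next n time steps."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

class EpisodicValueStore:
    def __init__(self, bandwidth=1.0):
        self.states, self.returns = [], []
        self.bandwidth = bandwidth                        # width of the Gaussian similarity kernel

    def write(self, state, future_rewards):
        self.states.append(np.asarray(state, dtype=float))
        self.returns.append(discounted_return(future_rewards))

    def value(self, new_state):
        states = np.stack(self.states)
        dists = np.linalg.norm(states - np.asarray(new_state, dtype=float), axis=1)
        sims = np.exp(-dists ** 2 / (2 * self.bandwidth ** 2))   # similarity weights over stored states
        return float(sims @ np.array(self.returns) / (sims.sum() + 1e-8))

store = EpisodicValueStore()
store.write(state=[0.0, 1.0], future_rewards=[1.0, 0.0, 1.0])
estimate = store.value([0.1, 0.9])                        # similarity-weighted value of a new state
```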

Box 3: Meta-reinforcement learning

Figure I: Bandits and the Harlow task. (a) Example behavior of meta-RL on two-armed bandit problems at two difficulty levels (arm reward probabilities of [25%, 75%] and [40%, 60%]). (b) Cumulative regret per episode on correlated (blue) and independent (green) bandit problems, compared to the performance of standard machine-learning algorithms (Gittins indices, Thompson sampling, UCB). (c) The Harlow task: illustration of the task and comparison of animal and agent behavior (performance across trials 1-6) at the start and end of training.

Meta-learning occurs when one learning system progressively adjusts the operation of a second learning system, such that the latter operates with increasing speed and efficiency [83, 84, 85, 86, 87]. This scenario is often described in terms of two “loops” of learning: an “outer loop” that uses its experience over many task contexts to gradually adjust parameters that govern the operation of an “inner loop,” so that the inner loop can adjust rapidly to new tasks (see Figure 1). Meta-reinforcement learning refers to the case where both the inner and outer loop implement RL algorithms, learning from reward outcomes and optimizing toward behaviors that yield maximal reward.

As discussed in the main text, one implementation of this idea employs a form of recurrent neural network commonly used in machine learning: long short-term memory (LSTM) units [83, 88]. In this setup, the ‘outer loop’ of learning involves tuning of the network’s connection weights by a standard deep RL algorithm. Over the course of many tasks, this slowly gives rise to an ‘inner loop’ learning algorithm, which is implemented in the activity dynamics of the recurrent network [37, 38, 50, 83, 88].

A simple illustration involves training a recurrent neural network on a ‘two-armed bandit’ task. On every trial the agent must select between two actions (pull the left arm or the right arm), and is rewarded based on latent parameters that govern the payoff probability for each action. During training, a series of bandit problems is presented, each with a randomly sampled pair of payoff parameters. The agent interacts with one bandit problem for a fixed number of steps and then moves on to another. The challenge for the agent, in each new bandit problem, is to balance exploration of the two alternatives against exploitation of the information gained so far, seeking to maximize cumulative reward. Meta-RL learns to solve this problem, learning to explore more on a difficult bandit problem whose arm reward probabilities are more closely matched ([40%, 60%] vs. [25%, 75%]; compare the lower panel to the upper panel of Figure Ia). After training on a number of problems, the network can – even with its connection weights fixed – explore a new bandit problem and converge on the richer arm, using a strategy that compares favorably with hand-crafted machine-learning algorithms (see Figure Ib).

Critically, the learning algorithm that arises in the network’s activity dynamics is not only independent of the ‘outer loop’ RL algorithm that trains the network’s weights; it can also work better than that outer-loop algorithm, because it learns inductive biases that align with the tasks on which the system is trained. An illustration is provided in Figure Ib (from Wang et al. [49]). The green line depicts performance (cumulative regret; lower is better) after training on a set of bandit tasks across which the arm reward probabilities are sampled independently (“Independent bandits”), whereas the blue line represents training on a distribution in which the arm probabilities are linked (“Correlated bandits”; specifically, they always sum to one). As the figure shows, training on correlated bandits leads the algorithm to home in on the superior arm more rapidly. This reflects an adaptive inductive bias, implemented in the network’s recurrent dynamics.

Figure Ic illustrates a richer application of meta-RL. As described in the main text, Harlow [34] introduced the idea of “learning to learn” using a task in which monkeys chose between pairs of unfamiliar objects, eventually showing one-shot learning based on consistent structure in the task. Specifically, for each pair of objects, exactly one is consistently associated with reward, regardless of location. The monkeys eventually learned to randomly select an object on trial 1, and then to use the reward outcome to perform perfectly on trials 2-6 (Figure Ic, middle panel). In recent computational modeling work, Wang and colleagues [49] showed that meta-RL, employing a recurrent neural network, gives rise to the same pattern of one-shot learning (Figure Ic, right panel).
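For readers who want to see exactly what distinguishes the two training regimes compared in Figure Ib, the sketch below samples bandit tasks from both distributions, under our own naming: in the independent case each arm's payoff probability is drawn separately, whereas in the correlated case the two probabilities always sum to one, so feedback from one arm is informative about the other; this is the structure that the meta-learned inductive bias exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bandit(correlated: bool):
    """Draw one two-armed bandit task from either distribution."""
    if correlated:
        p_left = rng.uniform()
        return np.array([p_left, 1.0 - p_left])   # structure the inner-loop learner can exploit
    return rng.uniform(size=2)                     # arms carry no shared structure

def pull(arm_probs, arm):
    return rng.binomial(1, arm_probs[arm])         # Bernoulli payoff for one pull

task = sample_bandit(correlated=True)
first_reward = pull(task, arm=0)                   # in the correlated case, also informative about arm 1
```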

Box 4: Meta-RL and rapid learning in the brain

Figure I: (a) Individual PFC units code for decision variables; (b) units in meta-RL acquire similar coding properties. Bars show the proportion of units (a) or correlation (b) for task-related variables: last action, last reward, the interaction of last action and last reward, and current value. (c) Model-based RPEs in striatum; (d) model-based RPEs in meta-RL: like RPEs measured in the brain with fMRI, RPEs in meta-RL contain model-based information. Figures from Wang et al. [49].

Wang and colleagues [49] proposed that meta-RL, as described in Box 3, can model learning in biological brains. They argue that recurrent networks centered on prefrontal cortex (PFC) implement the inner loop of learning, and that this inner-loop algorithm is slowly sculpted by an outer loop of dopamine-driven synaptic plasticity.

We first focus on the inner loop. PFC is central to rapid learning [46, 89], and neurons in PFC code for variables that underpin such learning. For example, Tsutsui et al. [90] recorded from primate dorsolateral PFC (dlPFC) during a foraging task in which environmental variables are continually changing. They found that individual neurons coded not only for the value of the current option, but also for the previous action taken, the previous reward received, and the interaction of previous action and previous reward (Figure IA). These are key variables for implementing an effective learning policy in this task. Wang et al. [49] trained meta-RL (see Box 3) on the same task. They found that, over the course of training, the artificial recurrent network came to implement a fast learning policy whose behavior mimicked that of the animals. Moreover, individual units in this trained network had coding properties remarkably similar to those of primate dlPFC neurons (Figure IB). Wang et al. [49] proposed that meta-learning over the course of an animal’s lifetime shapes the recurrent networks in PFC to have tuning properties that are useful for naturalistic learning tasks.

We now turn to the outer loop. Classically, midbrain dopamine neurons are thought to carry the reward prediction error (RPE) signal of temporal difference learning [91, 92, 93]. In this standard theory, dopamine drives incremental adjustments to cortico-striatal synapses, and these adjustments predispose the animal to repeat reinforced actions. This model-free learning system is often viewed as complementary to a model-based system residing in mostly distinct brain areas [94, 95, 96]. However, the standard theory is complicated by the observation that dopamine carries model-based information [97, 98, 99]. For example, in a task designed to isolate model-based and model-free signals, Daw et al. [97] found that, surprisingly, signals in ventral striatum – thought to be the strongest fMRI correlate of dopaminergic prediction errors – contained robust model-based information (Figure IC).

Wang et al. [49] trained meta-RL on a variant of this task [100]. In terms of behavior, they found that the trained recurrent network instantiated in its activity dynamics a robustly model-based learning procedure – despite the fact that the training procedure of meta-RL was model-free. When they examined meta-RL’s RPEs, they found that these contained significant model-based information (Figure ID). Much work has been devoted to understanding how the brain learns and uses models [46, 101, 102, 103]. Meta-RL provides a single elegant explanation: dopamine-dependent model-free learning automatically shapes recurrent networks into a model-based learning algorithm.

In summary, meta-reinforcement learning may be an important principle of brain organization. In this model, slow, incremental learning processes shape recurrent brain circuits into algorithms that exploit consistent structure in the environment to learn efficiently [for related work, see 87, 104, 105].
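For reference, the classical model-free temporal-difference reward prediction error mentioned above can be written as follows (standard textbook notation, not an equation taken from the box): the error compares the reward received plus the discounted value of the next state against the value of the current state.

```latex
% Classical temporal-difference reward prediction error (model-free form).
% r_t: reward at time t; \gamma: discount factor; V: state-value estimate.
\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)
```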

Glossary

Neural network: A learnable set of weights and biases arranged in layers which process an input in order to produce an output. To learn more, see McClelland and Rumelhart’s seminal primer [106].

Hidden layer: A layer of a neural network in between the input and output layers.

Deep neural network: A neural network with one or (typically) more hidden layers.

Embedding: A learned representation residing in a layer of a neural network.

Non-parametric: In a non-parametric model, the number of parameters is not fixed, and can grow as more data is provided to the model.

Recurrent neural network: A neural network that runs at each time step in a sequence, passing its hidden layer activations from each step to the next.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[4] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[6] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125, 2016.

[7] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10:94, 2016.

[8] H Francis Song, Guangyu R Yang, and Xiao-Jing Wang. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife, 6:e21492, 2017.

[9] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356, 2016.

[10] David Sussillo, Mark M Churchland, Matthew T Kaufman, and Krishna V Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025, 2015.

[11] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):e1003915, 2014.

[12] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[14] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

[15] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.

[16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[17] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[18] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[19] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[20] Richard S Sutton and Andrew G Barto. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88(2):135, 1981.

[21] Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.

[22] Pedro A Tsividis, Thomas Pouncy, Jacqueline L Xu, Joshua B Tenenbaum, and Samuel J Gershman. Human learning in Atari. 2017.

[23] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350, 2015.

[24] Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.

[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[26] Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.

[27] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[28] Christopher M Bishop. Pattern recognition and machine learning (Information Science and Statistics). Springer-Verlag, New York, 2006.

[29] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.

[30] Samuel J Gershman and Nathaniel D Daw. Reinforcement learning and episodic memory in humans and animals: An integrative framework. Annual Review of Psychology, 68, 2017.

[31] Gordon D Logan. Toward an instance theory of automatization. Psychological Review, 95(4):492, 1988.

[32] Edward E Smith and Douglas L Medin. Categories and concepts, volume 9. Harvard University Press, Cambridge, MA, 1981.

[33] Tom Schaul and Jürgen Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010. doi: 10.4249/scholarpedia.4650, revision #91489.

[34] Harry F Harlow. The formation of learning sets. Psychological Review, 56(1):51, 1949.

[35] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[36] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[37] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[38] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[39] Samuel Ritter, Jane X Wang, Zeb Kurth-Nelson, Siddhant M Jayakumar, Charles Blundell, Razvan Pascanu, and Matthew Botvinick. Been there, done that: Meta-learning with episodic recall. International Conference on Machine Learning (ICML), 2018.

[40] Samuel Ritter, Jane X Wang, Zeb Kurth-Nelson, and Matthew M Botvinick. Episodic control as meta-reinforcement learning. bioRxiv, page 360537, 2018.

[41] Samuel J Gershman and Nathaniel D Daw. Reinforcement learning and episodic memory in humans and animals: An integrative framework. Annual Review of Psychology, 68:101–128, 2017.

[42] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. In Advances in Neural Information Processing Systems, pages 889–896, 2008.

[43] Aaron M Bornstein, Mel W Khaw, Daphna Shohamy, and Nathaniel D Daw. Reminders of past choices bias decisions for reward in humans. Nature Communications, 8:15958, 2017.

[44] Aaron M Bornstein and Kenneth A Norman. Reinstated episodic context guides sampling-based decisions for reward. Nature Neuroscience, 20(7):997, 2017.

[45] Dorothy Tse, Rosamund F Langston, Masaki Kakeyama, Ingrid Bethus, Patrick A Spooner, Emma R Wood, Menno P Witter, and Richard GM Morris. Schemas and memory consolidation. Science, 316(5821):76–82, 2007.

[46] Timothy EJ Behrens, Timothy H Muller, James CR Whittington, Shirley Mark, Alon B Baram, Kimberly L Stachenfeld, and Zeb Kurth-Nelson. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron, 100(2):490–509, 2018.

[47] James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.

[48] Ian C Ballard, Anthony D Wagner, and Samuel M McClure. Hippocampal pattern separation supports reinforcement learning. bioRxiv, page 293332, 2018.

[49] Jane X Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 21(6):860, 2018.

[50] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[51] Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.

[52] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.

[53] Oliver Vikbladh, Daphna Shohamy, and Nathaniel Daw. Episodic contributions to model-based reinforcement learning. In Annual Conference on Cognitive Computational Neuroscience, CCN, 2017.

[54] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[55] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.

[56] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[57] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.

[58] Matthew M Botvinick. Multilevel structure in behaviour and in the brain: a model of Fuster's hierarchy. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 362(1485):1615–1626, 2007.

[59] Nicolas P Rougier, David C Noelle, Todd S Braver, Jonathan D Cohen, and Randall C O'Reilly. Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20):7338–7343, 2005.

[60] David C Plaut. Graded modality-specific specialisation in semantics: A computational account of optic aphasia. Cognitive Neuropsychology, 19(7):603–639, 2002.

[61] Olaf Sporns, Dante R Chialvo, Marcus Kaiser, and Claus C Hilgetag. Organization, development and function of complex brain networks. Trends in Cognitive Sciences, 8(9):418–425, 2004.

[62] Ed Bullmore and Olaf Sporns. The economy of brain network organization. Nature Reviews Neuroscience, 13(5):336, 2012.

[63] Dharmendra S Modha and Raghavendra Singh. Network architecture of the long-distance pathways in the macaque brain. Proceedings of the National Academy of Sciences, 107(30):13485–13490, 2010.

[64] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

[65] Chrisantha T Fernando, Jakub Sygnowski, Simon Osindero, Jane Wang, Tom Schaul, Denis Teplyashin, Pablo Sprechmann, Alexander Pritzel, and Andrei A Rusu. Meta learning by the Baldwin effect. arXiv preprint arXiv:1806.07917, 2018.

[66] Geoffrey E Hinton and Steven J Nowlan. How learning can guide evolution. Complex Systems, 1(3):495–502, 1987.

[67] Jane X Wang, Edward Hughes, Chrisantha Fernando, Wojciech M Czarnecki, Edgar A Duenez-Guzman, and Joel Z Leibo. Evolving intrinsic motivations for altruistic behavior. arXiv preprint arXiv:1811.05931, 2018.

[68] Y Bengio, S Bengio, and J Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, page 969. IEEE, 1991.

[69] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.

[70] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

[71] Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017.
[72] Matthew Botvinick, Ari Weinstein, Alec Solway, and Andrew Barto. Reinforcement learning, efficient coding, and the statistics of natural tasks. Current Opinion in Behavioral Sciences, 5:71–77, 2015.
[73] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[74] Thomas L. Griffiths, Charles Kemp, and Joshua B. Tenenbaum. Bayesian Models of Cognition, pages 59–100. Cambridge Handbooks in Psychology. Cambridge University Press, 2008.

[75] Thomas L Griffiths, Nick Chater, Charles Kemp, Amy Perfors, and Joshua B Tenenbaum. Probabilistic models of cognition: Exploring representations and inductive biases. Trends in cognitive sciences, 14(8):357–364, 2010.
[76] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
[77] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
[78] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[79] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
[80] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
[81] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[82] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
[83] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.

[84] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
[85] Jonathan Baxter. Theoretical models of learning to learn. In Learning to learn, pages 71–94. Springer, 1998.

[86] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Simple principles of metalearning. Technical report, SEE, 1996.
[87] Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural Networks, 16(1):5–9, 2003.
[88] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[89] Matthew FS Rushworth, MaryAnn P Noonan, Erie D Boorman, Mark E Walton, and Timothy E Behrens. Frontal cortex and reward-guided learning and decision-making. Neuron, 70(6):1054–1069, 2011.

[90] Ken-Ichiro Tsutsui, Fabian Grabenhorst, Shunsuke Kobayashi, and Wolfram Schultz. A dynamic code for economic object valuation in prefrontal cortex neurons. Nature communications, 7:12554, 2016.
[91] P Read Montague, Peter Dayan, and Terrence J Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of neuroscience, 16(5):1936–1947, 1996.
[92] Paul W Glimcher. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654, 2011.
[93] Mitsuko Watabe-Uchida, Neir Eshel, and Naoshige Uchida. Neural circuitry of reward prediction error. Annual review of neuroscience, 40:373–394, 2017.

[94] Ray J Dolan and Peter Dayan. Goals and habits in the brain. Neuron, 80(2):312–325, 2013.
[95] Nathaniel D Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704, 2005.

[96] Bernard W Balleine and Anthony Dickinson. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 37(4-5):407–419, 1998.
[97] Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6):1204–1215, 2011.

[98] Brian F Sadacca, Joshua L Jones, and Geoffrey Schoenbaum. Midbrain dopamine neurons compute inferred and cached value prediction errors in a common framework. eLife, 5:e13665, 2016.
[99] Melissa J Sharpe, Chun Yun Chang, Melissa A Liu, Hannah M Batchelor, Lauren E Mueller, Joshua L Jones, Yael Niv, and Geoffrey Schoenbaum. Dopamine transients are sufficient and necessary for acquisition of model-based associations. Nature neuroscience, 20(5):735, 2017.

[100] Kevin J Miller, Matthew M Botvinick, and Carlos D Brody. Dorsal hippocampus contributes to model-based planning. Nature neuroscience, 20(9):1269, 2017.
[101] Jan Gläscher, Nathaniel Daw, Peter Dayan, and John P O'Doherty. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4):585–595, 2010.

[102] Nils Kolling, Marco K Wittmann, Tim EJ Behrens, Erie D Boorman, Rogier B Mars, and Matthew FS Rushworth. Value, search, persistence and model updating in anterior cingulate cortex. Nature neuroscience, 19(10):1280, 2016.
[103] Randall C O'Reilly and Michael J Frank. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural computation, 18(2):283–328, 2006.

[104] Shin Ishii, Wako Yoshida, and Junichiro Yoshimoto. Control of exploitation–exploration meta-parameter in reinforcement learning. Neural networks, 15(4-6):665–687, 2002.
[105] Mehdi Khamassi, Pierre Enel, Peter Ford Dominey, and Emmanuel Procyk. Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters. In Progress in brain research, volume 202, pages 441–464. Elsevier, 2013.
[106] James L McClelland and David E Rumelhart. Explorations in parallel distributed processing: A handbook of models, programs, and exercises. MIT press, 1989.
