Reinforcement learning: The Good, The Bad and The Ugly
Peter Dayan and Yael Niv
Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

Addresses
a UCL, United Kingdom
b Psychology Department and Princeton Neuroscience Institute, Princeton University, United States
Corresponding authors: Dayan, Peter ([email protected]) and Niv, Yael ([email protected])

Current Opinion in Neurobiology 2008, 18:1–12
This review comes from a themed issue on Cognitive neuroscience
Edited by Read Montague and John Assad
© 2008 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.conb.2008.08.003

Introduction
Reinforcement learning (RL) [1] studies the way that natural and artificial systems can learn to predict the consequences of and optimize their behavior in environments in which actions lead them from one state or situation to the next, and can also lead to rewards and punishments. Such environments arise in a wide range of fields, including ethology, economics, psychology, and control theory. Animals, from the most humble to the most immodest, face a range of such optimization problems [2], and, to an apparently impressive extent, solve them effectively. RL, originally born out of mathematical psychology and operations research, provides qualitative and quantitative computational-level models of these solutions.

However, the reason for this review is the increasing realization that RL may offer more than just a computational, 'approximate ideal learner' theory for affective decision-making. RL algorithms, such as the temporal difference (TD) learning rule [3], appear to be directly instantiated in neural mechanisms, such as the phasic activity of dopamine neurons [4]. That RL appears to be so transparently embedded has made it possible to use it in a much more immediate way to make hypotheses about, and retrodictive and predictive interpretations of, a wealth of behavioral and neural data collected in a huge range of paradigms and systems. The literature in this area is by now extensive, and has been the topic of many recent reviews (including [5–9]). This is in addition to a rapidly accumulating literature on the partly related questions of optimal decision-making in situations involving slowly accumulating information, or social factors such as games [10–12]. Thus here, after providing a brief sketch of the overall RL scheme for control (for a more extensive review, see [13]), we focus only on some of the many latest results relevant to RL and its neural instantiation. We categorize these recent findings into those that fit comfortably with, or flesh out, accepted notions (playfully, 'The Good'), some new findings that are not as snugly accommodated but suggest the need for extensions or modifications ('The Bad'), and finally some key areas whose relative neglect by the field is threatening to impede its further progress ('The Ugly').

The reinforcement learning framework
Decision-making environments are characterized by a few key concepts: a state space (states are such things as locations in a maze, the existence or absence of different stimuli in an operant box, or board positions in a game), a set of actions (directions of travel, presses on different levers, and moves on a board), and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions), and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning). Typically, the decision-maker starts off not knowing the rules of the environment (the transitions and outcomes engendered by the actions), and has to learn or sample these from experience.
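To make these ingredients concrete, here is a minimal sketch (ours, not the authors'; all state and action names, probabilities, and utilities are invented for illustration) of such a decision-making environment written as plain Python tables, together with a sampling step of the kind a learner would actually experience.

```python
import random

# A toy environment in the spirit of the description above: a few states,
# a few actions, stochastic transitions, and outcomes carrying numerical
# utilities. Everything here is illustrative.

STATES = ["start", "lever_box", "magazine", "satiety"]
ACTIONS = ["go_left", "go_right", "press_lever", "collect"]

# TRANSITIONS[state][action] -> list of (probability, next_state, utility)
TRANSITIONS = {
    "start": {
        "go_left":  [(1.0, "lever_box", 0.0)],
        "go_right": [(1.0, "magazine", 0.0)],
    },
    "lever_box": {
        "press_lever": [(0.8, "magazine", 0.0), (0.2, "lever_box", 0.0)],
    },
    "magazine": {
        "collect": [(1.0, "satiety", 1.0)],   # food outcome, utility 1
    },
    "satiety": {},                            # terminal: no actions available
}

def step(state, action):
    """Sample one experienced transition: return (next_state, utility)."""
    outcomes = TRANSITIONS[state][action]
    r = random.random()
    cumulative = 0.0
    for prob, next_state, utility in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, utility
    return outcomes[-1][1], outcomes[-1][2]   # numerical safety net

if __name__ == "__main__":
    print(step("start", "go_left"))   # e.g. ('lever_box', 0.0)
```

The learner is assumed not to see the TRANSITIONS table itself, only samples from step; whether it tries to estimate such a table or bypasses it altogether is exactly the model-based/model-free distinction taken up below.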
In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally, to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards, where outcomes received in the far future are discounted compared with outcomes received more immediately. It is the long-term nature of these goals that necessitates consideration not only of immediate outcomes but also of state transitions, and makes choice interesting and difficult. In terms of RL, instrumental conditioning concerns optimal choice, that is, determining an assignment of actions to states, also known as a policy, that optimizes the subject's goals.

By contrast with instrumental conditioning, Pavlovian (or classical) conditioning traditionally treats how subjects learn to predict their fate in those cases in which they cannot actually influence it. Indeed, although RL is primarily concerned with situations in which action selection is germane, such predictions play a major role in assessing the effects of different actions, and thereby in optimizing policies. In Pavlovian conditioning, the predictions also lead to responses. However, unlike the flexible policies that are learned via instrumental conditioning, Pavlovian responses are hard-wired to the nature and emotional valence of the outcomes.

RL methods can be divided into two broad classes, model-based and model-free, which perform optimization in very different ways (Box 1, [14]).

Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. This is a statistically efficient way to use experience, as each morsel of information from the environment can be stored in a statistically faithful and computationally manipulable way. Provided that constant replanning is possible, this allows action selection to be readily adaptive to changes in the transition contingencies and the utilities of the outcomes. This flexibility makes model-based RL suitable for supporting goal-directed actions, in the terms of Dickinson and Balleine [15]. For instance, in model-based RL, performance of actions leading to rewards whose utilities have decreased is immediately diminished. Via this identification and other findings, the behavioral neuroscience of such goal-directed actions suggests a key role in model-based RL (or at least in its components such as outcome evaluation) for the dorsomedial striatum (or its primate homologue, the caudate nucleus), prelimbic prefrontal cortex, the orbitofrontal cortex, the medial prefrontal cortex, and parts of the amygdala [9,17–20].

Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state (so that states that lead to rewards, or to other high-valued states, have high values themselves). Model-free learning rules such as the temporal difference (TD) rule [3] define any momentary inconsistency between the values of successive states and the outcomes actually received as a prediction error, and use it to specify rules for plasticity that allow learning of more accurate values and decreased inconsistencies. Given correct values, it is possible to improve the policy by preferring those actions that lead to higher utility outcomes and higher valued states. Direct model-free methods for improving policies without even acquiring values are also known [21].
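As a rough, hypothetical illustration of the two classes (again our sketch rather than anything from the original review, and reusing the toy TRANSITIONS table and step function from the environment sketch above), the first function below plans by value iteration over an explicit world model, so that any change to the stored utilities alters its values at the next replanning; the second caches state values learned only from sampled transitions, nudging them by the TD prediction error delta = u + gamma*V(s') - V(s).

```python
# Illustrative only: two routes to state values for a small environment
# such as the toy TRANSITIONS table sketched earlier.

GAMMA = 0.9   # discount factor for outcomes received in the far future

def model_based_values(transitions, gamma=GAMMA, sweeps=100):
    """Model-based planning: value iteration over an explicit world model
    (transition probabilities and outcome utilities). Because the model is
    stored, changed utilities show up in the values as soon as we replan."""
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        for s, actions in transitions.items():
            if not actions:          # terminal state, nothing to plan
                continue
            V[s] = max(
                sum(p * (u + gamma * V[s2]) for p, s2, u in outcomes)
                for outcomes in actions.values()
            )
    return V

def td_update(V, s, u, s_next, alpha=0.1, gamma=GAMMA):
    """Model-free (TD) learning: treat the momentary inconsistency between
    the cached value of s and the utility-plus-value actually encountered
    as a prediction error, and nudge V[s] to reduce it."""
    delta = u + gamma * V[s_next] - V[s]   # TD prediction error
    V[s] += alpha * delta
    return delta

# e.g., with the earlier sketch in scope:
#   V_plan = model_based_values(TRANSITIONS)     # plan from the model
#   V_td = {s: 0.0 for s in TRANSITIONS}         # learn from samples instead
#   s_next, u = step("start", "go_left")
#   td_update(V_td, "start", u, s_next)
```

Note that the planner here evaluates the best available action at each state, whereas td_update simply evaluates whatever policy generated the samples; value-based policy improvement of the kind described above would then prefer actions leading to higher-valued states.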
Model-free methods are statistically less efficient than model-based methods, because information from the environment is combined with previous, and possibly erroneous, estimates or beliefs about state values, rather than being used directly. The information is also stored in scalar quantities from which specific knowledge about rewards or transitions cannot later be disentangled. As a result, these methods cannot adapt appropriately quickly to changes in contingency and outcome utilities.

Based on the latter characteristic, model-free RL has been suggested as a model of habitual actions [14,15], in which areas such as the dorsolateral striatum and the amygdala are believed to play a key role [17,18]. However, a far more direct link between model-free RL and the workings of affective decision-making is apparent in the findings that the phasic activity of dopamine neurons during appetitive conditioning (and indeed the fMRI BOLD signal in the ventral striatum of humans, a key target of dopamine projections) has many of the quantitative characteristics of the TD prediction error that is key to learning model-free values [4,22–24]. It is these latter results that underpin the bulk of neural RL.

'The Good': new findings in neural RL
Daw et al. [5] sketched a framework very similar to this, and reviewed the then current literature which pertained to it. Our first goal is to update this analysis of the literature. In particular, courtesy of a wealth of experiments, just two years later we now know much more about the functional organization of RL systems in the brain, the pathways influencing the computation of prediction errors, and time-discounting. A substantial fraction of this work involves mapping the extensive findings from rodents and primates onto the human brain, largely using innovative experimental designs while measuring the fMRI BOLD signal.
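To connect the TD prediction error of the framework section to the phasic dopamine findings cited above, the following self-contained toy simulation (ours; the trial structure and parameters are arbitrary assumptions) applies the same TD update within a stylized Pavlovian trial in which an unpredicted cue is followed a few time steps later by reward. The familiar qualitative pattern emerges: early in training the prediction error occurs at the time of reward, and with learning it migrates to the time of the cue, mirroring the dopamine recordings that the model-free account appeals to.

```python
# Stylized Pavlovian conditioning trial for TD learning. An unpredicted cue
# is followed K time steps later by a unit-utility reward; values are
# attached to the post-cue time steps only (nothing predicts the cue, so the
# pre-cue prediction is held at zero). All parameters are illustrative.

K = 4  # time steps from cue onset to reward

def run_trials(n_trials=200, alpha=0.2, gamma=1.0):
    V = [0.0] * (K + 2)   # V[1..K]: post-cue steps; V[K+1] = 0 ends the trial
    for trial in range(1, n_trials + 1):
        # Prediction error when the unpredicted cue appears
        # (prior prediction 0, new prediction V[1]); nothing precedes the
        # cue, so this error is recorded but updates no stored value:
        delta_cue = 0.0 + gamma * V[1] - 0.0
        delta_reward = 0.0
        # Step through the cue-reward interval, updating cached values:
        for t in range(1, K + 1):
            u = 1.0 if t == K else 0.0            # reward at the last step
            delta = u + gamma * V[t + 1] - V[t]   # TD prediction error
            V[t] += alpha * delta
            if t == K:
                delta_reward = delta
        if trial in (1, 10, 50, 200):
            print(f"trial {trial:3d}: error at cue {delta_cue:+.2f}, "
                  f"error at reward {delta_reward:+.2f}")

if __name__ == "__main__":
    run_trials()
```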