CHAPTER 15

Value Learning through Reinforcement: The Basics of Dopamine and Reinforcement Learning

Nathaniel D. Daw and Philippe N. Tobler

Neuroeconomics. DOI: http://dx.doi.org/10.1016/B978-0-12-416008-8.00015-2 © 2014 Elsevier Inc. All rights reserved.

OUTLINE

Introduction
Learning: Prediction and Prediction Errors
Functional Anatomy of Dopamine and Striatum
Responses of Dopamine Neurons to Outcomes
Sequential Predictions: From Rescorla-Wagner to Temporal Difference Learning
Temporal Difference Learning and the Dopamine Response
From Error-Driven Learning to Choice
Conclusions
References

INTRODUCTION

This chapter provides an overview of reinforcement learning and temporal difference learning and relates these topics to the firing properties of midbrain dopamine neurons. First, we review the Rescorla-Wagner learning rule and basic learning phenomena, such as blocking, which the rule explains. Then we introduce the basic functional anatomy of the dopamine system and review studies that reveal a close correspondence between responses emitted by dopamine neurons and signals predicted by reinforcement learning. Finally, we introduce the generalization of the Rescorla-Wagner rule to sequential predictions as provided by temporal difference learning, and discuss its application to phasic activation changes of dopamine neurons. Subsequent chapters in this section deal with more advanced topics in reinforcement learning and presume that the reader is familiar with material covered in this chapter.

LEARNING: PREDICTION AND PREDICTION ERRORS

An important problem facing decision makers is learning, by trial and error, which decisions to make, so as best to obtain reward or to avoid punishment. In computer science, this problem is known as reinforcement learning (RL; for a more thorough introduction, see Sutton and Barto, 1998), and algorithms to accomplish it have been studied extensively. This chapter reviews the rather striking correspondence between theoretical algorithms and evidence from neuroscience and psychology about how the brain solves the RL problem. The prime correspondence between these two areas of research centers around the dopaminergic neurons of the midbrain (reviews can also be found in Glimcher, 2011; Niv, 2009; Schultz, 2007; Schultz et al., 1997; Tobler, 2009).

To understand the role these neurons play, we first review research in learning, decision, and reward. We begin with evidence from classic experiments in psychology using an experimental preparation, classical conditioning (also known as pavlovian conditioning), which involves learning, but not decisions. This is an important subcomponent of the full RL problem, because choice between actions can be based on predicting how much reward they will produce.

Pavlov (1927/1960) famously exposed dogs to repeated pairings whereby an initially neutral stimulus, such as a bell, accompanied food, such as meat powder. He observed that following such training, the dogs would salivate to the sound of the bell even if it was presented without the food, by virtue of the bell’s predictive relationship with the food. This conditioned response offers a direct window on how organisms use experience to learn to predict reward. Variations of this basic experiment have been conducted with a variety of species, from molluscs to humans, using a variety of appetitive and aversive outcomes as rewards and a variety of anticipatory behaviors as responses, and many basic phenomena are widely preserved across this range of species.

One popular view of the learning process that emerges from these experiments is that learning in classical conditioning is based on a comparison between what reward the organism experiences on a particular trial, and what reward it had expected on the basis of its previous learning (Bush and Mosteller, 1951). The difference between these two quantities is known as a prediction error: if the difference is large, predictions did not match observations, and there is a need for more learning to update those predictions.

More formally, assume that an animal maintains a set of predictions of the reward associated with each stimulus, $s$, called $V(s)$ (for value). Also assume that these predictions determine the animal’s conditioned response to whichever stimulus is observed. Then upon observing stimulus $s_k$ (e.g., the bell on trial $k$) and receiving a reward on that trial, $r_k$, the prediction error is

$$\delta_k = r_k - V_k(s_k) \qquad (15.1)$$

As we will see below, this prediction error (with further refinements) appears to be carried by dopaminergic neurons (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997).

The animal then updates the prediction in the direction of the prediction error, so as to reduce it. Thus, the predicted value on the next trial, $k + 1$, of the stimulus $s_k$ is:

$$V_{k+1}(s_k) = V_k(s_k) + \alpha \cdot \delta_k \qquad (15.2)$$

(The value of stimuli that aren’t observed remains the same, i.e. $V_{k+1}(s) = V_k(s)$, for all $s \neq s_k$.) In Equation 15.2, $\alpha$ is a learning rate parameter, between 0 and 1, which determines the size of the update step. Its interpretation is clearer in an algebraically rearranged form of the update rule, $V_{k+1}(s_k) = (1 - \alpha) V_k(s_k) + \alpha r_k$. This form reveals that the error-driven update accomplishes a weighted average between the observed reward (with weight $\alpha$) and the previous reward prediction (with weight $(1 - \alpha)$). Thus a larger learning rate updates the value prediction to look more like the current reward and a smaller learning rate relies more on older estimates than on the current reward.

A related way to understand this model, resulting from further algebraic manipulation, is to realize that it computes a weighted running average of all rewards received previously in the presence of the stimulus, with the most recent reward weighted most heavily and the weight for prior rewards declining exponentially in their lag. Here, the learning rate can be equivalently seen as controlling the steepness of the decay, with higher learning rates producing averages more sharply weighted toward the most recent rewards. Such an exponential pattern (Figure 15.1a) is a key hallmark of this sort of error-driven updating, which we will see verified in both behavioral and neural data later in this chapter.

Accordingly, applied to a simulated conditioning experiment (in which a bell is repeatedly paired with meat powder), the error-driven learning model described above nudges the prediction toward the observation on each trial, producing a gradual, asymptoting learning curve that ultimately predicts the actual magnitude of the average reward (Figure 15.1b). If rewards are stochastic (if meat powder is delivered based on the flip of a fair coin), then positive and negative prediction errors will be interleaved, and the net effect of all of these is that the prediction will climb more sporadically to oscillate around the average reward (Figure 15.1b).

[Figure 15.1. Panel A: Weight versus delay (trials). Panel B: Prediction versus trial.]

FIGURE 15.1 (A) The weights on rewards received at different past trials, according to the Rescorla/Wagner model. Weights decline exponentially into the past, with a steepness that depends on the learning rate parameter. (B) Simulation of Rescorla/Wagner model learning about four different cues, which are reinforced (from top to bottom) 100%, 75%, 50%, and 25% of the time. Learning curves grow to asymptote; for the stochastically rewarded stimuli, the prediction is noisy (driven by random patterns of reward and non-reward) around the underlying average reward.

A further question (Rescorla and Wagner, 1972) is how animals learn stimulus-reward (for example: light-meat powder) relationships, when the experience with that stimulus is accompanied by other stimuli (the light is accompanied by a bell) that may themselves have previous reward associations. Kamin (1969) found behaviorally that such previous learning (about the bell) can attenuate (or block) new learning (about the light). Imagine that one of Pavlov’s dogs has learned that a bell predicts meat powder and reliably salivates upon presentation of the bell. Now a light is presented simultaneously with the bell, and both of them are followed by meat powder. When the light is tested on its own, the dog’s salivation to it is reduced (e.g., relative to a control situation in which the bell was also novel). Previous learning about the bell has blocked learning about the light’s relationship with reward. The blocking phenomenon suggests that stimuli interact or compete with each other to explain the same rewards.

The Rescorla-Wagner (1972) model captures this effect by specifying that when multiple stimuli are observed (light and tone), the animal makes a single net prediction that is the sum of all of their predic- [...]

[...] stimulus (the bell) that itself had previously been trained to predict reward, then the animal can learn to salivate to the click, even though the click has never itself been directly paired with reward. Such an effect is not predicted under the Rescorla-Wagner model, because the prediction error on a trial with the click and bell, but no reward, is negative. Before we treat this in greater detail, let us first consider how dopamine neurons and their target structures process reward prediction errors.

FUNCTIONAL ANATOMY OF DOPAMINE AND STRIATUM

The majority of dopamine neurons reside in the midbrain and form three cell groups, the retrorubral nucleus (RRN; cell group A8 in the rat), the substantia nigra pars compacta (SNpc; A9), and the ventral tegmental area (VTA; A10). These cell groups are contiguous, such that there are no clear boundaries between them. From these small nuclei, the dopamine neurons send widespread, ascending projections to regions such as the striatum (caudate and putamen), the amygdala and the (primarily frontal) cerebral cortex (Figure 15.2).
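
Because dopamine signals are compared below against the error-driven rule of Equations 15.1 and 15.2, it may help to see that rule run on simulated trials. The following is a minimal Python sketch, not the authors' simulation code. The function names, the learning rate of 0.1, the 100-trial length, the starting prediction of zero, and the coding of reward as 1 and non-reward as 0 are all illustrative assumptions. The four reinforcement probabilities follow the caption of Figure 15.1.

```python
import random


def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Apply the error-driven update to one stimulus over a reward sequence.

    Returns the prediction held before the first trial followed by the
    prediction after each trial (so len(result) == len(rewards) + 1).
    """
    v = v0
    history = [v]
    for r in rewards:
        delta = r - v          # prediction error (Equation 15.1)
        v = v + alpha * delta  # update toward the observed reward (Equation 15.2)
        history.append(v)
    return history


def running_average_weights(alpha, n):
    """Weight placed on the reward received `lag` trials ago (lag 0 = most recent).

    Unrolling V <- (1 - alpha) * V + alpha * r gives weights alpha * (1 - alpha)**lag.
    """
    return [alpha * (1 - alpha) ** lag for lag in range(n)]


def simulate_cue(p_reward, n_trials=100, alpha=0.1, seed=0):
    """Simulate one cue reinforced with probability p_reward (reward coded as 1 or 0)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rewards = [1.0 if rng.random() < p_reward else 0.0 for _ in range(n_trials)]
    return rewards, rescorla_wagner(rewards, alpha)


# Four cues reinforced 100%, 75%, 50%, and 25% of the time.
for p in (1.0, 0.75, 0.5, 0.25):
    rewards, values = simulate_cue(p)
    print(f"p(reward)={p:.2f}  final prediction={values[-1]:.3f}")

# Check the running-average reading: with a starting prediction of zero, the
# final prediction equals the exponentially weighted sum of past rewards.
rewards, values = simulate_cue(0.5)
weights = running_average_weights(0.1, len(rewards))
weighted_sum = sum(w * r for w, r in zip(weights, reversed(rewards)))
print(f"iterative={values[-1]:.6f}  weighted sum={weighted_sum:.6f}")
```

Under these assumptions the fully reinforced cue follows the smooth curve 1 - (1 - alpha)^k, while the partially reinforced cues fluctuate around their reinforcement probability. A larger alpha steepens the weight profile and makes those fluctuations larger.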