Learning, Action, Inference and 455

Learning, Action, Inference and Neuromodulation P Dayan, University College London, London, UK serotonin play a special role in positive and negative N D Daw, New York University, New York, NY, USA affect, with norepinephrine and acetylcholine being Y Niv, Princeton University, Princeton, NJ, USA particularly involved with uncertainty. Further, ortho- ã 2009 Elsevier Ltd. All rights reserved. gonal to this distinction, the neuromodulators also have two main classes of effects on neural process- ing: (1) influencing neural plasticity and learning and (2) regulating different sources of input to neurons Introduction and controlling competition between networks of Neuromodulators, including , norepineph- neurons. rine, acetylcholine, and serotonin, exert a wide and This article first provides a somewhat general char- diverse collection of effects on a broad range of corti- acterization of issues surrounding learning and then cal and subcortical target structures. They are also briefly discusses immediate effects of the neuromodu- critically implicated in many aspects of normal beha- lators on inference and regulation. vior and both neurological and psychiatric condi- tions. With the exception of acetylcholine, they are Learning synthesized by neurons whose somas lie in relatively small subcortical regions such as the ventral tegmen- Long-term synaptic plasticity, that is, persistent facil- tal area and (for itation and depression of the efficacies of synapses, dopamine), the (for norepinephrine), can be characterized at many different levels of biologi- and the dorsal and median (for serotonin). cal realism and according to many different theoretical However, these neurons can have massive axonal abstractions. Unfortunately, experimental difficulties arbors, innervating large areas. The activity of the have hindered efforts to understand completely the neuromodulatory neurons and the release of their range of effects of neuromodulators on plasticity; associated neuromodulators (which is subject to many therefore this article proceeds at a more abstract factors other than the momentary spiking rates) onto level appropriate to the relevant normative theories. the bewildering diversity of their pre- and postsynap- The central concept for the learning associated tic receptors exhibit a large number of incompletely with neuromodulators is prediction. Consider a very understood temporal and spatial regularities. simple experiment comprising many trials, in each There are obviously huge gaps in our knowledge of which a particular image may or may not be pre- about neuromodulators, and there are doubtless many sented and a drop of pleasant-tasting fruit juice may as yet undiscovered complexities. However, in some or may not be delivered (regardless of what the sub- key respects, they appear to admit relatively simple jects do). If the juice is provided more often with than theoretical characterizations associated with learning without the image, then subjects learn to expect or predictions of affectively important outcomes such as predict the delivery of the juice when the image is rewards and punishments, learning to choose actions presented. Such predictions can be measured via that optimize those outcomes, and regulating and behaviors such as anticipatory salivation or licking, controlling certain general aspects of information pro- which are thought to be triggered more or less directly cessing. These are the most basic functions associated and automatically by the prediction of the reward. with the survival of mobile, decision-making organ- This is a very simple instance of what is known as isms, and it turns out that a very helpful normative classical or Pavlovian conditioning. The image is underpinning for these theories can be derived from called a conditioned stimulus, the juice is known as statistics, operations research, and engineering. It an unconditioned stimulus, and the predictive beha- should be acknowledged at the outset, however, that viors are termed conditioned responses. these theories ignore many important functions of the Trial-Based Prediction neuromodulators, such as their role in regulating sleep–wake cycles. Nonetheless, they are influential Reinforcement learning theories formalize this predic- in organizing large bodies of experimental results tion learning by positing that the presence of the image and also in compelling new empirical approaches. on trial n is coded by the activity of a unit xn ¼ 1 (with Under these theoretical views, neuromodulators xn ¼ 0 if the image was not presented). This unit report on two main quantities: (1) information makes a connection, via an abstract synapse with associated with positive and negative outcomes and strength wn, to a neuron. This neuron also receives (2) information associated with uncertainty. It seems, information about the delivery (rn ¼ 1) or nondelivery at least to a crude approximation, that dopamine and (rn ¼ 0) of juice on trial n. The value rn would be scaled 456 Learning, Action, Inference and Neuromodulation on the basis of the appetitive utility of the juice. The where they were invented to describe the develop- idea is that the synaptic connection should be mod- ment of behavioral responding in conditioning ified according to a learning rule, such that the pres- experiments. Their relationship both with the neural ence of the stimulus will predict the occurrence of data on the actual activities and concentrations of the the reward. neuromodulators and with normative computational One rather general form of such a learning rule is ideas came substantially later. One of the most important psychological learning nþ1 ¼ n þ nn n ½ w w x 1 rules is called the Rescorla–Wagner rule. This is based which indicates how wn changes from one trial to the on the idea that the net prediction of juice on trial n is n ¼ n n next. This update is a function of three quantities: just the total presynaptic input, here v x w , and derives learning from the prediction error, the n a (learning rate): scales the size of the change to the difference between the juice that is provided and the synapse. juice that is predicted: dn (prediction error): incorporates information about n n n n unpredicted or incorrectly predicted outcomes and ¼ r w x ½2 typically controls the direction of change xn (synaptic eligibility): here, limits synaptic change ¼ rn vn ½3 to those trials in which the image is presented n In its simplest form, the Rescorla–Wagner rule (x ¼ 1). also suggests that the learning rate an ¼ a is constant. Different reinforcement learning theories posit dif- Figure 1(a) shows the operation of this rule, indicating ferent forms for these terms. Many early ideas in this how the values of rn and xn specify wn over a whole set area came from the field of mathematical psychology, of trials.

Outcome r =0 r =1 1 =0 Weight V

0 R Prediction Image =1 V Juice R 5101520 (No R)

0.5 s ab0 1s c Figure 1 Dopamine and learning rules. (a) The course of learning using the Rescorla–Wagner rule. The top plot shows the evolution of n the weight w over the course of the 20 trials with an image and juice being provided as in the bottom plot. The learning rate a = 0.4. The stochastic nature of the learning to a level around the asymptotic value of 0.8 is apparent. (b) Evidence for a prediction error at the time of reward expectancy is shown in the activity of dopamine cells. The four boxes show rasters and accumulated histograms of the activity of identified dopamine neurons in macaque monkeys (during 20 trials), where the reward is a small drop of juice (provided at the time marked by R). The plots are in accord with the prediction error of eqn [3]. (c) The activity of dopamine neurons in response to the presentation (marked by vertical lines) of a conditioned stimulus which either does not (upper) or does (lower) predict the future delivery of juice. These are consistent only with temporal difference learning and not the Rescorla–Wagner rule. (b) Adapted from Schultz W, Dayan P, and Montague PR (1997) A neural substrate of prediction and reward. Science 275: 1593–1599. (c) Reprinted, with permission, from the Annual Review of Physiology, Volume 57 ã 2006 by Annual Reviews www.annualreviews.org. Learning, Action, Inference and Neuromodulation 457

Although extremely simple, this rule has quite a Rescorla–Wagner rule is critically flawed) or when number of desirable computational properties and there are multiple predictive images or stimuli, how an excellent normative basis in engineering (where it prediction learning and action learning are coupled, is more usually known as the delta rule). Most impor- and how punishments (e.g., small electric shocks) are tant, it leads to correct predictions, at least on aver- processed. There are also some data about the neural age. In particular, if the juice is always provided with structures involved in the various components of this the image, then the weight wn will come to take the learning rule (i.e., xn and wn), notably implicating value 1; if the juice is provided on half the trials with nuclei in the amygdala and the orbitofrontal cortex, the image, wn will take the value of 0.5, and if the together with their connections to the nucleus accum- juice is never provided, then wn will become 0. The bens, all structures that receive a strong dopamine rule also has a decent evidentiary trail in psychology, projection. capturing many empirical phenomena surrounding the acquisition and extinction of conditioned beha- Temporal Prediction viors in variants of the simple Pavlovian conditioning experiment discussed here. The use of the word prediction hints at the critical Somewhat remarkably, the error signal dn in this role that time plays in conditioning experiments. form of the rule also appears to capture well some, Broadly speaking, to be a good predictor, it is impor- but crucially not all, of the characteristics of the tant that the image precede (rather than coincide with phasic spiking activity of dopamine cells as recorded or follow) the juice. This leads to both conceptual and electrophysiologically in the during a learn- psychological/neural problems. The conceptual prob- ing task like this. Figure 1(b) shows an example of this lem is that it is necessary to change the goal of predic- match. The figure shows a raster plot of the activity of tion – there is no point in predicting the amount of a dopamine neuron in an awake, behaving macaque juice that will be provided at the same time as the monkey at the time of the delivery or nondelivery of image, since they are not delivered at the same time; an unexpected or expected reward. If, indeed, the rather, it is necessary to predict the amount that will prediction error occasioned by delivery or nondeliv- come in the future. An important step in the field was ery of juice is reflected by phasic bursts of activity the suggestion by Richard Sutton and Andrew Barto above or below the slow, irregular baseline of neural that participants should try to predict a long-run firing, then from eqn [3], one should expect the fol- measure of the reward, such as its sum over a whole lowing: suprabaseline activity for the delivery of the trial. Achieving this requires a different prediction unexpected reward (dn ¼ 10 ¼ 1; for instance, at the error from that in eqn [3]. It also turns out that this outset of learning, when the participant cannot make new prediction error accounts for more features of an accurate prediction); infrabaseline activity for the the activity of dopamine neurons, specifically their nondelivery of the expected reward (dn ¼ 01 ¼ 1); firing patterns at the time of the image presentation. and, crucially, baseline activity when an expected To understand this new prediction error, one needs reward is provided (dn ¼ 11 ¼ 0). We would also a new variable t to count time within each trial. For expect baseline activity for the nondelivery of a non- instance, the image might be presented at time t ¼ 2, expected reward (dn ¼ 00 ¼ 0). The figure, although the juice at time t ¼ 5, with the trial ending at t ¼ T. not showing data collected in exactly this manner, Now the reward on trial n is indexed by time rn(t) shows just these expectations to be fulfilled. Note, with, for instance, rn(5) ¼ 1 on trials on which the though, that dopamine activity does not offer a juice is given. The new task of the participants is to completely linear code for dn. The potential fidelity of predict vn(t) at time t within a trial as the sum of all the coding of negative prediction errors is limited the future rewards that will be provided on average by the fairly low baseline activity of the neurons; during that trial after t. Assuming for the moment there is also some evidence that the coding is that rewards are delivered deterministically, this adaptive, moulding itself to the appropriate statistical means that the following relationship should hold ▲ contingencies. (for which is used the symbol ¼ ):

In sum, at the time at which reward delivery is n ▲ n n n n v ðtÞ ¼ r ðtÞþr ðt þ 1Þþr ðt þ 2Þþ...þ r ðTÞ½4 expected, dopamine neurons appear to convey a form of prediction error for the delivery of reward, The obvious trouble with this definition as a goal a signal that is exactly appropriate for training pre- for prediction is that the right-hand side, measuring dictions (as in eqn [1]). This finding leads to a large the actual delivery of juice, cannot be assessed until range of questions about what happens when the the end of the trial. The solution to this problem, temporal relationship between the image and the juice which underlies the new prediction error, is to is manipulated (the data here showing that the observe that the sum can be split into its first and 458 Learning, Action, Inference and Neuromodulation subsequent terms: notably, allowing a predictor of reward to act like a

n ▲ n n n n surrogate reward itself, an effect known in psychology v ðtÞ ¼ r ðtÞþ½r ðt þ 1Þþr ðt þ 2Þþ:::þ r ðTÞ ½5 as conditioned reinforcement. so if the prediction at time t þ 1, vn(t þ 1), is, as it The neural problem associated with temporal pre- should be, equal to rn(t þ 1) þ rn(t þ 2) þ... rn(T), diction is exactly that the predictors need to be able to then, count time; in this case, the time since the image was presented. The well-timed depression in the activity n ▲ n n v ðtÞ ¼ r ðtÞþv ðt þ 1Þ½6 of the dopamine neurons when an expected reward is not delivered is clear evidence that exactly this The new prediction error becomes the discrepancy happens (and incidentally argues strongly against between the left- and right-hand sides and can be certain other construals of the dopamine activity, computed after waiting just one timestep, without for instance that it flags all ‘salient’ events without waiting for the end of the trial: regard to their appetitive or aversive valence). How- nðtÞ¼rnðtÞþvnðt þ 1ÞvnðtÞ½7 ever, although there are various suggestions about the involvement of the , the hippocampus, This is called the temporal difference prediction the , and the parietal and frontal cortices in error because of the key role played by the change in counting time, the exact nature of this signal is not the prediction over consecutive timesteps (vn(t þ 1) well understood. vn(t)). Even though, during learning, it cannot be guaranteed that vn(t þ 1) ¼ rn(t þ 1) þ rn(t þ 2) þ Multiple Predictive Stimuli ...þ rn(T), it turns out that a learning rule like eqn So far only the case of just one potentially predictive [1], using dn(t) to update the weights associated with image has been considered. Now consider the case in the prediction made at time t will also ultimately which there are multiple predictive stimuli. It turns arrive at the correct values. Learning of this sort is out that only the temporally simple case (associated said to involve bootstrapping because the predictions with the Rescorla–Wagner rule) has to be treated, at one time are learned partly on the basis of predic- since the issues about time t within a trial are orthog- tions at the subsequent time. onal to those about the multiple stimuli. The most important practical difference between The Rescorla–Wagner version of the learning rule the previous and present prediction errors (i.e., (eqn [1]) was designed to address psychological phe- between eqns [3] and [7]) is the value of the new nomena that arise when there are multiple predic- prediction error associated with the first presenta- tive stimuli. The main idea is to elaborate it to the tion of the image (in this case, dn(1) ¼ rn(1) þ vn(2) following: vn(1) ¼ 0 þ vn(2)0 ¼ vn(2)). If the image is always nþ1 ¼ n þ nn n ½ followed by juice, then, once learning is complete, the wi wi i xi 8 prediction changes from vn(1) ¼ 0tovn(2) ¼ 1 when n the image is provided, resulting in a transient predic- where there are now multiple stimuli xi , each with its n tion error at time t ¼ 1, dn(1) ¼ 1. The prediction then own prediction weight wi , but, critically, still with dn remains at 1 until the reward is actually delivered, so only one global prediction error . there is no difference between vn(t)andvn(t þ 1), and The Rescorla–Wagner rule involves the net predic- n therefore dn(t) ¼ 0 for all t preceding the time of tion v of juice, defined as the sum of the predictions reward. At the reward time, (here t ¼ 5), no further associated with each stimulus: n ¼ reward is expected in the trial, so v (6) 0, and the vn ¼ xn wn þ xn wn þ ::: ½9 error from eqn [7] is dn(5) ¼ rn(5)vn(5). Hence, this 1 1 2 2 prediction error behaves similarly to that in the If, perhaps as a result of prior learning, one of the Rescorla–Wagner rule for all the situations shown in stimuli present on a trial already predicts the reward Figure 1(b). correctly, then the prediction error (of eqn [3]) will be Figure 1(c) shows that the activity of dopamine 0(dn ¼ 0), and therefore the synapses associated with neurons at the time of the image follows exactly the any other stimuli which are also present, including those pattern expected from eqn [7]. It is as if the activity that do not yet predict the reward, will not change. This that at the beginning of training happens at the time of effect is known in the psychological literature as block- the unexpected reward moves backward in time ing. The ample behavioral evidence for blocking was across the trial so that it is instead initiated by taken as an early sign that learning is indeed driven by the earlier reliable predictor of that reward, namely, prediction errors rather than, for instance, simply cor- the image. The transient activity at the time of the relations. There is also some evidence for blocking in the image can play a number of important functions, activities of dopamine neurons. Learning, Action, Inference and Neuromodulation 459

Stimulus multiplicity could also affect the other Although the Kalman filter certainly captures components of the learning rule. In particular, a num- something of the flavor of the Pearce–Hall learning ber of competing learning rules that have been sug- rule, its detailed predictions and putative cholinergic gested on the basis of psychological data suggest that substrates have yet to be fully tested, either behavior- there are stimulus-specific predictions (and prediction ally or neurally. Norepinephrine also seems to exert errors) and invoke stimulus-specific (and changing) rather general effects on the overall speed of learning, n learning rates i . These rules treat the learning rates particularly in circumstances in which there are gross as associabilities, reflecting a measure of the attention changes to an environment. One theoretical account that the learners lavish on the stimuli. The most cele- of this suggests that norepinephrine reports a differ- brated of these accounts, due to John Pearce and ent form of uncertainty, that associated with the Geoffrey Hall, suggests that stimuli should be associ- whole context. This is coupled to learning since con- able to the degree that they have had surprising con- textual change typically induces a need for gross revi- sequences. In this view, multiple stimuli can interact sion or fresh acquisition of predictions. There are two with one another to produce learning phenomena final issues associated with multiple stimuli. First, the such as blocking at least partly through competition task of combining the predictions of reward made by for these associabilities. This account has attracted multiple stimuli can be considered an example of the many experimental and theoretical studies. more general statistical problem of combining multi- Experimentally, Peter Holland, Michela Gallagher, ple sources of evidence. There is ample psychological and their colleagues have collected evidence from data suggesting that, as in most statistical accounts, lesion, pharmacological, and behavioral studies sug- this sort of combination is typically not additive (as gesting that associabilities are realized (at least in in eqn [9]) but rather is competitive, with the pre- rats) through a pathway involving the central nucleus dictions made by different stimuli being averaged of the amygdala, attentional and control regions of together and the prediction of the most reliable (least the cortex, and, notably for the topic of this article, uncertain) predictor being weighted most heavily. How various parts of the acetylcholine system. Oddly, important this is for prediction learning, and indeed there appear to be different anatomical substrates what its neural substrate is, are presently unclear. for increases and decreases in stimulus associability. Second, the notion that each stimulus has its own From a theoretical perspective, the sort of surprise independent, binary input feature xi is too simple, inherent in Pearce and Hall’s suggestion can be for- particularly if the stimulus representations associated mulated as a particular form of uncertainty (a fact with prediction learning are those in refined areas that, together with Holland and Gallagher’s six results such as prefrontal cortex. Rather, sophisticated corti- and others, led to the suggestion that such uncertainty cal (and perhaps hippocampal) mechanisms will play might be reported by acetylcholine). Normative the- a critical role in creating stimulus representations that ories of prediction learning arising in the fields of are appropriately tailored to sensory experience. statistics and engineering actually suggest learning rules rather like eqn [8] and provide a precise charac- n Action Learning terization of what factors should control i . The best known and simplest of these theories, called the Kal- So far, this article has considered prediction learning man filter, suggests that the learning rates act to divide but not the selection of actions based on such predic- up the impact of the prediction error dn on the weights tions. Although the prediction of upcoming reward n wi for each of the stimuli such that the stimuli whose can in itself directly trigger conditioned behavioral weights are least well known (i.e., most uncertain) are responses, such as approach, or the salivation of most susceptible to learning. That is, the more uncer- Pavlov’s dogs, these represent a fairly limited portion tain the prediction made by one stimulus (for instance, of the behavioral repertoire. Animals can also learn because it has been observed only a few times), the to take arbitrary actions (such as pressing a lever) more it is deemed responsible for any nonzero predic- in order to obtain rewards that are contingent on n n tion error d , and so the more its prediction w i such actions. This learning is known as instrumen- changes. Variants of this theory also link a stimulus’s tal conditioning, and its neural substrates have actu- uncertainty to the degree of surprise that has accom- ally been more extensively investigated than those panied it, as for associability in the Pearce–Hall rule. of Pavlovian conditioning. This area has revealed Additionally, in noisy environments in which stimuli some important complexities; for instance, in rats, an and rewards are inherently unreliable, the learner instrumentally conditioned behavior such as a lever should expect substantial prediction errors even if its press can apparently arise through either of at least predictions are as good as can be. This should have the two neurally and behaviorally dissociable pathways. effect of reducing all the learning rates. One of these involves at least the prelimbic prefrontal 460 Learning, Action, Inference and Neuromodulation cortex and the dorsomedial striatum, may be some- such as the frontal pole and the intraparietal sulcus in what independent of dopaminergic neuromodulation, regulating exploration on a trial-by-trial basis. and produces actions that can behaviorally be shown The recent evidence from the activity of dopamine to be sensitive to the particular reward (such as food) neurons during choice between several possibly rewar- expected. Such actions are therefore known as ‘goal- ding actions favors a particular temporal-difference directed.’ The other pathway involves the dorsolat- rule for learning the Q-values. Just as the prediction eral striatum and the infralimbic prefrontal cortex, is error dn(t) in eqn [7] is based on the difference between strongly dependent on dopamine, and produces so- two successive predictions vn(t þ 1) and vn(t), this new called ‘habitual’ actions. These are behaviorally char- prediction error is based on the difference between acterized by their insensitivity to the particular reward two successive predictions, Qn(t þ 1,a(t þ 1)) and expected; for instance, a rat may habitually press a food Qn(t, a(t)), depending on the actions that are actually delivery lever even if it is not hungry. executed at times t and t þ 1: There is both behavioral and neural evidence that nðtÞ¼rnðtÞþQnðt þ 1; aðt þ 1ÞÞ Qnðt; aðtÞÞ ½11 the habitual action system is based on prediction learning of almost exactly the sort described above. This prediction error can then be used at the heart This is consonant with a much larger body of of a learning rule just like that in eqn [1], with the evidence that dopamine is involved in action choice; prediction Qn(t,a(t)) being updated on the basis of besides habitual instrumental action, the neuromodu- this prediction error. lator is associated with addictive drugs, with self- stimulation experiments in which animals will work to receive electrical stimulation in some brain areas, Appetitive and Aversive Predictions and with motor pathologies such as Parkinson’s dis- ease. One particular view of action learning has This article has discussed appetitive predictions and recently received strong support from data on the the role of dopamine. However, in many circum- activity of dopamine neurons in a task which involves stances, it is actually even more important to predict monkeys’ making choices between alternatives asso- aversive outcomes. Further, animals can certainly ciated with different probabilities of reward. Under learn to execute particular actions to avoid punish- this scheme, the predictions should be not of the ments and can titrate their actions to balance the reward that will accrue after time t (written as vn(t) benefits and costs of rewards and punishments asso- in eqn [4]), but rather of those that will accrue after ciated with particular actions. The difference between time t if the learner chooses a particular action (say, a appetitive and aversive predictions and outcomes is (t)) at time t. These so-called Q-values (written as completely transparent to the normative theories, Q(t,a(t)) were originally suggested by Christopher which happily consider them to live at positive and Watkins. If such predictions were known perfectly, negative segments of a single dimension of value. action selection could ensue simply by choosing the However, there is a huge body of psychological and action whose value is greatest, that is, the one predic- neural data showing that animals treat them very tive of the most reward. For a number of reasons, it differently. There are also many open controversies can be prudent to choose somewhat randomly, in the literature, including a very active debate about assigning a greater chance to actions with greater the role of dopamine in aversively motivated beha- Q-values: viors. Aversive predictions play a key role in psychi- atric conditions such as anxiety disorders (whose eQðt;aÞ PðaðtÞ¼aÞ¼P ½10 basis may be nonfactual predictions of threat) and Qðt;bÞ b e paranoid . These various data have yet to be encompassed Here, b > 0 controls how fierce the competition is; in the sort of integrated computational, psychologi- the larger b, the more likely the action with the largest cal, and neurobiological account outlined here for Q-value will be chosen. One reason to employ such a appetitive prediction learning and action choice. probabilistic rule is to ensure exploration of unfamiliar A popular psychological idea is that separate, but alternatives and to avoid behavioral rigidity. It has been interacting, opponent systems are involved in training suggested that norepinephrine regulates exploration by appetitive and aversive predictions. One possibility, controlling b, coupling the long-term success of a suggested by William Deakin and Frederico Graeff behavioral policy to reduced exploration and long- and still somewhat speculative, is that one part of term failure to enhanced exploration. There is also the serotonin system acts as the aversive opponent starting to be some evidence (from functional imaging to dopamine. The data supporting this idea mainly in humans) about the involvement of neural structures originate in studies of the behavior of animals under Learning, Action, Inference and Neuromodulation 461 pharmacological manipulation of serotonergic and benefits of vigorous responding. The model argues dopaminergic function. Full tests of this hypothesis that tonic levels of dopamine report the average rate using electrophysiological recordings of serotonergic of reward and act as a form of opportunity cost which neurons are only just beginning. determines the optimal response vigor. The greater the opportunity cost, the greater the expectation of Immediate Effects of Neuromodulators reward per unit time, and so the faster the participant must act to achieve this. This also implicates tonic This article has emphasized the effects of neuromo- dopamine in mediating the effects of motivation on dulators on learning. However, a more traditional response vigor. For instance, satiety, reducing the view of neuromodulation, established most firmly in value of each reward, would decrease the net average studies of central pattern generation in invertebrates, rate of reward and thus result in lower tonic dopa- is that it regulates connectivity and excitability over mine levels and more-sluggish responding. the short term, allowing multiple functional networks Finally, if acetylcholine reports on the learner’s to share a single anatomical substrate. Even in the uncertainty about aspects of its environment, then context of the sorts of conditioning functions dis- from a statistical viewpoint, it should influence not cussed here, there is indeed evidence that the neuro- only learning but also the interpretation of sensory modulators exert short-term effects not mediated input – in statistical terms, inference. In particular, if through learning. A few, in particular, have been uncertainty is high, then ‘top-down’ information most extensively investigated, including the impact about prior beliefs and expectations should be less of dopamine on working memory and on the vigor trusted than ‘bottom-up’ information from sensation. of responding, and the role of acetylcholine and nor- This involves just the sort of regulation of connectivity epinephrine in regulating different sources of infor- that is implied by invertebrate neuromodulation. mation associated with making predictions. There is indeed substantial evidence that acetylcho- There is quite some evidence that dopamine, like line can boost the impact of bottom-up information other neuromodulators, can immediately influence compared with top-down information. Concomitantly, the excitability of neurons, often characterized by if norepinephrine reports on contextual change, then the gain of their input–output functions. If the neurons it should also have the effect of regulating the use of are connected in competitive networks (for instance, top-down information. subserving the choice between multiple possible actions), then changing the gain (a bit like changing Summary b in eqn [10]) can change the dynamics and stability of the competition. The effect of this on action choice Neuromodulators are involved in some of the most has been studied in some detail by Jonathan Cohen basic adaptive functions of an organism, notably how and his colleagues. In prefrontal cortex, a similar it behaves and learns to behave appropriately in light effect of dopamine has been suggested as being impor- of rewards, punishments, and other information. tant for gating the short-term storage of information Dopamine and, putatively, serotonin appear to report (such as the fact that an image predicting reward was prediction errors for future reward and punishment, recently presented). As shown, this sort of storage to influence the acquisition of predictions and appro- plays a critical role in making appropriate predictions priate habitual actions, and to effect motivationally and therefore also in choosing appropriate actions. appropriate responding. Acetylcholine and norepi- Dopamine also appears to control the vigor or the nephrine seem to report on forms of uncertainty to force of behavioral responses. For instance, the overt control the allocation of learning among different effects of dopaminergic agonists such as amphet- possible predictors, the overall speed of learning, amine are hyperactivity; antagonists produce lethargy and the way that information from sensation is and affect the propensity to choose actions that have integrated with information based on past experience high costs. These effects also appear to be immediate, in an environment. in that learning is not required for them to be exhib- Accounts adapted from statistics, operations ited. The reinforcement learning theories so far research, computer science, and engineering offer a described do not encompass such phenomena, since normative underpinning for both the behavioral data they are designed to address choice among a discrete and at least some key aspects of the neural substrates. set of candidate actions and lack any analogue of a See also: Acetylcholine Neurotransmission in CNS; Basal response vigor. A recent extension of the standard Ganglia: Acetylcholine Interactions and Behavior; Basal reinforcement learning picture accommodates free- Ganglia: Habit; Conditioning: Theories; Delayed operant tasks, in which subjects choose the latency Reinforcement: Neuroscience; Dopamine; Dopamine – of each of their responses, balancing the costs and CNS Pathways and Neurophysiology; Neuroendocrine 462 Learning, Action, Inference and Neuromodulation

Control of Energy Balance (Central Circuits/ Dayan P, Kakade S, and Montague PR (2000) Learning and selec- Mechanisms); Neuromodulation; Neutrotransmission and tive attention. Nature Neuroscience 3: 1218–1223. Neuromodulation: Acetylcholine; Operant Conditioning of Deakin JFW and Graeff FG (1991) 5-HT and mechanisms of Reflexes; Prediction Errors in Neural Processing: defence. Journal of Psychopharmacology 5: 305–316. Imaging in Humans; Procedural Learning: Classical Dickinson A (1980) Contemporary Animal Learning Theory. Cambridge, UK: Cambridge University Press. Conditioning; Serotonin (5-Hydroxytryptamine; 5-HT): Doya K (2000) Meta-learning, neuromodulation and emotion. In: CNS Pathways and Neurophysiology; Sleep and Sleep Hatano G, Okada N, and Tanabe H (eds.) Affective Minds, pp. States: Cytokines and Neuromodulation. 101–104. Amsterdam: Elsevier Science. Holland PC and Gallagher M (1999) Amygdala circuitry in atten- tional and representational processes. Trends in Cognitive Further Reading Sciences 3: 65–73. Pearce JM and Hall G (1980) A model for Pavlovian learning: Aston-Jones G and Cohen JD (2005) An integrative theory of locus Variation in the effectiveness of conditioned but not uncondi- coeruleus-norepinephrine function: Adaptive gain and optimal tioned stimuli. Psychological Review 87: 532–552. performance. Annual Reviews of Neuroscience 28: 403–450. Schultz W, Dayan P, and Montague PR (1997) A neural substrate of Braver TS, Barch DM, and Cohen JD (1999) Cognition and prediction and reward. Science 275: 1593–1599. control in schizophrenia: A computational model of dopamine Sutton RS and Barto AG (1998) Reinforcement Learning. and prefrontal function. Biological Psychiatry 46: 312–328. Cambridge, MA: MIT Press. Daw ND, Kakade S, and Dayan P (2002) Opponent interactions bet- Yu AJ and Dayan P (2005) Uncertainty, neuromodulation, and ween serotonin and dopamine. Neural Networks 15: 603–616. attention. Neuron 46: 481–492.